Claude's Corner: Mendral, The AI DevOps Engineer That Fixes Your CI So You Don't Have To

Mendral, built by the Docker and Dagger founders, is an always-on AI DevOps engineer that diagnoses CI failures, fixes flaky tests, and ships PRs autonomously. We break down the observe-diagnose-act-learn loop, rate the replicability at 58/100, and show you how to build a clone.

May 20 at 11:11 AM · 8 min read

TL;DR

Mendral is an always-on AI DevOps engineer built by the Docker and Dagger founders that diagnoses CI failures, fixes flaky tests, and ships autonomous PRs. Its 73% failure match rate across millions of CI jobs is the real moat: a proprietary data flywheel that compounds with every pipeline it instruments. Replicability score: 58/100.

Build difficulty: 6.6

Mendral: The Docker Founders Come Back to Fix the Problem They Helped Create

Here is a fun piece of irony: Sam Alba and Andrea Luzzardi helped build Docker, the tool that made it trivially easy to ship software inside containers. That ease of shipping created a new problem: massive, sprawling CI/CD pipelines that nobody wants to babysit. Now they are building Mendral to babysit those pipelines so you do not have to.

That is not a knock. It is actually a great business instinct. The people who built the on-ramp understand the highway better than anyone. And CI/CD debt is one of the most universally despised parts of modern software development: a graveyard of flaky tests, broken builds, and 3am Slack alerts that every engineering team tolerates because fixing it properly never makes it onto the sprint.

Mendral's pitch is simple: stop tolerating it. Let an AI agent handle it instead.

What They Are Building

Mendral is an always-on AI DevOps engineer. Three specialized agents run continuously against your pipelines:

Security: Reviews dependency PRs, pins safe versions, surfaces CVEs that are actually reachable from your code (not just theoretically present)
Reliability: Diagnoses CI failures, identifies flaky tests, ships fix PRs autonomously
Performance: Reduces build time through caching strategies, parallelism tuning, and slow test identification

Beyond those three core modes, you can wire in custom automations triggered by Datadog alerts, Sentry errors, deployment events, or webhooks. The system connects to GitHub, your CI runtime, Sentry, Datadog, GCP, and Slack. It is not a dashboard. It does not give you more things to look at. It acts.

The loop is the key insight: Observe, Diagnose, Act, Learn. When a signal arrives (a failed check, a broken deploy, a dependency change), Mendral pulls context: logs, traces, commits, cloud state, previous fix attempts, and repo conventions. It then produces one of three outputs: a PR with a fix, a code review comment, or a structured explanation of why it is leaving the code alone and what you should do instead. Accepted fixes and rejected approaches both feed back into the system, so it gets smarter with every build.
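To make the loop concrete, here is a minimal Python sketch of one pass through Observe, Diagnose, Act, Learn. Everything in it is an assumption for illustration: the `Signal`, `Output`, and `Agent` names, the string labels, and the toy timeout rule are hypothetical and say nothing about how Mendral is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    kind: str   # hypothetical label, e.g. "failed_check"
    logs: str


@dataclass
class Output:
    kind: str   # "pr", "review_comment", or "explanation"
    body: str


@dataclass
class Agent:
    # (signal kind, output kind, accepted?) tuples remembered from earlier runs
    memory: list = field(default_factory=list)

    def observe(self, signal: Signal) -> dict:
        """Gather context: the signal plus earlier attempts for the same kind."""
        prior = [m for m in self.memory if m[0] == signal.kind]
        return {"signal": signal, "prior": prior}

    def diagnose(self, context: dict) -> str:
        """Placeholder diagnosis; a real system would inspect logs and diffs."""
        logs = context["signal"].logs.lower()
        return "timeout" if "timed out" in logs else "unknown"

    def act(self, diagnosis: str) -> Output:
        """Produce a PR for a recognized cause, otherwise an explanation."""
        if diagnosis == "timeout":
            return Output("pr", "Raise the test timeout and add a retry")
        return Output("explanation", "Cause unclear; manual review suggested")

    def learn(self, signal: Signal, output: Output, accepted: bool) -> None:
        """Record the outcome so later observe() calls can see it."""
        self.memory.append((signal.kind, output.kind, accepted))

    def handle(self, signal: Signal, accepted: bool) -> Output:
        """Run one pass of the loop; `accepted` stands in for human review."""
        context = self.observe(signal)
        output = self.act(self.diagnose(context))
        self.learn(signal, output, accepted)
        return output
```

Calling `Agent().handle(Signal("failed_check", "step timed out after 600s"), accepted=True)` returns a `pr` output and leaves one entry in `memory`, which the next `observe()` call for the same signal kind will pick up as a prior attempt.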

PostHog is accepting 104 Mendral-generated fixes per month. Metabase is running 240,000 CI runs weekly through the system. Those are not pilot numbers.

The Founders Know This Problem Personally

Sam Alba wrote some of Docker's first lines of code. Andrea Luzzardi came from Google and Microsoft before he and Alba co-founded Dagger, a CI/CD engine that became popular precisely because it made pipelines more composable and less fragile. They have spent a decade building infrastructure that millions of developers depend on.

That background matters for two reasons. First, they have deep intuitions about where pipelines fail and why, not from reading blog posts, but from being the people whose code runs inside those pipelines. Second, enterprise engineering organizations trust them. When you are asking a system to automatically push commits to your main branch, founder credibility is not a soft metric. It is a sales prerequisite.

They are two people right now, and already SOC 2 Type II certified. That combination (tiny team, enterprise-grade compliance) tells you they have done this before.

How It Works Under the Hood

The architecture is event-driven from the ground up. Webhooks from GitHub, CI systems, and monitoring tools land in an event bus. Each event class routes to the appropriate agent: a flaky test failure goes to Reliability, a new dependency PR goes to Security, a build time regression goes to Performance.
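The routing described above can be sketched as a lookup table in front of per-agent queues. This is an illustrative Python sketch only: the event type strings, the `ROUTES` table, and the `EventBus` class are hypothetical names, and a production bus would be a durable queue rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical event types mapped to the three agents named in the article.
ROUTES = {
    "flaky_test_failure": "reliability",
    "dependency_pr": "security",
    "build_time_regression": "performance",
}


class EventBus:
    """In-memory stand-in for an event bus with one queue per agent."""

    def __init__(self) -> None:
        self.queues = defaultdict(list)

    def publish(self, event: dict) -> str:
        """Route an event by its type and return the name of the target queue."""
        agent = ROUTES.get(event["type"], "unrouted")
        self.queues[agent].append(event)
        return agent
```

Unknown event types land in an `unrouted` queue instead of raising, which keeps a new webhook source from breaking ingestion while it waits for a route.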

The critical intelligence layer is the failure-matching system. Mendral claims a 73% match rate for CI failures against a corpus of known issues accumulated across millions of jobs. That number is the moat. It is not magic; it is a vector similarity search over a proprietary dataset of failure signatures, logs, and their resolutions. The more pipelines they instrument, the better the matching gets. Classic data flywheel.
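A toy version of similarity-based failure matching fits in a few lines of Python. The sketch below is an assumption-heavy illustration: it swaps in bag-of-words token counts for whatever embedding a real system would use, the two-entry `CORPUS` and the `0.5` threshold are invented, and `embed`, `cosine`, and `match` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def embed(log: str) -> Counter:
    """Toy embedding: token counts. A real system would use a learned model."""
    return Counter(re.findall(r"[a-z_]+", log.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented (failure signature, resolution) pairs standing in for the corpus.
CORPUS = [
    ("connection refused while pulling image from registry",
     "retry image pull with backoff"),
    ("test timed out waiting for database container",
     "add readiness probe before tests"),
]


def match(log: str, threshold: float = 0.5):
    """Return (resolution, score) for the closest known failure, or None."""
    query = embed(log)
    scored = [(cosine(query, embed(sig)), fix) for sig, fix in CORPUS]
    score, fix = max(scored)
    return (fix, score) if score >= threshold else None
```

A log line such as `"error: connection refused pulling image from registry"` matches the first entry, while an unrelated failure falls below the threshold and returns `None`, which is the point where a fallback diagnosis path would take over.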

For novel failures (the 27% that do not match), the system falls back to LLM-based diagnosis: structured prompting with the full context window stuffed with logs, relevant commit diffs, similar historical failures, and repo conventions learned from prior runs. The output is either an autonomous fix (if confidence is high and the change is low-risk) or a structured recommendation with the evidence laid out clearly.
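The act-or-recommend decision at the end of that paragraph can be sketched as a simple gate. This Python sketch is an assumption throughout: the `Diagnosis` fields, the risk heuristic (few files, no production config), and the `0.9` confidence bar are hypothetical, chosen only to show the shape of "high confidence and low risk".

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    confidence: float          # 0.0 to 1.0, however the system estimates it
    files_changed: int         # size of the proposed fix
    touches_prod_config: bool  # hypothetical risk flag


def is_low_risk(d: Diagnosis, max_files: int = 3) -> bool:
    """Hypothetical risk heuristic: a small diff that avoids production config."""
    return d.files_changed <= max_files and not d.touches_prod_config


def decide(d: Diagnosis, min_confidence: float = 0.9) -> str:
    """Act autonomously only when confidence is high AND the change is low-risk."""
    if d.confidence >= min_confidence and is_low_risk(d):
        return "open_fix_pr"
    return "post_recommendation"
```

Both conditions must hold, so a confident diagnosis that touches production config still yields a recommendation instead of a PR.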

The learn phase is reinforcement feedback: accepted PRs increase confidence in that fix pattern; rejected or reverted changes decrease it. Comments on PRs become training signal. Custom playbooks you define get incorporated as hard constraints.
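One simple way to model that feedback is a per-pattern acceptance rate, with playbooks acting as a hard veto. The Python sketch below is a stand-in, not the described mechanism: it uses a Laplace-smoothed acceptance ratio in place of any real reinforcement signal, and the `FixPatternStats` class, the pattern names, and the `forbidden` set are all hypothetical.

```python
from collections import defaultdict


class FixPatternStats:
    """Tracks accepted and rejected fixes per pattern (illustrative only)."""

    def __init__(self, forbidden=()) -> None:
        self.accepted = defaultdict(int)
        self.rejected = defaultdict(int)
        # Patterns a playbook rules out entirely: a hard constraint.
        self.forbidden = set(forbidden)

    def record(self, pattern: str, accepted: bool) -> None:
        """Count one review outcome (merged vs. rejected or reverted)."""
        if accepted:
            self.accepted[pattern] += 1
        else:
            self.rejected[pattern] += 1

    def confidence(self, pattern: str) -> float:
        """Laplace-smoothed acceptance rate; forbidden patterns score zero."""
        if pattern in self.forbidden:
            return 0.0
        a, r = self.accepted[pattern], self.rejected[pattern]
        return (a + 1) / (a + r + 2)
```

An unseen pattern starts at 0.5, rises with each accepted PR, and falls with each rejection, while a pattern listed in `forbidden` stays at zero no matter how often it has been accepted.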

Integrations talk to the GitHub API for PR creation and code review, CI APIs for log retrieval and job control, cloud APIs for infrastructure state, and observability tools for runtime signals. Everything is write-minimal by default: the system would rather recommend than act until it is confident enough to act.

The Business Model

SaaS, B2B, enterprise-adjacent. Pricing is not public. The promise is five-minute onboarding, with a first fix typically within minutes. That is a classic PLG (product-led growth) motion: get the team lead hooked with a fast win, then expand to the whole org as the fix volume compounds.

The expansion mechanic writes itself: every accepted fix proves ROI, every accepted fix improves future fix quality, and every improvement in fix quality generates more accepted fixes. The cost of switching once you have six months of repo-specific training data baked in is non-trivial.

Difficulty Score

Dimension	Score	Why
ML / AI	7 / 10	LLM orchestration is table stakes; the hard part is the failure-matching corpus and reinforcement feedback loop
Data	8 / 10	The 73% match rate requires millions of real CI failures; you cannot cold-start this
Backend	7 / 10	Event-driven multi-tenant pipeline with write access to customer repos; correctness requirements are brutal
Frontend	3 / 10	Dashboard and PR views; this is not their differentiator
DevOps	8 / 10	Integrating with every CI flavor, maintaining SOC 2 Type II, and managing write access to prod code are all genuinely hard
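
The 6.6 build difficulty figure quoted near the top is consistent with a plain unweighted mean of the five scores in this table. That the headline number is computed this way is an assumption, but the arithmetic lines up, as this short Python check shows:

```python
# Scores copied from the difficulty table above.
scores = {"ML / AI": 7, "Data": 8, "Backend": 7, "Frontend": 3, "DevOps": 8}

# Unweighted mean: (7 + 8 + 7 + 3 + 8) / 5 = 33 / 5
build_difficulty = sum(scores.values()) / len(scores)
print(round(build_difficulty, 1))
```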

The Moat: What Is Hard to Replicate

The failure corpus. That 73% match rate did not come from a weekend. It came from instrumenting real pipelines and accumulating a proprietary dataset of CI failure signatures, root causes, and validated fixes. Anyone building a competitor starts at 0% and needs production traffic to improve. Mendral already has the data flywheel spinning.

Repo-specific memory. Each customer's Mendral deployment learns the conventions, anti-patterns, and playbooks of that specific codebase. After six months of accepted fixes, that is not just a product; it is a trained model that knows your repo better than most engineers on the team. That is genuine switching cost.

Founder trust. The Docker provenance is not just marketing. Enterprise organizations giving a system write access to their codebases need to trust the operators at a personal level. Sam and Andrea have that trust with a meaningful slice of the infrastructure engineering world.

The easy stuff. The three-agent structure (Security, Reliability, Performance), the webhook event routing, and the LLM diagnosis layer: none of that is hard. A competent team could reconstruct it in months. GitHub and the major CI systems expose well-documented APIs. The observe-diagnose-act loop is architecturally straightforward.

What you cannot replicate in months is the training data and the customer trust that got you the training data. That is the actual barrier.

Replicability Score: 58 / 100

The skeleton of Mendral is replicable by any senior engineering team in 6-9 months. LLM agents with tool use, webhook ingestion, and GitHub API integration for PR creation: none of that is exotic in 2026. The 42 points of irreplicability live in the failure-matching corpus (years of real data), the repo-specific memory (months per customer), and the founder credibility (not transferable at all). If you are a well-funded team willing to compete on ground-up data collection, you are looking at a 2-3 year runway before you match Mendral's match rate. By then they will be at 85%.

What to Watch

The obvious question is whether the big CI platforms (GitHub Actions, GitLab CI, CircleCI) add this natively. GitHub Copilot already does code suggestions; PR-level CI debugging is a natural extension. If Microsoft decides to ship Copilot for CI, it lands in front of every GitHub user immediately, with all the commit history and PR data already there.

Mendral's defense against that: they are platform-agnostic. They work across GitHub, GitLab, Bitbucket, whatever CI you have, and the rest of the stack: Datadog, Sentry, GCP. GitHub-native tooling will always be biased toward GitHub Actions. Enterprise orgs with heterogeneous stacks will want the platform-neutral option.

The other variable is agentic coding in general. As tools like Claude Code and Cursor get better at code-level tasks, the line between AI DevOps engineer and AI engineer who also fixes CI blurs. Mendral's bet is that CI/CD is deep enough and specialized enough to warrant dedicated agents with dedicated training data. That bet looks reasonable today. It is worth reassessing in 18 months.

For now: PostHog accepting 104 machine-generated fixes per month without sinking the company is a real signal. This is not a demo. It is infrastructure.