Can coding agents really produce production-quality enterprise software?

Yes — but not unattended. In our production monorepo, 92% of May 2026's development pull requests show verifiable agent involvement, shipped under gates that got stricter over the same period: risk tiers, mandatory reviews, protected paths, and a human making every merge decision. The rules are what turn agent speed into production quality.

What is an agent harness?

An agent harness is the scaffolding a coding agent works inside: a machine-readable policy file (risk tiers, protected paths, architectural boundaries), a staged pipeline of specialized agents (tag, triage, plan, implement, review), bounded correction loops, and mandatory context loading before any code is written. harnext is QualityUnit's open-source, provider-agnostic implementation.

How much of your code is actually written by AI?

Measured from the repository itself: 92% of development PRs merged in May 2026 carry hard agent signals (attribution footers, pipeline labels, commit trailers, or the pipeline's own account as author). Every number is a floor, because attribution is routinely stripped. A manual audit of every unmarked 2026 PR found that only about 11% of April–June development merges are plausibly fully hand-written.

Do humans still review the code?

Yes. Every merged change passed a human review and a human merge decision. The pipeline's job is not to remove the human but to resolve routine quality issues before a human looks, so that the review can concentrate on architecture and domain judgment.

How were these adoption numbers verified?

Three independent ways: PR metadata for all 1,409 merged PRs across ten months, commit-level analysis of 5,000+ commits for co-author trailers and agent emails, and a manual forensic inspection of every unmarked 2026 PR. We then set skeptical auditors on the three weakest months, PR by PR: one number went up, one went down, and one was confirmed exactly. All corrections are reflected in the published chart.

Developing a Full-Fledged Enterprise Application with the harnext Coding Agent

Ten months, 1,409 merged PRs, three forensic audits: how a staged agent pipeline took one enterprise codebase from 12% to 92% agent-involved development — with rules, gates, and a human holding every merge.

“AI writes most of our code” sounds like a startup slogan. Can it be real for an enterprise application — live customers, live billing, a monorepo where a bad merge costs money? At QualityUnit it is. Here is the ten-month trail of evidence, and the rules that make it work.

TL;DR: In ten months, agent-authored work went from the first experimental PRs to 133 of 144 development PRs merged in May (92%), verified by a three-way forensic audit of all 1,409 merged PRs, down to commit trailers and a manual inspection of every unmarked 2026 PR. It didn’t happen by “letting the AI code”. It happened by adding rules: a risk-tier harness config, a staged agent pipeline with bounded review loops, protected paths, and a human holding every merge. The rules are the product. And with a context engine feeding the agents, the same work now costs ~30% less per task (measured in the context engine study).

What it actually takes

Not a tool. A pipeline, a policy file, and a gate, all run by harnext.

The pipeline: staged agents, one human

The harness is harnext — QualityUnit’s open-source, provider-agnostic coding-agent harness. In our production monorepo, every issue that enters the pipeline runs the same gauntlet of CI-triggered agent stages, its progress tracked through labels a human can read at a glance:

The production pipeline: tagger, triage, plan, implement, review with a bounded review-fix loop, an independent code-review agent, the human merge — plus doc-gardening keeping per-folder docs in sync after the merge

Two details matter more than the stage count. The loop is bounded: defects found in review go back to the implementation stage a limited number of times — agents converge or escalate to a human, they don’t thrash. Nothing starts blind: before writing a line, the implementing agent must load the project’s conventions and emit a confirmation block reviewers can check.
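The bounded loop can be sketched in a few lines. This is a minimal illustration, not harnext's actual implementation: the function names, the `LoopResult` type, and the cap of three rounds are assumptions made for the example (the article only says the number of retries is limited).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical cap: the article states only that the retry count is limited.
MAX_REVIEW_ROUNDS = 3


@dataclass
class LoopResult:
    status: str               # "approved" or "escalated"
    iterations: int           # number of review rounds that ran
    open_defects: List[str]   # defects from the last review, if any


def run_review_fix_loop(
    review: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    change: str,
    max_rounds: int = MAX_REVIEW_ROUNDS,
) -> LoopResult:
    """Review a change, send defects back for fixing, and stop at a hard cap."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    defects: List[str] = []
    for round_no in range(1, max_rounds + 1):
        defects = review(change)
        if not defects:
            return LoopResult("approved", round_no, [])
        if round_no < max_rounds:
            # Rounds remain: hand the defects back to the implementation stage.
            change = fix(change, defects)
    # Cap reached with defects still open: escalate to a human instead of retrying.
    return LoopResult("escalated", max_rounds, defects)


# Toy stand-ins for the review and implementation agents (illustration only).
def toy_review(change: str) -> List[str]:
    return ["leftover TODO"] if "TODO" in change else []


def toy_fix(change: str, defects: List[str]) -> str:
    return change.replace("TODO", "", 1)  # removes one TODO per round
```

With the toy agents, a change containing one `TODO` is approved on the second review round, while a change containing four of them hits the cap and is escalated with the last review's defects attached.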

The policy file

The other half is a machine-readable policy: every path in the repo classified into risk tiers, each tier with enforceable gates. CI reads it; merge policy reads it; agents are briefed on it. It’s not advice:

What a high-risk change must clear: required checks, two approvals, mandatory review agent, no self-merge, protected paths, architecture boundaries, screenshot evidence — and a mandatory context confirmation

Protected paths — migrations, payments, auth — are files no agent may touch. Architectural boundaries are enforced, not suggested. Take these rules away and a coding agent is a very fast generator of plausible-looking liabilities.
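To make the idea of a machine-readable policy concrete, here is a small sketch of how a CI step might classify a change. It does not reproduce harnext's real policy format: the glob patterns, tier numbers for low and medium risk, the default tier, and all names below are hypothetical. Only the Tier-3 gates (two approvals, mandatory review agent, no self-merge) and the protected categories (migrations, payments, auth) come from the text above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Gates:
    approvals: int
    review_agent_required: bool
    self_merge_allowed: bool


@dataclass(frozen=True)
class Decision:
    tier: int
    gates: Gates
    blocked_paths: Tuple[str, ...]  # protected paths touched by an agent


# Hypothetical policy data; a real setup would load this from the policy file.
TIER_GATES: Dict[int, Gates] = {
    1: Gates(approvals=1, review_agent_required=False, self_merge_allowed=True),
    2: Gates(approvals=1, review_agent_required=True, self_merge_allowed=False),
    3: Gates(approvals=2, review_agent_required=True, self_merge_allowed=False),
}
TIER_RULES: List[Tuple[str, int]] = [("billing/*", 3), ("api/*", 2), ("docs/*", 1)]
PROTECTED_PATTERNS: List[str] = ["migrations/*", "payments/*", "auth/*"]
DEFAULT_TIER = 2  # assumption: unclassified paths get the middle tier


def is_protected(path: str) -> bool:
    return any(fnmatchcase(path, pattern) for pattern in PROTECTED_PATTERNS)


def tier_for(path: str) -> int:
    if is_protected(path):
        return 3  # assumption: protected paths are always treated as highest risk
    for pattern, tier in TIER_RULES:
        if fnmatchcase(path, pattern):
            return tier
    return DEFAULT_TIER


def evaluate_change(paths: Sequence[str], author_is_agent: bool) -> Decision:
    """The strictest tier among the touched paths decides the gates."""
    tier = max((tier_for(p) for p in paths), default=DEFAULT_TIER)
    blocked = tuple(p for p in paths if author_is_agent and is_protected(p))
    return Decision(tier, TIER_GATES[tier], blocked)
```

The design choice worth noting is that the strictest tier wins: a change that touches one documentation file and one billing file is gated as a billing change. A non-empty `blocked_paths` would fail the check outright for an agent-authored change.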

Ten months, one chart

The adoption trail, measured from the repository itself.

Development pull requests merged per month, July 2025 to June 2026 — dark teal ran the full agent pipeline end-to-end, light teal is a developer pairing with the agent directly, gray is unmarked. The percentage is total agent involvement, reaching 92% in May 2026

The chart counts, for every month, how many merged development PRs carry any hard agent signal — the coding agent’s footer, the pipeline’s labels, the harness tier convention, commit co-author trailers, agent commit emails, or the pipeline’s own account as author. Dependency-bot PRs (about 8% of all merges) are excluded from the chart entirely — they’re neither human nor coding-agent work. We audited the signals three independent ways: PR metadata for all 1,409 merges, commit-level trailers across 5,000+ commits, and a manual forensic pass over every single unmarked PR of 2026. Three readings matter:

Enthusiasm fades; infrastructure sticks. The 2025 era was ad-hoc, personal adoption — and it oscillated exactly like personal habits do: 44% one month, barely 4% in November when the heaviest users paused. The harness changed the shape of the curve: within a month of the risk tiers arriving, the measured share jumped to 89%; with the full pipeline it reached 92% and stayed there. Each layer of rules increased adoption more than any individual’s enthusiasm ever did. The two shades tell the same story inside the agent share: the light band is developers pairing with the agent by hand; the dark band — work that ran the full pipeline from issue to reviewed PR — appears only when the harness lands, and by May it carries the majority of the agent work.

We inspected the remainder, PR by PR. For April–June 2026, the PRs without any marker decompose into dependency-bot automation, agent work whose only attribution survived in commit trailers, and a residue of plausibly hand-written changes: about 11% of non-automation merges. So the honest sentence is: ~89% of real development merges in the last quarter show verifiable agent involvement, and even that is a floor, since editor-level AI assistance leaves no trace at all. We also set skeptical auditors on the three weakest months, PR by PR: November’s count rose from 1 to 3 proven (plus 3 suspected on style), January’s fell from 10 to 8 after the audit caught two false positives, and December was confirmed exactly. December came with one twist: by code volume, its eight marked PRs delivered 39% of that month’s inserted lines. The agent was already writing the big features; the count just couldn’t see it. Adoption also isn’t uniform: some developers run near-100% agent-assisted and a couple still mostly hand-write, but the pipeline carries a growing share either way.

Quality didn’t move backwards. The same window shipped Tier-3 changes — LLM-provider integration, payment-adjacent work, an i18n expansion — under gates that got stricter over the period, not looser. And when we measured agent review consistency directly, 21 of 22 independent review agents reached the same verdict on the same PR.
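The counting rule behind the chart can be sketched as follows. This is an illustration under stated assumptions, not the audit script itself: the marker strings, account names, and email suffix are hypothetical placeholders, the harness tier convention is not modeled, and the mapping of signals to the "full pipeline" and "pairing" bands is a guess at how the chart separates them.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Optional

# Hypothetical markers; substitute whatever your own agents and bots emit.
DEPENDENCY_BOTS = {"dependabot[bot]", "renovate[bot]"}
PIPELINE_ACCOUNT = "pipeline-bot"
PIPELINE_LABEL_PREFIX = "pipeline:"
AGENT_FOOTER = "Generated with harnext"
AGENT_NAME_MARKERS = ("harnext",)
AGENT_EMAIL_SUFFIX = "@agents.example.com"


@dataclass
class MergedPR:
    author: str
    body: str = ""
    labels: List[str] = field(default_factory=list)
    commit_messages: List[str] = field(default_factory=list)
    commit_emails: List[str] = field(default_factory=list)


def has_agent_trailer(message: str) -> bool:
    """True if a Co-authored-by trailer names one of the agent markers."""
    for line in message.splitlines():
        lower = line.strip().lower()
        if lower.startswith("co-authored-by:") and any(
            name in lower for name in AGENT_NAME_MARKERS
        ):
            return True
    return False


def classify(pr: MergedPR) -> str:
    if pr.author in DEPENDENCY_BOTS:
        return "excluded"  # neither human nor coding-agent work
    if pr.author == PIPELINE_ACCOUNT or any(
        label.startswith(PIPELINE_LABEL_PREFIX) for label in pr.labels
    ):
        return "pipeline"
    if (
        AGENT_FOOTER in pr.body
        or any(has_agent_trailer(m) for m in pr.commit_messages)
        or any(e.endswith(AGENT_EMAIL_SUFFIX) for e in pr.commit_emails)
    ):
        return "paired"
    return "unmarked"


def agent_share(prs: Iterable[MergedPR]) -> Optional[float]:
    """Agent-involved PRs divided by development PRs (bots excluded)."""
    counts = Counter(classify(pr) for pr in prs)
    development = counts["pipeline"] + counts["paired"] + counts["unmarked"]
    if development == 0:
        return None
    return (counts["pipeline"] + counts["paired"]) / development
```

Applied to the May figures quoted earlier, the same ratio is 133 / 144, which rounds to 92%. Because the classifier only sees markers that survived, any result it produces is a floor, as the text notes.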

So who’s the author?

The best articulation of where this leaves the human comes from an engineering thesis that studied harness-driven development on an aviation-grade project:

By the time a change reached the human author, the routine quality issues had been resolved — the author’s review concentrated on architectural and domain-level decisions. The merge was the author’s decision. Authorship of the merged code rests with the human author, regardless of which actor produced the initial draft.

— Štefan Moravík, Design and Implementation of a Drone Mission Planning Module for Airport Lighting Inspection (thesis, 2026)

That’s the deal in production too: agents do the drafting and the routine quality work; the human does architecture, domain judgment, and owns the merge.

The collaboration nobody planned

The least expected change wasn’t speed — it was what happened between developers. Everything runs on GitHub: agent PRs and human PRs sit in the same queue, carry the same labels, clear the same gates, and we review each other’s work — and the agents’ — in one flow. There is no separate “AI process” to context-switch into.

And it changed what happens to small ideas. Normally, when you stumble on a bug or think of an improvement mid-task, you write it on some list and never get back to it. Now you tell the agent, mid-flow: “file an issue for this.” The issue is created, tagged, triaged, and picked up by the pipeline while you finish what you were doing — and by the time you’re done, the fix is often already implemented, sitting in a PR, waiting for your review. The backlog of “I’ll get to it someday” quietly became a queue of things waiting for approval.

And it keeps getting cheaper

The newest layer is the context engine. Instead of every agent re-reading every convention document on every task, harnext’s meaninggrid serves a versioned, cited digest of the project’s rules — and our measured A/B test found the digest-plus-policy-file configuration was both the cheapest and the most thorough reviewer setup (−32% cost vs. reading the docs). The harness made agentic development trustworthy; the context engine is making it cheap.

The rules are the product

What to copy if you want this in your company.

What you can do on Monday

Write the rules down as config, not culture. Risk tiers, protected paths, architectural boundaries — one machine-readable file that CI enforces. A wizard can propose it; a human must review it.
Stage the pipeline. Tag → triage → plan → implement → review, each as a CI workflow with visible state. Don’t hand an agent a raw issue and hope.
Bound the review loop. A review-fix cycle with a hard iteration cap, so agents converge instead of thrashing.
Gate the context. No agent writes code before loading the project’s conventions and proving it — a confirmation block reviewers can check.
Keep the merge human. The pipeline’s job is to make human review architectural, not to remove it.
Then cut the context bill. Serve the rules through a context engine instead of raw file reads. Measured: −32% per task with better results.
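Two of the steps above, staging the pipeline and gating the context, can be sketched together as a label-driven state machine. The stage order comes from the list; everything else is an assumption for illustration, including the label names, the confirmation-block header, and the required document names. It is not how harnext wires its CI workflows.

```python
from typing import List, Optional

STAGES = ["tag", "triage", "plan", "implement", "review"]  # order from the article

# Hypothetical conventions for this sketch.
LABEL_PREFIX = "stage:"
BLOCKED_LABEL = "stage:blocked-context"
HUMAN_MERGE_LABEL = "ready-for-human-merge"
CONFIRMATION_HEADER = "## Context confirmation"
REQUIRED_DOCS = ["CONVENTIONS.md", "ARCHITECTURE.md"]


def next_stage(current: Optional[str]) -> Optional[str]:
    """Return the stage after `current`, or None once review is done."""
    if current is None:
        return STAGES[0]
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


def missing_context(agent_output: str) -> List[str]:
    """List required docs that the confirmation block does not mention."""
    if CONFIRMATION_HEADER not in agent_output:
        return list(REQUIRED_DOCS)
    block = agent_output.split(CONFIRMATION_HEADER, 1)[1]
    return [doc for doc in REQUIRED_DOCS if doc not in block]


def advance(current: Optional[str], agent_output: str = "") -> str:
    """Compute the label an issue should carry after the current stage ends."""
    # Context gate: implementation output without proof of loaded conventions
    # never reaches review.
    if current == "implement" and missing_context(agent_output):
        return BLOCKED_LABEL
    upcoming = next_stage(current)
    if upcoming is None:
        return HUMAN_MERGE_LABEL  # the pipeline stops here; a human decides
    return LABEL_PREFIX + upcoming
```

The terminal state is deliberately a label rather than a merge: the sketch has no code path that merges anything, which mirrors the rule that the merge stays human.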

An open invitation

If you’re asking whether this can be real in your company, we’d genuinely like to compare notes. The harness is open source at harnext.dev, and the FlowHunt team helps companies set up exactly these pipelines. Skeptics especially welcome.
