What a Context Engine Actually Buys You: We Ran 22 AI Code Reviewers Five Different Ways

AI Agents Context Engineering MCP Agentic Workflows

We gave the same code-review task to 22 AI agents. Same pull request, same pinned commit, same prompt, same model — the only variable was how each agent loaded the project’s rules. The cheapest configuration turned out to be the most thorough one, and the reason why says something general about context engineering.

TL;DR: A context-engine digest plus one direct read of the machine-readable policy file beat every other strategy: $1.56 per review and 9.6/13 verified findings — cheaper than reading the docs ($2.30, 8.6/13) and better than the digest alone ($1.77, 7.8/13). Reading everything scored worst of all (7.4/13). All 22 agents ran on Claude Opus 4.8, and 21 of 22 reached the same verdict.

What: a harness, a context engine, and one pull request

What’s a “harness”?

Every serious attempt to let AI agents work in a production repository grows two layers of governance.

The prose layer — conventions, architecture rules, testing standards. In our repo that’s CLAUDE.md and docs/**: “backend is snake_case,” “domain never imports infrastructure,” “all route handlers are async.” Humans read it; agents are told to read it too.

The machine-readable layer — the harness config. Ours is a single JSON file that classifies every path in the repo into risk tiers and attaches enforceable gates to each tier. CI reads it. Merge policy reads it. It’s not advice — it’s policy:

"tier3": {
  "requiredChecks": [
    "lint", "test", "build", "review-agent",
    "harness-smoke", "manual-approval", "expanded-coverage"
  ],
  "mergePolicy": {
    "minApprovals": 2,
    "requireReviewAgent": true,
    "allowSelfMerge": false
  }
}

(Terminology note: “harness” also names the agent runtime itself — the scaffolding of tools, skills, and MCP servers an agent acts through, as in harnext , “the coding agent harness.” In this post, the harness config is the repo’s policy file that such a runtime and the CI both enforce.)

A code reviewer — human or agent — can’t judge “is this PR allowed to merge?” without this file. A Tier-3 PR with the review-agent check skipped is a policy violation even if every test is green. Keep that example in mind; it decides the experiment.

Because both layers exist, the repo mandates a gate: no agent starts work before loading this context — and proving it did, via a confirmation block that reviewers check. The question this post answers is simply: what is the cheapest correct way to satisfy that gate?

Meet harnext and meaninggrid

meaninggrid is the hosted Context Engine from harnext , QualityUnit’s MIT-licensed, provider-agnostic coding-agent harness (six tools — read, write, edit, bash, skill, mcp — npm i -g harnext). The vendor’s pitch for the Context Engine is blunt: “the brain of your agent.” Sources stream into a continuously updated index — “the grid” — and per query the engine “ranks and prunes it into token-efficient context, wired straight into the harness”: continuous index, relevance ranking, dedup and cache. harnext’s headline number is −89% tokens per query on average. That is the vendor’s claim; one purpose of this experiment was to measure, with our own numbers on a real task, what that kind of compression actually saves — and what it costs.

In our deployment the grid ingests the repository’s prose documentation; each ingest produces an immutable, versioned snapshot. Agents query it over MCP (meaninggrid.harnext.dev/mcp) with a single context_research call and receive a synthesized, cited digest stamped with the snapshot_id, which the agent must cite in its confirmation block — auditable context made concrete.

Context engine pipeline: prose docs flow through ingest, versioned snapshot and context_research into the agent, while the machine-readable policy file is read directly

What the gate produces — the confirmation block (example; project specifics elided):

Loaded via: optimized hybrid (context-engine digest + policy file only).

- context_research call #1 (conventions / layering / testing / security /
  risk tiers) -> snapshot_id 9483af61cf8a40a2a0d790c7047fcf08
- context_research call #2 (LLM-provider integration checklist +
  flow-engine extra-care rules) -> snapshot_id 9483af61cf8a40a2a0d790c7047fcf08
- Read harness config (full) from disk for exact tier patterns,
  requiredChecks, mergePolicy, evidenceConfig.
  Did NOT read CLAUDE.md or docs/* (covered by the digest).

The snapshot_id is real — a reviewer can verify exactly which version of the rules the agent worked from.

Three hypotheses

The experiment was designed to settle three testable predictions, written down beforehand:

H1 — A digest is cheaper than re-reading. Ingest the prose docs once, serve every agent a compact synthesized digest, instead of every agent re-reading every document on every task. If true: meaningfully lower cost per review, at equal verdicts.

H2 — Paraphrase destroys policy. A digest can carry “Tier 3 requires human review” without loss. It cannot carry "requireReviewAgent": true without loss — the exact, quotable specifics a reviewer needs to assert a violation die in summary. If true: digest-only agents should systematically miss gate violations that agents holding the literal policy file catch.

H3 — Leaner context reads deeper. Context is paid for twice — once in dollars, once in attention: every redundant document in the window competes with the code under review. If true: reading everything (digest + all docs) should not win; the leanest sufficient context should.

How we tested it

Twenty-two agents reviewed the same Tier-3 pull request in our production monorepo (an LLM-provider integration: 44 files, +2,111 lines, real stakes — billing tables, flow-engine routing). Five arms, differing only in the context-loading step:

ArmContext loadingn
meaninggridcontext-engine digest only (2× context_research)5
diskreads 7+ docs from disk — no context engine5
hybriddigest + reads ALL the docs5
opt-hybriddigest + reads ONE file: the harness config5
coldno convention context at all (baseline)2

Ground rules: one pinned commit, one prompt body, one model — Claude Opus 4.8 — all arms interleaved in a single concurrent batch. Agents were barred from the PR’s comment thread, so earlier experiment rounds could not leak in. Every number comes from the raw agent transcripts, with token usage deduplicated per API request and priced at list prices. Quality is scored against 13 independently verified, real defects in the PR, pattern-matched in each review’s body and manually audited for false positives. Verdict agreement across all arms: 21/22 said REQUEST CHANGES.

So what: the cheapest configuration also won on quality

Scatter chart of cost per review versus verified findings: opt-hybrid is cheapest and most thorough, read-everything hybrid scores worst
ArmCost / reviewFindings (of 13)Gate findings (of 3)Wall clock
meaninggrid$1.777.80.25:34
disk$2.308.60.84:35
hybrid$1.837.40.85:40
opt-hybrid ★$1.569.61.44:55
cold$1.648.00.54:13

★ = the configuration we now ship as the repo’s default skill. Wall clock includes shared contention from running 22 agents concurrently.

H1 — confirmed

The digest-only arm reviewed for $1.77 vs $2.30 for reading the docs (−23%), and the winning digest-plus-one-file arm for $1.56 (−32%) — at equal verdicts. The saving compounds: the digest replaces a stack of documents that would otherwise ride through every subsequent API call’s context.

H2 — confirmed, decisively

The skipped review-agent check — a genuine merge-policy violation in this PR — was caught by 5 of 5 agents holding the literal policy file, and by 1 of 5 digest-only agents. The mechanism is exactly what H2 predicted: to write that finding, an agent must match exact CI check names against exact config fields — a paraphrase isn’t quotable evidence, so digest-only agents hedge and drop it. One direct read restores it.

Side by side: the harness policy requiring the review-agent check, and the CI run where that check was skipped while everything else passed

H3 — confirmed

The read-everything hybrid carried the most context of any arm and scored worst (7.4/13), while the leanest sufficient arm scored best (9.6/13) — and was best of all arms at the single deepest finding, a dead-code bug that requires tracing a call path across three files. Redundant prose didn’t add information; it competed with the code for attention.

Time anatomy of a digest agent versus a docs-reading agent: the context engine wait is visible, doc reads are fast slivers, model time dominates both

One honest footnote: the cold baseline (8.0/13 at $1.64) shows that most of the 13 defects are plain code bugs a strong model finds with no convention context at all. What cold cannot do is the policy half of the job — gates, tiers, merge rules — which is precisely where the arms separate.

Curate the prose into a digest. Read the policy file raw. Don’t read anything twice.

Full disclosure

  • Model: every API call of every agent ran on claude-opus-4-8 (Claude Opus 4.8) — verified from the model field of each transcript line, not assumed. Results may differ on other models; smaller models likely depend more on curated context, not less.
  • Prices: costs use Anthropic list prices at the time of writing; actual billing may differ. Relative comparisons are unaffected.
  • Sample size: n=5 per arm (n=2 for cold), one PR, one repository, one task type. The gate effect (5/5 vs 1/5) is sharp; per-finding rates elsewhere are ±1 agent. Treat this as a strong pilot, not a benchmark.
  • Quality metric: pattern detection over review text (citations excluded), manually audited for false positives. It counts mentions of verified defects, not overall review eloquence.
  • Timing: all 22 agents shared one machine and one API quota; wall-clock numbers include that contention.
  • We corrected ourselves twice: initial token counts were inflated 2–3× (per-line usage duplication in transcripts; fixed by request-ID dedup), and an earlier timeline visual undercounted wall time (fixed by full interval attribution). Both corrections are baked into every number here.
FlowHunt Logo

Ready to grow your business?

Start your free trial today and see results within days.

Now what: steal the loop

What we shipped

The winning arm is now the repo’s default check-context-first skill: pull the context-engine digest (two calls), then read exactly one file from disk — the harness config — and emit a confirmation block citing the snapshot and the exact gates. One measured weakness, one one-line policy fix, re-validated the same day. That loop — measure, fix the context policy, re-validate — is the part we’d encourage you to steal, whatever context engine you use.

What you can do on Monday

  1. Split your agent context in two: prose (conventions, architecture, testing) vs machine-readable policy (CI gates, risk tiers, merge rules).
  2. Digest the prose; never digest the policy. Serve the prose through a context engine — meaninggrid is ours — and make the policy file a mandatory verbatim read in your context gate.
  3. Make context auditable. Version the ingested context; require agents to cite the snapshot id in a confirmation block reviewers can actually check.
  4. Measure before you believe — including us. A handful of agents per arm on your own repository is enough to see the pattern. Score the reviews against verified findings, not vibes.

An open invitation

If you run this experiment on your own repository — same arms, your model, your harness — we would genuinely like to see your numbers, especially if they refute ours. And if your team wants help setting up a context gate like this, or wants to talk about meaninggrid and the harnext stack, reach out to the FlowHunt team or find the open-source harness at harnext.dev . Replications, questions, and corrections all welcome.

Frequently asked questions

Štefan is an AI and software engineer building FlowHunt. Beyond the product itself, he designs agentic software-engineering workflows for developers that cut development costs while raising code quality.

Štefan Moravík
Štefan Moravík
AI & Software Engineer

Build Agent Workflows That Pay for Themselves

FlowHunt helps engineering teams design agentic workflows, context gates, and MCP integrations that cut development costs while raising code quality.