What is a context engine for AI agents?

A context engine ingests a project's documentation once, versions it into an immutable snapshot, and serves agents a compact, cited digest through a single tool call — instead of every agent re-reading every document on every task. In our experiment the digest carried a snapshot_id that agents cite as verifiable evidence of which version of the rules they worked from.

Why shouldn't machine-readable policy be summarized for AI agents?

A summary can carry the meaning of prose without loss, but it destroys the exact, quotable specifics of machine-readable policy — exact CI check names, boolean flags, path patterns. An agent that only holds a paraphrase cannot confidently assert a concrete policy violation. In our test, agents holding the literal policy file caught a skipped required CI check 5 out of 5 times; digest-only agents caught it 1 out of 5 times.

Does giving an AI agent more context make it better?

No — context is paid for twice, once in tokens and once in attention. In our experiment the arm that read everything (digest plus all documentation) scored worst on quality, while the leanest sufficient configuration scored best. Every redundant document competes with the actual code for the model's attention.

Which model was used, and do the results generalize?

Every agent in the experiment ran on Claude Opus 4.8, verified from the model field of each transcript. The sample is small (five agents per arm, one pull request, one repository), so treat it as a strong pilot rather than a benchmark — and replicate it on your own repository before believing it. Smaller models likely depend more on curated context, not less.

What is the recommended context setup for coding agents?

Split agent context in two: serve the prose documentation (conventions, architecture, testing) through a context engine as a cited digest, and make the machine-readable policy file a mandatory verbatim read. Require agents to cite the snapshot id in a confirmation block so reviewers can verify which version of the rules the agent used.

What a Context Engine Actually Buys You: We Ran 22 AI Code Reviewers Five Different Ways

Same pull request, same prompt, same model — only the context loading differed. A context-engine digest plus one policy-file read beat reading the docs on both cost and quality. Full data inside.

AI Agents Context Engineering MCP Agentic Workflows

Try FlowHunt Explore MCP Development

We gave the same code-review task to 22 AI agents. Same pull request, same pinned commit, same prompt, same model — the only variable was how each agent loaded the project’s rules. The cheapest configuration turned out to be the most thorough one, and the reason why says something general about context engineering.

TL;DR: A context-engine digest plus one direct read of the machine-readable policy file beat every other strategy: $1.56 per review and 9.6/13 verified findings — cheaper than reading the docs ($2.30, 8.6/13) and better than the digest alone ($1.77, 7.8/13). Reading everything scored worst of all (7.4/13). All 22 agents ran on Claude Opus 4.8, and 21 of 22 reached the same verdict.

What: a harness, a context engine, and one pull request

What’s a “harness”?

Every serious attempt to let AI agents work in a production repository grows two layers of governance.

The prose layer — conventions, architecture rules, testing standards. In our repo that’s CLAUDE.md and docs/**: “backend is snake_case,” “domain never imports infrastructure,” “all route handlers are async.” Humans read it; agents are told to read it too.

The machine-readable layer — the harness config. Ours is a single JSON file that classifies every path in the repo into risk tiers and attaches enforceable gates to each tier. CI reads it. Merge policy reads it. It’s not advice — it’s policy:

"tier3": {
  "requiredChecks": [
    "lint", "test", "build", "review-agent",
    "harness-smoke", "manual-approval", "expanded-coverage"
  ],
  "mergePolicy": {
    "minApprovals": 2,
    "requireReviewAgent": true,
    "allowSelfMerge": false
  }
}

(Terminology note: “harness” also names the agent runtime itself — the scaffolding of tools, skills, and MCP servers an agent acts through, as in harnext , “the coding agent harness.” In this post, the harness config is the repo’s policy file that such a runtime and the CI both enforce.)

A code reviewer — human or agent — can’t judge “is this PR allowed to merge?” without this file. A Tier-3 PR with the review-agent check skipped is a policy violation even if every test is green. Keep that example in mind; it decides the experiment.

Because both layers exist, the repo mandates a gate: no agent starts work before loading this context — and proving it did, via a confirmation block that reviewers check. The question this post answers is simply: what is the cheapest correct way to satisfy that gate?

Meet harnext and meaninggrid

meaninggrid is the hosted Context Engine from harnext , QualityUnit’s MIT-licensed, provider-agnostic coding-agent harness (six tools — read, write, edit, bash, skill, mcp — npm i -g harnext). The vendor’s pitch for the Context Engine is blunt: “the brain of your agent.” Sources stream into a continuously updated index — “the grid” — and per query the engine “ranks and prunes it into token-efficient context, wired straight into the harness”: continuous index, relevance ranking, dedup and cache. harnext’s headline number is −89% tokens per query on average. That is the vendor’s claim; one purpose of this experiment was to measure, with our own numbers on a real task, what that kind of compression actually saves — and what it costs.

In our deployment the grid ingests the repository’s prose documentation; each ingest produces an immutable, versioned snapshot. Agents query it over MCP (meaninggrid.harnext.dev/mcp) with a single context_research call and receive a synthesized, cited digest stamped with the snapshot_id, which the agent must cite in its confirmation block — auditable context made concrete.

Context engine pipeline: prose docs flow through ingest, versioned snapshot and context_research into the agent, while the machine-readable policy file is read directly

What the gate produces — the confirmation block (example; project specifics elided):

Loaded via: optimized hybrid (context-engine digest + policy file only).

- context_research call #1 (conventions / layering / testing / security /
  risk tiers) -> snapshot_id 9483af61cf8a40a2a0d790c7047fcf08
- context_research call #2 (LLM-provider integration checklist +
  flow-engine extra-care rules) -> snapshot_id 9483af61cf8a40a2a0d790c7047fcf08
- Read harness config (full) from disk for exact tier patterns,
  requiredChecks, mergePolicy, evidenceConfig.
  Did NOT read CLAUDE.md or docs/* (covered by the digest).

The snapshot_id is real — a reviewer can verify exactly which version of the rules the agent worked from.

Three hypotheses

The experiment was designed to settle three testable predictions, written down beforehand:

H1 — A digest is cheaper than re-reading. Ingest the prose docs once, serve every agent a compact synthesized digest, instead of every agent re-reading every document on every task. If true: meaningfully lower cost per review, at equal verdicts.

H2 — Paraphrase destroys policy. A digest can carry “Tier 3 requires human review” without loss. It cannot carry "requireReviewAgent": true without loss — the exact, quotable specifics a reviewer needs to assert a violation die in summary. If true: digest-only agents should systematically miss gate violations that agents holding the literal policy file catch.

H3 — Leaner context reads deeper. Context is paid for twice — once in dollars, once in attention: every redundant document in the window competes with the code under review. If true: reading everything (digest + all docs) should not win; the leanest sufficient context should.

How we tested it

Twenty-two agents reviewed the same Tier-3 pull request in our production monorepo (an LLM-provider integration: 44 files, +2,111 lines, real stakes — billing tables, flow-engine routing). Five arms, differing only in the context-loading step:

Arm	Context loading	n
meaninggrid	context-engine digest only (2× `context_research`)	5
disk	reads 7+ docs from disk — no context engine	5
hybrid	digest + reads ALL the docs	5
opt-hybrid	digest + reads ONE file: the harness config	5
cold	no convention context at all (baseline)	2

Ground rules: one pinned commit, one prompt body, one model — Claude Opus 4.8 — all arms interleaved in a single concurrent batch. Agents were barred from the PR’s comment thread, so earlier experiment rounds could not leak in. Every number comes from the raw agent transcripts, with token usage deduplicated per API request and priced at list prices. Quality is scored against 13 independently verified, real defects in the PR, pattern-matched in each review’s body and manually audited for false positives. Verdict agreement across all arms: 21/22 said REQUEST CHANGES.

So what: the cheapest configuration also won on quality

Scatter chart of cost per review versus verified findings: opt-hybrid is cheapest and most thorough, read-everything hybrid scores worst

Arm	Cost / review	Findings (of 13)	Gate findings (of 3)	Wall clock
meaninggrid	$1.77	7.8	0.2	5:34
disk	$2.30	8.6	0.8	4:35
hybrid	$1.83	7.4	0.8	5:40
opt-hybrid ★	$1.56	9.6	1.4	4:55
cold	$1.64	8.0	0.5	4:13

★ = the configuration we now ship as the repo’s default skill. Wall clock includes shared contention from running 22 agents concurrently.

H1 — confirmed

The digest-only arm reviewed for $1.77 vs $2.30 for reading the docs (−23%), and the winning digest-plus-one-file arm for $1.56 (−32%) — at equal verdicts. The saving compounds: the digest replaces a stack of documents that would otherwise ride through every subsequent API call’s context.

H2 — confirmed, decisively

The skipped review-agent check — a genuine merge-policy violation in this PR — was caught by 5 of 5 agents holding the literal policy file, and by 1 of 5 digest-only agents. The mechanism is exactly what H2 predicted: to write that finding, an agent must match exact CI check names against exact config fields — a paraphrase isn’t quotable evidence, so digest-only agents hedge and drop it. One direct read restores it.

Side by side: the harness policy requiring the review-agent check, and the CI run where that check was skipped while everything else passed

H3 — confirmed

The read-everything hybrid carried the most context of any arm and scored worst (7.4/13), while the leanest sufficient arm scored best (9.6/13) — and was best of all arms at the single deepest finding, a dead-code bug that requires tracing a call path across three files. Redundant prose didn’t add information; it competed with the code for attention.

Time anatomy of a digest agent versus a docs-reading agent: the context engine wait is visible, doc reads are fast slivers, model time dominates both

One honest footnote: the cold baseline (8.0/13 at $1.64) shows that most of the 13 defects are plain code bugs a strong model finds with no convention context at all. What cold cannot do is the policy half of the job — gates, tiers, merge rules — which is precisely where the arms separate.

Curate the prose into a digest. Read the policy file raw. Don’t read anything twice.

Full disclosure

Model: every API call of every agent ran on claude-opus-4-8 (Claude Opus 4.8) — verified from the model field of each transcript line, not assumed. Results may differ on other models; smaller models likely depend more on curated context, not less.
Prices: costs use Anthropic list prices at the time of writing; actual billing may differ. Relative comparisons are unaffected.
Sample size: n=5 per arm (n=2 for cold), one PR, one repository, one task type. The gate effect (5/5 vs 1/5) is sharp; per-finding rates elsewhere are ±1 agent. Treat this as a strong pilot, not a benchmark.
Quality metric: pattern detection over review text (citations excluded), manually audited for false positives. It counts mentions of verified defects, not overall review eloquence.
Timing: all 22 agents shared one machine and one API quota; wall-clock numbers include that contention.
We corrected ourselves twice: initial token counts were inflated 2–3× (per-line usage duplication in transcripts; fixed by request-ID dedup), and an earlier timeline visual undercounted wall time (fixed by full interval attribution). Both corrections are baked into every number here.

Now what: steal the loop

What we shipped

The winning arm is now the repo’s default check-context-first skill: pull the context-engine digest (two calls), then read exactly one file from disk — the harness config — and emit a confirmation block citing the snapshot and the exact gates. One measured weakness, one one-line policy fix, re-validated the same day. That loop — measure, fix the context policy, re-validate — is the part we’d encourage you to steal, whatever context engine you use.

What you can do on Monday

Split your agent context in two: prose (conventions, architecture, testing) vs machine-readable policy (CI gates, risk tiers, merge rules).
Digest the prose; never digest the policy. Serve the prose through a context engine — meaninggrid is ours — and make the policy file a mandatory verbatim read in your context gate.
Make context auditable. Version the ingested context; require agents to cite the snapshot id in a confirmation block reviewers can actually check.
Measure before you believe — including us. A handful of agents per arm on your own repository is enough to see the pattern. Score the reviews against verified findings, not vibes.

An open invitation

If you run this experiment on your own repository — same arms, your model, your harness — we would genuinely like to see your numbers, especially if they refute ours. And if your team wants help setting up a context gate like this, or wants to talk about meaninggrid and the harnext stack, reach out to the FlowHunt team or find the open-source harness at harnext.dev . Replications, questions, and corrections all welcome.

Frequently asked questions

: A context engine ingests a project's documentation once, versions it into an immutable snapshot, and serves agents a compact, cited digest through a single tool call — instead of every agent re-reading every document on every task. In our experiment the digest carried a snapshot_id that agents cite as verifiable evidence of which version of the rules they worked from.
: A summary can carry the meaning of prose without loss, but it destroys the exact, quotable specifics of machine-readable policy — exact CI check names, boolean flags, path patterns. An agent that only holds a paraphrase cannot confidently assert a concrete policy violation. In our test, agents holding the literal policy file caught a skipped required CI check 5 out of 5 times; digest-only agents caught it 1 out of 5 times.
: No — context is paid for twice, once in tokens and once in attention. In our experiment the arm that read everything (digest plus all documentation) scored worst on quality, while the leanest sufficient configuration scored best. Every redundant document competes with the actual code for the model's attention.
: Every agent in the experiment ran on Claude Opus 4.8, verified from the model field of each transcript. The sample is small (five agents per arm, one pull request, one repository), so treat it as a strong pilot rather than a benchmark — and replicate it on your own repository before believing it. Smaller models likely depend more on curated context, not less.
: Split agent context in two: serve the prose documentation (conventions, architecture, testing) through a context engine as a cited digest, and make the machine-readable policy file a mandatory verbatim read. Require agents to cite the snapshot id in a confirmation block so reviewers can verify which version of the rules the agent used.

Build Agent Workflows That Pay for Themselves

FlowHunt helps engineering teams design agentic workflows, context gates, and MCP integrations that cut development costs while raising code quality.

Try FlowHunt Explore MCP Development

What a Context Engine Actually Buys You: We Ran 22 AI Code Reviewers Five Different Ways