Token Efficiency by Design

Most AI coding tools accumulate context unboundedly: every turn, file read, error message, and retry appends to a single growing window that the model must attend over in full on the next call. PACE is architecturally different — not as a configuration choice but as a structural property of how agents communicate. Each agent is a stateless function that receives a purpose-built context, does focused work, and discards its conversational state when it finishes.

The problem with monolithic AI sessions

In a typical long AI coding session, context growth is automatic and irreversible. The system prompt occupies a fixed base. Each file the model reads, each error it retries, each approach it abandons — all of it remains in the window. A session that began solving story-1 still carries every token of that history while attempting story-3. Retries do not reset context; they compound it, adding the failed attempt’s output on top of everything that came before.

The result: the per-turn window grows linearly, while the cumulative token count billed over the session grows roughly quadratically, because every turn re-processes the entire window. Meanwhile the signal-to-noise ratio falls. Relevant information (the acceptance criteria, the module interface, the test failure) is increasingly diluted by the accumulated residue of earlier turns.

Turn 1: [system prompt: 2k] + [story + files: 8k] = 10k tokens
Turn 5: [system prompt: 2k] + [story + files: 8k] + [4 prior turns: 20k] = 30k tokens
Turn 15: [system prompt: 2k] + [story + files: 8k] + [14 prior turns: 60k] = 70k tokens
Turn 25: [system prompt: 2k] + [story + files: 8k] + [24 prior turns: 100k] = 110k tokens
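
As a rough model of the arithmetic above, the billed total over a session can be sketched as follows. This is illustrative only: it assumes a fixed base and, for simplicity, a constant ~5,000 tokens of history added per turn (the worked example above tapers slightly lower).

```python
SYSTEM = 2_000    # system prompt
TASK = 8_000      # story + files
PER_TURN = 5_000  # assumed average history added per turn

def window_at(turn: int) -> int:
    """Tokens the model must attend over on a given turn."""
    return SYSTEM + TASK + (turn - 1) * PER_TURN

def cumulative_billed(turns: int) -> int:
    """Input tokens billed across the whole session: each turn
    re-processes the entire window, so the total is quadratic."""
    return sum(window_at(t) for t in range(1, turns + 1))

print(window_at(1))            # 10000
print(cumulative_billed(25))   # 1750000 input tokens billed over 25 turns
```

The window grows linearly, but the total billed is the sum of all windows, which is what makes a long monolithic session expensive.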

This is not a model quality problem — it is a session design problem. PACE solves it by eliminating the session.

The focused context principle

Each PACE agent is a stateless function. It receives exactly the context it needs — no more, no less. When an agent finishes, its conversational state is discarded. The next agent starts fresh with a purpose-built context assembled from structured YAML artifacts produced by its predecessors. There is no shared conversation history between agents.

| Agent | Input context | Approximate input tokens |
| --- | --- | --- |
| PRIME | plan.yaml entry + SCRIBE context docs | 6,000–10,000 |
| FORGE | Story Card + context docs + relevant source files | 15,000–50,000 |
| GATE | Story Card + Handoff Note + test runner output | 5,000–12,000 |
| SENTINEL | Story Card + Handoff Note + GATE report + targeted source | 6,000–15,000 |
| CONDUIT | Story Card + Handoff Note + SENTINEL report + CI workflow files | 5,000–10,000 |
| SCRIBE | Story Card + Handoff Note + all three review reports | 8,000–20,000 |
| Full pipeline | sequential, not simultaneous | ~45,000–117,000 total |

The key insight is that these calls are sequential, not simultaneous. The total input token count is the sum of individual calls — not a single context window that must hold everything at once. An equivalent open-ended AI session doing the same work routinely exceeds 300,000 tokens because it never resets.
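The pipeline total in the table is literally the sum of the per-agent ranges, which this small check reproduces:

```python
# Per-agent input-token ranges from the table above (low, high).
agent_inputs = {
    "PRIME":    (6_000, 10_000),
    "FORGE":    (15_000, 50_000),
    "GATE":     (5_000, 12_000),
    "SENTINEL": (6_000, 15_000),
    "CONDUIT":  (5_000, 10_000),
    "SCRIBE":   (8_000, 20_000),
}

# Sequential calls sum; no single window ever holds the total.
low = sum(lo for lo, _ in agent_inputs.values())
high = sum(hi for _, hi in agent_inputs.values())
print(f"pipeline total: {low:,}-{high:,} input tokens")  # 45,000-117,000
```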

Token shape per agent

The following table shows typical token usage and estimated cost per agent for a single story attempt using the default model assignment (FORGE on Sonnet, all other agents on Haiku).

| Agent | Model | Input tokens | Output tokens | Cost (typical) |
| --- | --- | --- | --- | --- |
| PRIME | Haiku | 6,000–10,000 | 800–1,500 | $0.003–0.006 |
| FORGE | Sonnet | 15,000–50,000 | 5,000–20,000 (multi-turn) | $0.80–2.50 |
| GATE | Haiku | 5,000–12,000 | 1,000–2,500 | $0.003–0.008 |
| SENTINEL | Haiku | 6,000–15,000 | 1,500–3,000 | $0.004–0.009 |
| CONDUIT | Haiku | 5,000–10,000 | 1,000–2,000 | $0.003–0.006 |
| SCRIBE | Haiku | 8,000–20,000 | 2,000–5,000 | $0.005–0.013 |
| Total (1 attempt) | Mixed | ~45,000–117,000 | ~11,000–34,000 | $0.82–2.54 |

FORGE’s multi-turn loop is bounded by forge.max_iterations (default: 35). An unbounded interactive session has no equivalent guarantee — a model exploring dead ends has no automatic escape valve.

SCRIBE as token compression

SCRIBE maintains .pace/context/engineering.md — a structured, token-dense summary of the codebase that grows incrementally across the sprint. This document is the primary mechanism by which PACE avoids re-reading raw source files on every story.

On Day 1, FORGE reads source files directly to understand the codebase. On Day 2 and beyond, FORGE reads engineering.md instead.

Without SCRIBE (Day 2), FORGE must re-read the raw source to reconstruct a model of the codebase:

main.py (800 tokens) + models.py (1,200 tokens) + repository.py (900 tokens)
+ services.py (1,100 tokens) + tests/conftest.py (600 tokens) + ...
= ~12,000 tokens of raw source to understand the module map

With SCRIBE (Day 2), FORGE reads one document:

engineering.md (~2,500 tokens) = structured module map, interfaces,
test patterns, conventions — already synthesised

The sprint-level impact of this compression is substantial:

| Sprint day | Without SCRIBE (raw source reads) | With SCRIBE (engineering.md) | Saving |
| --- | --- | --- | --- |
| Day 1 | 12,000–50,000 tokens | 12,000–50,000 tokens (baseline) | — |
| Day 2 | 12,000–50,000 tokens | 2,000–4,000 tokens | ~85% |
| Day 3–30 | 12,000–50,000 per day | 2,000–4,000 per day | ~85% |
| Sprint total (30 days) | 360,000–1,500,000 tokens | 70,000–170,000 tokens | ~80% reduction |

SCRIBE updates engineering.md at the end of every day. It adds new modules, updated interfaces, and new patterns introduced in that day’s story. FORGE on Day N+1 starts with an accurate, token-dense model of the codebase — not a blank slate.
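The sprint-total arithmetic is simple to reproduce. Day 1 reads raw source in both cases; Days 2–30 read engineering.md instead (the table above rounds the high end to ~170,000):

```python
DAYS = 30
raw_low, raw_high = 12_000, 50_000   # raw source reads per day
doc_low, doc_high = 2_000, 4_000     # engineering.md read per day

# Without SCRIBE: raw reads every day of the sprint.
without = (raw_low * DAYS, raw_high * DAYS)

# With SCRIBE: raw reads on Day 1 only, engineering.md thereafter.
with_scribe = (raw_low + doc_low * (DAYS - 1),
               raw_high + doc_high * (DAYS - 1))

print(without)      # (360000, 1500000)
print(with_scribe)  # (70000, 166000)
```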

The analysis_model split — architectural rationale

The analysis_model configuration key exists because different agents are doing qualitatively different work, and model size should match task type — not be set uniformly for simplicity.

FORGE is doing creative work under constraints: it must generate novel code that compiles, passes tests, follows conventions, and satisfies acceptance criteria. This requires a capable model with strong reasoning. Reaching for a smaller model here trades real quality for marginal savings — the wrong trade.

GATE, SENTINEL, and CONDUIT are doing structured evaluation: given a rubric (acceptance criteria, security checklist, CI checklist) and evidence (test output, handoff note, CI logs), evaluate and produce a structured verdict. This is a retrieval and classification task, not a generative one. A smaller, faster model executes it reliably and precisely.

| Task type | Agent | Requires generative capability? | Recommended model |
| --- | --- | --- | --- |
| Code generation, TDD loop, self-correction | FORGE | Yes — creative, multi-turn | Sonnet or Opus |
| Story card generation from plan | PRIME | Moderate — structured extraction | Haiku or Sonnet |
| Criterion evaluation against test output | GATE | No — structured classification | Haiku |
| Security pattern matching and risk assessment | SENTINEL | No — checklist evaluation | Haiku |
| CI/CD configuration review | CONDUIT | No — checklist evaluation | Haiku |
| Context document synthesis | SCRIBE | Moderate — structured summarisation | Haiku or Sonnet |

Switching PRIME, GATE, SENTINEL, CONDUIT, and SCRIBE to Haiku reduces per-run cost by 40–50% with no measurable quality difference on analytical tasks. This is not a trade-off — it is the correct model assignment for the task type.

Structured handoffs as dense information

YAML artifacts (handoff.yaml, gate-report.yaml, sentinel-report.yaml) encode decisions in a structured, token-dense format. Each downstream agent reads a precise machine-readable summary rather than a prose narrative containing the same information at three times the token cost.

Conversational format — token-expensive:

The tests all passed. I ran pytest -v --tb=short and saw 12 tests pass
with 0 failures. The coverage was 84%. The service correctly handles
the error case. The integration test also passed.

Structured YAML — token-efficient:

gate_decision: SHIP
test_result:
  exit_code: 0
  tests_run: 12
  tests_passed: 12
  coverage_pct: 84
criteria_evaluation:
  - criterion: "pytest exits 0"
    verdict: PASS
    evidence: "exit code 0, 12/12 tests passed"

Both convey the same information. The YAML version uses approximately 60% fewer tokens and is unambiguous — SENTINEL does not need to parse intent from natural language; it reads a typed field.

The SCOPE pre-check

Before FORGE runs on a story, a lightweight Haiku call (~$0.005) predicts the FORGE cost based on the story’s acceptance criteria count and complexity indicators. If the predicted cost exceeds cost_control.max_story_cost_usd, the story is split automatically before FORGE ever starts.

This prevents the most expensive failure mode in the pipeline: a runaway FORGE loop on an oversized story that exhausts max_iterations without shipping, consuming $5–10 on a story that should have been split into two $1.50 stories. The SCOPE check costs half a cent and saves the expensive case entirely.
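The shape of that decision can be sketched as below. The heuristic weights and function names here are illustrative assumptions, not PACE's actual cost model; only the threshold key (cost_control.max_story_cost_usd) comes from this page.

```python
# Hypothetical SCOPE-style pre-check: predict FORGE cost from story
# size indicators and split oversized stories before FORGE runs.
def predict_forge_cost_usd(criteria_count: int, files_touched: int) -> float:
    """Illustrative cost model: more acceptance criteria and more files
    mean more FORGE iterations, each with input + output token spend."""
    return round(0.40 + 0.25 * criteria_count + 0.15 * files_touched, 2)

def scope_check(criteria_count: int, files_touched: int,
                max_story_cost_usd: float = 3.50):
    predicted = predict_forge_cost_usd(criteria_count, files_touched)
    if predicted > max_story_cost_usd:
        return ("SPLIT", predicted)    # split before FORGE ever starts
    return ("PROCEED", predicted)

print(scope_check(3, 2))    # small story proceeds
print(scope_check(12, 8))   # oversized story is split
```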

Cost-bounded retries

forge.max_iterations (default: 35) is a hard stop on the FORGE tool-call loop. If FORGE does not invoke complete_handoff within 35 LLM turns, the pipeline fails with a structured error. This guarantees a bounded worst-case spend per story regardless of how the model behaves.

An unbounded interactive session has no equivalent guarantee. A model that gets stuck exploring dead ends has no automatic escape valve — the user must notice and intervene.
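A minimal sketch of such a hard-capped loop, with illustrative names (only forge.max_iterations and complete_handoff come from this page):

```python
MAX_ITERATIONS = 35  # forge.max_iterations default

class IterationBudgetExceeded(RuntimeError):
    """Structured failure raised when the loop budget is exhausted."""

def run_forge_loop(step):
    """Call step(i) until it signals completion; fail with a structured
    error instead of looping forever."""
    for i in range(1, MAX_ITERATIONS + 1):
        if step(i) == "complete_handoff":
            return i
    raise IterationBudgetExceeded(
        f"no complete_handoff within {MAX_ITERATIONS} turns"
    )

# A toy step that completes on turn 4:
turns = run_forge_loop(lambda i: "complete_handoff" if i == 4 else "tool_call")
print(turns)  # 4
```

Because the loop count is capped and each iteration's spend is bounded by the model's context and output limits, the worst-case cost per story is a known product, not an open-ended risk.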

PROGRESS.md tracks the cost of every attempt including retries and explicitly records “wasted on retries,” so you can see exactly how much budget was consumed by failed attempts versus successful ones. That visibility makes retries a measurable cost line, not a hidden variable.

Prompt caching — system prompt reuse

Anthropic’s prompt cache stores the key-value state of a content block server-side for ~5 minutes. When the next call sends an identical prefix, the KV state is loaded from cache instead of being recomputed — at 10% of the normal input-token price.

PACE applies cache_control: ephemeral to the system prompt in both complete() and chat(). The system prompt is the largest and most stable part of any call — it carries the product description, source directory rules, tool definitions, and calibration instructions. It is identical across every iteration of the same agent loop.
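The request shape follows Anthropic's documented prompt-caching format: the system prompt is sent as a content block tagged with cache_control so repeat calls read it from cache. The sketch below builds that payload as a plain dict; the prompt text is illustrative, and PACE's internal wrapper functions are not shown.

```python
# Illustrative system prompt; in PACE this carries the product
# description, source directory rules, tool definitions, and
# calibration instructions, and is identical across loop iterations.
SYSTEM_PROMPT = "You are FORGE. Implement the story via TDD..."

def build_request(messages):
    """Assemble a Messages API payload with a cacheable system block."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 4096,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks this block as cacheable (~5 minute TTL).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }

req = build_request([{"role": "user", "content": "Implement story-3."}])
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Because the cached prefix must be byte-identical, keeping the system prompt stable across iterations is what makes the cache hit rate high in the agent loops.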

| Call pattern | Cache benefit |
| --- | --- |
| FORGE tool loop (up to 35 iterations) | Iterations 2–35 read the system prompt from cache — ~90% discount on those tokens |
| SCRIBE tool loop (up to 30 iterations) | Same — iterations 2–30 hit cache |
| PLANNER (N stories × complete()) | Stories 2–N read the same system prompt from cache |
| GATE, SENTINEL, CONDUIT, PRIME | Single-turn; benefit applies only on the compact-retry path |

Cache pricing:

| Token type | Billed at | Notes |
| --- | --- | --- |
| input_tokens | 100% of input rate | Non-cached new tokens |
| cache_read_input_tokens | 10% of input rate | KV state served from cache |
| cache_creation_input_tokens | 125% of input rate | One-time write premium |

Break-even is 2 calls with the same system prompt. The write premium of the first call is recovered on the second.
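The break-even arithmetic, in percent of one uncached send of the system prompt:

```python
# Cache pricing from the table above, as percent of the base input rate.
UNCACHED_PCT = 100  # input_tokens
WRITE_PCT = 125     # cache_creation_input_tokens (one-time premium)
READ_PCT = 10       # cache_read_input_tokens

def cost_pct(calls: int, cached: bool) -> int:
    """Cost of sending the same system prompt `calls` times."""
    if not cached:
        return calls * UNCACHED_PCT
    # First call writes the cache; subsequent calls read it.
    return WRITE_PCT + (calls - 1) * READ_PCT

print(cost_pct(2, cached=False))  # 200
print(cost_pct(2, cached=True))   # 135 -> caching wins from the second call
```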

PACE’s spend tracker separates all three token types and applies the correct pricing to each. The per-run summary() output shows cache columns when cache tokens are present:

[PACE] API usage this run:
claude-sonnet-4-6: 8,120 in + 31,200 out + 112,500 cache_read + 14,200 cache_create = $0.6401
Run total: $0.6401
Cache savings this run: $0.3038 vs uncached

The cache_savings_usd figure is the difference between what those tokens would have cost at full input price versus the actual cache-read price — a concrete measure of what caching is saving per run.

Streaming — real-time output

Both complete() and chat() use the Anthropic streaming API (messages.stream()). Tokens are printed to stdout as they are generated rather than after the entire response is assembled.

Cost impact: zero. Token count and billing are identical to non-streaming calls. The value is operational:

  • Local runs: FORGE output (code, tool calls, reasoning) is visible immediately — no 30–90 second silent wait
  • CI logs: GitHub Actions, GitLab, and Jenkins all show streaming stdout in real time, making it easy to spot a stuck iteration before it burns through the remaining budget
  • Early error detection: a hallucinated file path or wrong tool call is visible after the first few tokens, not after the full iteration completes

FORGE context management

While PACE’s stateless architecture eliminates session-level context accumulation, FORGE itself runs a multi-turn tool-use loop within a single story. Without management, that per-story context grows monotonically — stale file reads, repeated test outputs, and write echoes compound across iterations. From v3.1.0, PACE applies four progressive stages to bound this growth.

The three growth drivers

A representative mid-sprint trace showed ~69,000 tokens in FORGE’s history at implementation start, of which ~90% was noise:

| Driver | Tokens | Share |
| --- | --- | --- |
| Stale file reads (files already rewritten) | ~31,000 | 45% |
| Repeated run_bash outputs (same command, multiple runs) | ~16,000 | 23% |
| write_file echo (full file content in tool result) | ~15,000 | 22% |
| Signal content (acceptance criteria, live results) | ~7,000 | 10% |

Savings by stage

| Stage | Mechanism | Tokens saved | Cumulative reduction |
| --- | --- | --- | --- |
| 1 | Eviction + dedup + suppression (always on) | ~47,000 | ~68% |
| 2 | Haiku compression after RED phase (compression_model) | ~20,000 | ~97% |
| 3 | Pre-seeded file hints (file_hints_enabled) | reduces exploration phase | fewer tokens before GREEN |
| 4 | Forked subcontext (fork_enabled, Phase A) | ~30,000 (implementation phase) | fresh implementation baseline |

Stage 1 requires no configuration. Stages 2–4 are opt-in via forge: keys in pace.config.yaml.
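An illustrative pace.config.yaml fragment enabling the opt-in stages. The key names are those used on this page; the values shown are examples, not documented defaults (apart from max_iterations: 35).

```yaml
forge:
  max_iterations: 35          # hard stop on the tool-call loop
  compression_model: haiku    # Stage 2: compress history after the RED phase
  file_hints_enabled: true    # Stage 3: pre-seed file hints
  fork_enabled: true          # Stage 4: forked subcontext (Phase A)
```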

See FORGE Context Efficiency for configuration, monitoring, and per-stage details.


Summary: PACE vs a monolithic session

| Property | Monolithic AI session | PACE pipeline |
| --- | --- | --- |
| Context growth | Grows with every turn (unbounded) | Bounded per agent call |
| Context reset | Never (within session) | Every agent starts fresh |
| Codebase re-reading | Every session, every turn | Day 1 only; SCRIBE compression from Day 2 |
| Retry cost | Compounds (prior turns remain in context) | Isolated — retry starts fresh |
| Cost per feature | Unpredictable ($2–20+) | Predictable ($1.50–3.50 typical) |
| Cost visibility | None (session-level at best) | Per-agent, per-story, with retry breakdown |
| Worst-case spend | Unbounded (no escape valve) | Bounded by max_iterations × model cost |
| Prompt caching | Not applicable | System prompt cached in all agent loops; ~90% discount on repeated tokens |
| Streaming output | Varies by tool | All agents stream tokens in real time to stdout and CI logs |
| FORGE message-history management | No (unbounded growth within session) | Stage 1: automatic; Stages 2–4: configurable via forge: keys |