# Token Efficiency by Design
Most AI coding tools accumulate context unboundedly: every turn, file read, error message, and retry appends to a single growing window that the model must attend over in full on the next call. PACE is architecturally different — not as a configuration choice but as a structural property of how agents communicate. Each agent is a stateless function that receives a purpose-built context, does focused work, and discards its conversational state when it finishes.
## The problem with monolithic AI sessions
In a typical long AI coding session, context growth is automatic and irreversible. The system prompt occupies a fixed base. Each file the model reads, each error it retries, each approach it abandons — all of it remains in the window. A session that began solving story-1 still carries every token of that history while attempting story-3. Retries do not reset context; they compound it, adding the failed attempt’s output on top of everything that came before.
The result is that the per-call window grows linearly with session length, so the cumulative tokens billed across the session grow roughly quadratically, while the signal-to-noise ratio falls. Relevant information — the acceptance criteria, the module interface, the test failure — is increasingly diluted by the accumulated residue of earlier turns.
```
Turn 1:  [system prompt: 2k] + [story + files: 8k]                         = 10k tokens
Turn 5:  [system prompt: 2k] + [story + files: 8k] + [4 prior turns: 20k]   = 30k tokens
Turn 15: [system prompt: 2k] + [story + files: 8k] + [14 prior turns: 60k]  = 70k tokens
Turn 25: [system prompt: 2k] + [story + files: 8k] + [24 prior turns: 100k] = 110k tokens
```

This is not a model quality problem — it is a session design problem. PACE solves it by eliminating the session.
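To make the cumulative bill concrete, here is a small sketch assuming a flat 5k tokens of new history per turn (illustrative figures; the rows above round each turn's growth independently):

```python
SYSTEM, STORY, PER_TURN = 2_000, 8_000, 5_000   # illustrative token figures

def context_at(turn: int) -> int:
    """Input tokens the model attends over on one turn: linear in turn count."""
    return SYSTEM + STORY + (turn - 1) * PER_TURN

# Every turn re-bills the entire window, so the cumulative input bill is
# quadratic: ~1.75M input tokens for a 25-turn session whose largest single
# window is ~130k.
print(sum(context_at(t) for t in range(1, 26)))   # 1750000
```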
## The focused context principle
Each PACE agent is a stateless function. It receives exactly the context it needs — no more, no less. When an agent finishes, its conversational state is discarded. The next agent starts fresh with a purpose-built context assembled from structured YAML artifacts produced by its predecessors. There is no shared conversation history between agents.
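In code terms, each agent behaves like a pure function over structured artifacts. A minimal sketch, assuming hypothetical names (run_agent, Artifact, and call_model are illustrations, not PACE's actual API):

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    """A structured YAML artifact (e.g. handoff.yaml) produced by a prior agent."""
    name: str
    body: dict

def call_model(system_prompt: str, context: str) -> dict:
    # Stand-in for the real LLM call; returns a structured result.
    return {"status": "ok"}

def run_agent(system_prompt: str, inputs: list[Artifact]) -> Artifact:
    """Build a purpose-built context, do focused work, return an artifact.
    No conversation history survives past the return."""
    context = "\n\n".join(f"# {a.name}\n{a.body}" for a in inputs)
    return Artifact(name="output", body=call_model(system_prompt, context))
```

The table below shows what each agent's purpose-built input looks like in practice.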
| Agent | Input context | Approximate input tokens |
|---|---|---|
| PRIME | plan.yaml entry + SCRIBE context docs | 6,000–10,000 |
| FORGE | Story Card + context docs + relevant source files | 15,000–50,000 |
| GATE | Story Card + Handoff Note + test runner output | 5,000–12,000 |
| SENTINEL | Story Card + Handoff Note + GATE report + targeted source | 6,000–15,000 |
| CONDUIT | Story Card + Handoff Note + SENTINEL report + CI workflow files | 5,000–10,000 |
| SCRIBE | Story Card + Handoff Note + all three review reports | 8,000–20,000 |
| Full pipeline | all of the above, run sequentially | ~45,000–117,000 total |
The key insight is that these calls are sequential, not simultaneous. The total input token count is the sum of individual calls — not a single context window that must hold everything at once. An equivalent open-ended AI session doing the same work routinely exceeds 300,000 tokens because it never resets.
## Token shape per agent
The following table shows typical token usage and estimated cost per agent for a single story attempt using the default model assignment (FORGE on Sonnet, all other agents on Haiku).
| Agent | Model | Input tokens | Output tokens | Cost (typical) |
|---|---|---|---|---|
| PRIME | Haiku | 6,000–10,000 | 800–1,500 | $0.003–0.006 |
| FORGE | Sonnet | 15,000–50,000 | 5,000–20,000 (multi-turn) | $0.80–2.50 |
| GATE | Haiku | 5,000–12,000 | 1,000–2,500 | $0.003–0.008 |
| SENTINEL | Haiku | 6,000–15,000 | 1,500–3,000 | $0.004–0.009 |
| CONDUIT | Haiku | 5,000–10,000 | 1,000–2,000 | $0.003–0.006 |
| SCRIBE | Haiku | 8,000–20,000 | 2,000–5,000 | $0.005–0.013 |
| Total (1 attempt) | Mixed | ~45,000–117,000 | ~11,000–34,000 | $0.82–2.54 |
FORGE’s multi-turn loop is bounded by forge.max_iterations (default: 35), which caps the worst-case spend per attempt; see Cost-bounded retries below.
## SCRIBE as token compression
SCRIBE maintains .pace/context/engineering.md — a structured, token-dense summary of the codebase that grows incrementally across the sprint. This document is the primary mechanism by which PACE avoids re-reading raw source files on every story.
On Day 1, FORGE reads source files directly to understand the codebase. On Day 2 and beyond, FORGE reads engineering.md instead.
Without SCRIBE (Day 2), FORGE must re-read the raw source to reconstruct a model of the codebase:
```
main.py (800 tokens) + models.py (1,200 tokens) + repository.py (900 tokens)
+ services.py (1,100 tokens) + tests/conftest.py (600 tokens) + ...
= ~12,000 tokens of raw source to understand the module map
```

With SCRIBE (Day 2), FORGE reads one document:
```
engineering.md (~2,500 tokens) = structured module map, interfaces,
test patterns, conventions — already synthesised
```

The sprint-level impact of this compression is substantial:
| Sprint day | Without SCRIBE (raw source reads) | With SCRIBE (engineering.md) | Saving |
|---|---|---|---|
| Day 1 | 12,000–50,000 tokens | 12,000–50,000 tokens (baseline) | — |
| Day 2 | 12,000–50,000 tokens | 2,000–4,000 tokens | ~85% |
| Day 3–30 | 12,000–50,000 per day | 2,000–4,000 per day | ~85% |
| Sprint total (30 days) | 360,000–1,500,000 tokens | 70,000–170,000 tokens | ~80% reduction |
SCRIBE updates engineering.md at the end of every day. It adds new modules, updated interfaces, and new patterns introduced in that day’s story. FORGE on Day N+1 starts with an accurate, token-dense model of the codebase — not a blank slate.
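A sketch of the selection logic this implies, purely for illustration (the engineering.md path is the one named above; the raw-source fallback glob is an assumption):

```python
from pathlib import Path

def forge_codebase_context(repo: Path) -> str:
    """Prefer SCRIBE's synthesised summary; fall back to raw source on Day 1."""
    summary = repo / ".pace" / "context" / "engineering.md"
    if summary.exists():
        return summary.read_text()   # ~2,500 tokens, already synthesised
    # Day 1: no summary exists yet, so pay the one-time cost of raw reads
    return "\n\n".join(p.read_text() for p in sorted(repo.rglob("*.py")))
```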
## The analysis_model split — architectural rationale
The analysis_model configuration key exists because different agents are doing qualitatively different work, and model size should match task type — not be set uniformly for simplicity.
FORGE is doing creative work under constraints: it must generate novel code that compiles, passes tests, follows conventions, and satisfies acceptance criteria. This requires a capable model with strong reasoning. Reaching for a smaller model here trades real quality for marginal savings — the wrong trade.
GATE, SENTINEL, and CONDUIT are doing structured evaluation: given a rubric (acceptance criteria, security checklist, CI checklist) and evidence (test output, handoff note, CI logs), evaluate and produce a structured verdict. This is a retrieval and classification task, not a generative one. A smaller, faster model executes it reliably and precisely.
| Task type | Agent | Requires generative capability? | Recommended model |
|---|---|---|---|
| Code generation, TDD loop, self-correction | FORGE | Yes — creative, multi-turn | Sonnet or Opus |
| Story card generation from plan | PRIME | Moderate — structured extraction | Haiku or Sonnet |
| Criterion evaluation against test output | GATE | No — structured classification | Haiku |
| Security pattern matching and risk assessment | SENTINEL | No — checklist evaluation | Haiku |
| CI/CD configuration review | CONDUIT | No — checklist evaluation | Haiku |
| Context document synthesis | SCRIBE | Moderate — structured summarisation | Haiku or Sonnet |
Switching PRIME, GATE, SENTINEL, CONDUIT, and SCRIBE to Haiku reduces per-run cost by 40–50% with no measurable quality difference on analytical tasks. This is not a trade-off — it is the correct model assignment for the task type.
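In configuration terms the split is two model IDs, one per task type. A sketch of how this might look in pace.config.yaml, loaded here in Python (analysis_model is the key named above; the surrounding schema and model IDs are assumptions):

```python
import yaml

# Hypothetical pace.config.yaml excerpt; only `analysis_model` is named on
# this page, the rest is illustrative.
config = yaml.safe_load("""
model: claude-sonnet-4-6              # FORGE: generative, multi-turn work
analysis_model: claude-haiku-example  # the five analytical agents
""")

ANALYTICAL_AGENTS = {"PRIME", "GATE", "SENTINEL", "CONDUIT", "SCRIBE"}

def model_for(agent: str) -> str:
    """Route analytical agents to the smaller model, FORGE to the larger one."""
    return config["analysis_model"] if agent in ANALYTICAL_AGENTS else config["model"]
```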
## Structured handoffs as dense information
YAML artifacts (handoff.yaml, gate-report.yaml, sentinel-report.yaml) encode decisions in a structured, token-dense format. Each downstream agent reads a precise machine-readable summary rather than a prose narrative containing the same information at three times the token cost.
Conversational format — token-expensive:
```
The tests all passed. I ran pytest -v --tb=short and saw 12 tests pass
with 0 failures. The coverage was 84%. The service correctly handles
the error case. The integration test also passed.
```

Structured YAML — token-efficient:
```yaml
gate_decision: SHIP
test_result:
  exit_code: 0
  tests_run: 12
  tests_passed: 12
  coverage_pct: 84
criteria_evaluation:
  - criterion: "pytest exits 0"
    verdict: PASS
    evidence: "exit code 0, 12/12 tests passed"
```

Both convey the same information. The YAML version uses approximately 60% fewer tokens and is unambiguous — SENTINEL does not need to parse intent from natural language; it reads a typed field.
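Downstream consumption then becomes a typed lookup rather than prose interpretation. A minimal sketch (field names from the example above; the artifact's location on disk is an assumption):

```python
import yaml

with open("gate-report.yaml") as f:   # hypothetical location of the artifact
    report = yaml.safe_load(f)

if report["gate_decision"] == "SHIP":                # typed field, no intent-parsing
    coverage = report["test_result"]["coverage_pct"]  # 84
```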
## The SCOPE pre-check
Before FORGE runs on a story, a lightweight Haiku call (~$0.005) predicts the FORGE cost based on the story’s acceptance criteria count and complexity indicators. If the predicted cost exceeds cost_control.max_story_cost_usd, the story is split automatically before FORGE ever starts.
This prevents the most expensive failure mode in the pipeline: a runaway FORGE loop on an oversized story that exhausts max_iterations without shipping, consuming $5–10 on a story that should have been split into two $1.50 stories. The SCOPE check costs half a cent and saves the expensive case entirely.
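A sketch of the gate's shape (cost_control.max_story_cost_usd is the config key named above; the predictor and splitter are illustrative stand-ins, not PACE's implementation):

```python
def predict_forge_cost(story: dict) -> float:
    # Stand-in for the ~$0.005 Haiku call scoring criteria count and complexity.
    return 0.40 * len(story["acceptance_criteria"])

def split_story(story: dict) -> list[dict]:
    # Stand-in: halve the acceptance criteria into two smaller stories.
    crit = story["acceptance_criteria"]
    mid = len(crit) // 2
    return [{**story, "acceptance_criteria": crit[:mid]},
            {**story, "acceptance_criteria": crit[mid:]}]

def scope_precheck(story: dict, config: dict) -> list[dict]:
    """Split before FORGE ever starts if the predicted cost exceeds the ceiling."""
    if predict_forge_cost(story) > config["cost_control"]["max_story_cost_usd"]:
        return split_story(story)
    return [story]
```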
## Cost-bounded retries
forge.max_iterations (default: 35) is a hard stop on the FORGE tool-call loop. If FORGE does not invoke complete_handoff within 35 LLM turns, the pipeline fails with a structured error. This guarantees a bounded worst-case spend per story regardless of how the model behaves.
An unbounded interactive session has no equivalent guarantee. A model that gets stuck exploring dead ends has no automatic escape valve — the user must notice and intervene.
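The guarantee is structural. A sketch (complete_handoff is the tool name from this page; everything else is illustrative):

```python
class PipelineError(RuntimeError):
    """Structured failure: the loop exhausted its budget without a handoff."""

def forge_loop(story: dict, turn) -> dict:
    """Sketch of the bounded tool loop. `turn` stands in for one LLM turn;
    in this sketch it also executes any non-handoff tool call it chooses."""
    MAX_ITERATIONS = 35   # forge.max_iterations
    for _ in range(MAX_ITERATIONS):
        action = turn(story)
        if action.get("tool") == "complete_handoff":
            return action["arguments"]   # the content that becomes handoff.yaml
    raise PipelineError(
        f"FORGE did not call complete_handoff within {MAX_ITERATIONS} turns")
```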
PROGRESS.md tracks the cost of every attempt including retries and explicitly records “wasted on retries,” so you can see exactly how much budget was consumed by failed attempts versus successful ones. That visibility makes retries a measurable cost line, not a hidden variable.
## Prompt caching — system prompt reuse
Anthropic’s prompt cache stores the key-value state of a content block server-side for ~5 minutes. When the next call sends an identical prefix, the KV state is loaded from cache instead of being recomputed — at 10% of the normal input-token price.
PACE applies cache_control: ephemeral to the system prompt in both complete() and chat(). The system prompt is the largest and most stable part of any call — it carries the product description, source directory rules, tool definitions, and calibration instructions. It is identical across every iteration of the same agent loop.
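With the Anthropic Python SDK, opting in is one extra field on the system content block. A minimal sketch (the prompt and user message are illustrative):

```python
import anthropic

client = anthropic.Anthropic()
FORGE_SYSTEM_PROMPT = "..."   # the large, stable agent prompt

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": FORGE_SYSTEM_PROMPT,             # identical across iterations
        "cache_control": {"type": "ephemeral"},  # cache its KV state ~5 minutes
    }],
    messages=[{"role": "user", "content": "Continue the tool loop."}],
)
print(response.usage.cache_read_input_tokens)    # nonzero from iteration 2 on
```

Where this pays off across PACE's call patterns: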
| Call pattern | Cache benefit |
|---|---|
| FORGE tool loop (up to 35 iterations) | Iterations 2–35 read the system prompt from cache — ~90% discount on those tokens |
| SCRIBE tool loop (up to 30 iterations) | Same — 2–30 iterations hit cache |
| PLANNER (N stories × complete()) | Stories 2–N read the same system prompt from cache |
| GATE, SENTINEL, CONDUIT, PRIME | Single-turn; benefit applies only on the compact-retry path |
Cache pricing:
| Token type | Billed at | Notes |
|---|---|---|
| input_tokens | 100% of input rate | Non-cached new tokens |
| cache_read_input_tokens | 10% of input rate | KV state served from cache |
| cache_creation_input_tokens | 125% of input rate | One-time write premium |
Break-even is two calls with the same system prompt: the first pays 125% to write the cache and the second pays 10% to read it, for 135% of one full-price pass versus 200% for two uncached calls.
PACE’s spend tracker separates all three token types and applies the correct pricing to each. The per-run summary() output shows cache columns when cache tokens are present:
```
[PACE] API usage this run:
  claude-sonnet-4-6: 8,120 in + 31,200 out + 112,500 cache_read + 14,200 cache_create = $0.6401
  Run total: $0.6401
  Cache savings this run: $0.3038 vs uncached
```

The cache_savings_usd figure is the difference between what those tokens would have cost at the full input price and what they actually cost at the cache-read price — a concrete measure of what caching is saving per run.
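The three-rate arithmetic behind those numbers is small. A sketch (rates from the table above; prices are USD per million tokens and must match the model in use):

```python
def call_cost_usd(usage, in_price: float, out_price: float) -> float:
    """Bill each cache token type at its own rate (prices in USD per Mtok)."""
    m = 1_000_000
    return (usage.input_tokens                * in_price        / m
          + usage.cache_creation_input_tokens * in_price * 1.25 / m
          + usage.cache_read_input_tokens     * in_price * 0.10 / m
          + usage.output_tokens               * out_price       / m)

def cache_savings_usd(usage, in_price: float) -> float:
    """Full-price cost of the cache-read tokens minus what they actually cost."""
    return usage.cache_read_input_tokens * in_price * (1 - 0.10) / 1_000_000
```

Assuming Sonnet's $3-per-million input rate, the 112,500 cache-read tokens in the run above work out to 112,500 × 3 × 0.9 / 1,000,000 ≈ $0.3038, matching the saving shown.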
## Streaming — real-time output
Both complete() and chat() use the Anthropic streaming API (messages.stream()). Tokens are printed to stdout as they are generated rather than after the entire response is assembled.
Cost impact: zero. Token count and billing are identical to non-streaming calls. The value is operational:
- Local runs: FORGE output (code, tool calls, reasoning) is visible immediately — no 30–90 second silent wait
- CI logs: GitHub Actions, GitLab, and Jenkins all show streaming stdout in real time, making it easy to spot a stuck iteration before it burns through the remaining budget
- Early error detection: a hallucinated file path or wrong tool call is visible after the first few tokens, not after the full iteration completes
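With the SDK, streaming replaces a blocking create() call with a context manager. A minimal sketch (the user message is illustrative):

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Run the next FORGE iteration."}],
) as stream:
    for text in stream.text_stream:         # tokens print as they are generated
        print(text, end="", flush=True)
    message = stream.get_final_message()    # identical usage/billing to create()
```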
## FORGE context management
While PACE’s stateless architecture eliminates session-level context accumulation, FORGE itself runs a multi-turn tool-use loop within a single story. Without management, that per-story context grows monotonically — stale file reads, repeated test outputs, and write echoes compound across iterations. From v3.1.0, PACE applies four progressive stages to bound this growth.
### The three growth drivers
A representative mid-sprint trace showed ~69,000 tokens in FORGE’s history at implementation start, of which ~90% was noise:
| Driver | Tokens | Share |
|---|---|---|
| Stale file reads (files already rewritten) | ~31,000 | 45% |
| Repeated run_bash outputs (same command, multiple runs) | ~16,000 | 23% |
| write_file echo (full file content in tool result) | ~15,000 | 22% |
| Signal content (acceptance criteria, live results) | ~7,000 | 10% |
### Savings by stage
| Stage | Mechanism | Tokens saved | Cumulative reduction |
|---|---|---|---|
| 1 | Eviction + dedup + suppression (always on) | ~47,000 | ~68% |
| 2 | Haiku compression after RED phase (compression_model) | ~20,000 | ~97% |
| 3 | Pre-seeded file hints (file_hints_enabled) | reduces exploration phase | fewer tokens before GREEN |
| 4 | Forked subcontext (fork_enabled, Phase A) | ~30,000 (implementation phase) | fresh implementation baseline |
Stage 1 requires no configuration. Stages 2–4 are opt-in via forge: keys in pace.config.yaml.
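What opting in might look like (compression_model, file_hints_enabled, and fork_enabled are the keys named above; the nesting, values, and model ID are assumptions):

```python
import yaml

# Hypothetical pace.config.yaml excerpt enabling Stages 2-4.
forge_config = yaml.safe_load("""
forge:
  max_iterations: 35
  compression_model: claude-haiku-example  # Stage 2: compress history after RED
  file_hints_enabled: true                 # Stage 3: pre-seed likely file paths
  fork_enabled: true                       # Stage 4: forked implementation subcontext
""")["forge"]
```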
See FORGE Context Efficiency for configuration, monitoring, and per-stage details.
## Summary: PACE vs a monolithic session
| Property | Monolithic AI session | PACE pipeline |
|---|---|---|
| Context growth | Grows with every turn (unbounded) | Bounded per agent call |
| Context reset | Never (within session) | Every agent starts fresh |
| Codebase re-reading | Every session, every turn | Day 1 only; SCRIBE compression from Day 2 |
| Retry cost | Compounds (prior turns remain in context) | Isolated — retry starts fresh |
| Cost per feature | Unpredictable ($2–20+) | Predictable ($1.50–3.50 typical) |
| Cost visibility | None (session-level at best) | Per-agent, per-story, with retry breakdown |
| Worst-case spend | Unbounded (no escape valve) | Bounded by max_iterations × model cost |
| Prompt caching | Not applicable | System prompt cached in all agent loops; ~90% discount on repeated tokens |
| Streaming output | Varies by tool | All agents stream tokens in real time to stdout and CI logs |
| FORGE message-history management | No (unbounded growth within session) | Stage 1: automatic; Stages 2–4: configurable via forge: keys |