Token Efficiency by Design

Most AI coding tools accumulate context unboundedly: every turn, file read, error message, and retry appends to a single growing window that the model must attend over in full on the next call. PACE is architecturally different — not as a configuration choice but as a structural property of how agents communicate. Each agent is a stateless function that receives a purpose-built context, does focused work, and discards its conversational state when it finishes.

The problem with monolithic AI sessions

In a typical long AI coding session, context growth is automatic and irreversible. The system prompt occupies a fixed base. Each file the model reads, each error it retries, each approach it abandons — all of it remains in the window. A session that began solving story-1 still carries every token of that history while attempting story-3. Retries do not reset context; they compound it, adding the failed attempt’s output on top of everything that came before.

The result: the per-turn window grows linearly, while the cumulative token count billed over the session grows roughly quadratically, because every turn re-processes the entire window. Meanwhile the signal-to-noise ratio falls. Relevant information (the acceptance criteria, the module interface, the test failure) is increasingly diluted by the accumulated residue of earlier turns.

Turn 1: [system prompt: 2k] + [story + files: 8k] = 10k tokens
Turn 5: [system prompt: 2k] + [story + files: 8k] + [4 prior turns: 20k] = 30k tokens
Turn 15: [system prompt: 2k] + [story + files: 8k] + [14 prior turns: 60k] = 70k tokens
Turn 25: [system prompt: 2k] + [story + files: 8k] + [24 prior turns: 100k] = 110k tokens
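
As a rough model of the arithmetic above, the billed total over a session can be sketched as follows. This is illustrative only: it assumes a fixed base and, for simplicity, a constant ~5,000 tokens of history added per turn (the worked example above tapers slightly lower).

```python
SYSTEM = 2_000    # system prompt
TASK = 8_000      # story + files
PER_TURN = 5_000  # assumed average history added per turn

def window_at(turn: int) -> int:
    """Tokens the model must attend over on a given turn."""
    return SYSTEM + TASK + (turn - 1) * PER_TURN

def cumulative_billed(turns: int) -> int:
    """Input tokens billed across the whole session: each turn
    re-processes the entire window, so the total is quadratic."""
    return sum(window_at(t) for t in range(1, turns + 1))

print(window_at(1))            # 10000
print(cumulative_billed(25))   # 1750000 input tokens billed over 25 turns
```

The window grows linearly, but the total billed is the sum of all windows, which is what makes a long monolithic session expensive.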

This is not a model quality problem — it is a session design problem. PACE solves it by eliminating the session.

The focused context principle

Each PACE agent is a stateless function. It receives exactly the context it needs — no more, no less. When an agent finishes, its conversational state is discarded. The next agent starts fresh with a purpose-built context assembled from structured YAML artifacts produced by its predecessors. There is no shared conversation history between agents.

| Agent | Input context | Approximate input tokens |
| --- | --- | --- |
| PRIME | plan.yaml entry + SCRIBE context docs | 6,000–10,000 |
| FORGE | Story Card + context docs + relevant source files | 15,000–50,000 |
| GATE | Story Card + Handoff Note + test runner output | 5,000–12,000 |
| SENTINEL | Story Card + Handoff Note + GATE report + targeted source | 6,000–15,000 |
| CONDUIT | Story Card + Handoff Note + SENTINEL report + CI workflow files | 5,000–10,000 |
| SCRIBE | Story Card + Handoff Note + all three review reports | 8,000–20,000 |
| Full pipeline | sequential, not simultaneous | ~45,000–117,000 total |

The key insight is that these calls are sequential, not simultaneous. The total input token count is the sum of individual calls — not a single context window that must hold everything at once. An equivalent open-ended AI session doing the same work routinely exceeds 300,000 tokens because it never resets.
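The pipeline total in the table is literally the sum of the per-agent ranges, which this small check reproduces:

```python
# Per-agent input-token ranges from the table above (low, high).
agent_inputs = {
    "PRIME":    (6_000, 10_000),
    "FORGE":    (15_000, 50_000),
    "GATE":     (5_000, 12_000),
    "SENTINEL": (6_000, 15_000),
    "CONDUIT":  (5_000, 10_000),
    "SCRIBE":   (8_000, 20_000),
}

# Sequential calls sum; no single window ever holds the total.
low = sum(lo for lo, _ in agent_inputs.values())
high = sum(hi for _, hi in agent_inputs.values())
print(f"pipeline total: {low:,}-{high:,} input tokens")  # 45,000-117,000
```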

Token shape per agent

The following table shows typical token usage and estimated cost per agent for a single story attempt using the default model assignment (FORGE on Sonnet, all other agents on Haiku).

| Agent | Model | Input tokens | Output tokens | Cost (typical) |
| --- | --- | --- | --- | --- |
| PRIME | Haiku | 6,000–10,000 | 800–1,500 | $0.003–0.006 |
| FORGE | Sonnet | 15,000–50,000 | 5,000–20,000 (multi-turn) | $0.80–2.50 |
| GATE | Haiku | 5,000–12,000 | 1,000–2,500 | $0.003–0.008 |
| SENTINEL | Haiku | 6,000–15,000 | 1,500–3,000 | $0.004–0.009 |
| CONDUIT | Haiku | 5,000–10,000 | 1,000–2,000 | $0.003–0.006 |
| SCRIBE | Haiku | 8,000–20,000 | 2,000–5,000 | $0.005–0.013 |
| Total (1 attempt) | Mixed | ~45,000–117,000 | ~11,000–34,000 | $0.82–2.54 |

FORGE’s multi-turn loop is bounded by forge.max_iterations (default: 35). An unbounded interactive session has no equivalent guarantee — a model exploring dead ends has no automatic escape valve.

SCRIBE as token compression

SCRIBE maintains .pace/context/engineering.md — a structured, token-dense summary of the codebase that grows incrementally across the sprint. This document is the primary mechanism by which PACE avoids re-reading raw source files on every story.

On Day 1, FORGE reads source files directly to understand the codebase. On Day 2 and beyond, FORGE reads engineering.md instead.

Without SCRIBE (Day 2), FORGE must re-read the raw source to reconstruct a model of the codebase:

main.py (800 tokens) + models.py (1,200 tokens) + repository.py (900 tokens)
+ services.py (1,100 tokens) + tests/conftest.py (600 tokens) + ...
= ~12,000 tokens of raw source to understand the module map

With SCRIBE (Day 2), FORGE reads one document:

engineering.md (~2,500 tokens) = structured module map, interfaces,
test patterns, conventions — already synthesised

The sprint-level impact of this compression is substantial:

| Sprint day | Without SCRIBE (raw source reads) | With SCRIBE (engineering.md) | Saving |
| --- | --- | --- | --- |
| Day 1 | 12,000–50,000 tokens | 12,000–50,000 tokens (baseline) | — |
| Day 2 | 12,000–50,000 tokens | 2,000–4,000 tokens | ~85% |
| Day 3–30 | 12,000–50,000 per day | 2,000–4,000 per day | ~85% |
| Sprint total (30 days) | 360,000–1,500,000 tokens | 70,000–170,000 tokens | ~80% reduction |

SCRIBE updates engineering.md at the end of every day. It adds new modules, updated interfaces, and new patterns introduced in that day’s story. FORGE on Day N+1 starts with an accurate, token-dense model of the codebase — not a blank slate.
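The sprint-total arithmetic is simple to reproduce. Day 1 reads raw source in both cases; Days 2–30 read engineering.md instead (the table above rounds the high end to ~170,000):

```python
DAYS = 30
raw_low, raw_high = 12_000, 50_000   # raw source reads per day
doc_low, doc_high = 2_000, 4_000     # engineering.md read per day

# Without SCRIBE: raw reads every day of the sprint.
without = (raw_low * DAYS, raw_high * DAYS)

# With SCRIBE: raw reads on Day 1 only, engineering.md thereafter.
with_scribe = (raw_low + doc_low * (DAYS - 1),
               raw_high + doc_high * (DAYS - 1))

print(without)      # (360000, 1500000)
print(with_scribe)  # (70000, 166000)
```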

The analysis_model split — architectural rationale

The analysis_model configuration key exists because different agents are doing qualitatively different work, and model size should match task type — not be set uniformly for simplicity.

FORGE is doing creative work under constraints: it must generate novel code that compiles, passes tests, follows conventions, and satisfies acceptance criteria. This requires a capable model with strong reasoning. Reaching for a smaller model here trades real quality for marginal savings — the wrong trade.

GATE, SENTINEL, and CONDUIT are doing structured evaluation: given a rubric (acceptance criteria, security checklist, CI checklist) and evidence (test output, handoff note, CI logs), evaluate and produce a structured verdict. This is a retrieval and classification task, not a generative one. A smaller, faster model executes it reliably and precisely.

| Task type | Agent | Requires generative capability? | Recommended model |
| --- | --- | --- | --- |
| Code generation, TDD loop, self-correction | FORGE | Yes — creative, multi-turn | Sonnet or Opus |
| Story card generation from plan | PRIME | Moderate — structured extraction | Haiku or Sonnet |
| Criterion evaluation against test output | GATE | No — structured classification | Haiku |
| Security pattern matching and risk assessment | SENTINEL | No — checklist evaluation | Haiku |
| CI/CD configuration review | CONDUIT | No — checklist evaluation | Haiku |
| Context document synthesis | SCRIBE | Moderate — structured summarisation | Haiku or Sonnet |

Switching PRIME, GATE, SENTINEL, CONDUIT, and SCRIBE to Haiku reduces per-run cost by 40–50% with no measurable quality difference on analytical tasks. This is not a trade-off — it is the correct model assignment for the task type.

Structured handoffs as dense information

YAML artifacts (handoff.yaml, gate-report.yaml, sentinel-report.yaml) encode decisions in a structured, token-dense format. Each downstream agent reads a precise machine-readable summary rather than a prose narrative containing the same information at three times the token cost.

Conversational format — token-expensive:

The tests all passed. I ran pytest -v --tb=short and saw 12 tests pass
with 0 failures. The coverage was 84%. The service correctly handles
the error case. The integration test also passed.

Structured YAML — token-efficient:

gate_decision: SHIP
test_result:
  exit_code: 0
  tests_run: 12
  tests_passed: 12
  coverage_pct: 84
criteria_evaluation:
  - criterion: "pytest exits 0"
    verdict: PASS
    evidence: "exit code 0, 12/12 tests passed"

Both convey the same information. The YAML version uses approximately 60% fewer tokens and is unambiguous — SENTINEL does not need to parse intent from natural language; it reads a typed field.

The SCOPE pre-check

Before FORGE runs on a story, a lightweight Haiku call (~$0.005) predicts the FORGE cost based on the story’s acceptance criteria count and complexity indicators. If the predicted cost exceeds cost_control.max_story_cost_usd, the story is split automatically before FORGE ever starts.

This prevents the most expensive failure mode in the pipeline: a runaway FORGE loop on an oversized story that exhausts max_iterations without shipping, consuming $5–10 on a story that should have been split into two $1.50 stories. The SCOPE check costs half a cent and saves the expensive case entirely.
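The shape of that decision can be sketched as below. The heuristic weights and function names here are illustrative assumptions, not PACE's actual cost model; only the threshold key (cost_control.max_story_cost_usd) comes from this page.

```python
# Hypothetical SCOPE-style pre-check: predict FORGE cost from story
# size indicators and split oversized stories before FORGE runs.
def predict_forge_cost_usd(criteria_count: int, files_touched: int) -> float:
    """Illustrative cost model: more acceptance criteria and more files
    mean more FORGE iterations, each with input + output token spend."""
    return round(0.40 + 0.25 * criteria_count + 0.15 * files_touched, 2)

def scope_check(criteria_count: int, files_touched: int,
                max_story_cost_usd: float = 3.50):
    predicted = predict_forge_cost_usd(criteria_count, files_touched)
    if predicted > max_story_cost_usd:
        return ("SPLIT", predicted)    # split before FORGE ever starts
    return ("PROCEED", predicted)

print(scope_check(3, 2))    # small story proceeds
print(scope_check(12, 8))   # oversized story is split
```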

Cost-bounded retries

forge.max_iterations (default: 35) is a hard stop on the FORGE tool-call loop. If FORGE does not invoke complete_handoff within 35 LLM turns, the pipeline fails with a structured error. This guarantees a bounded worst-case spend per story regardless of how the model behaves.

An unbounded interactive session has no equivalent guarantee. A model that gets stuck exploring dead ends has no automatic escape valve — the user must notice and intervene.
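A minimal sketch of such a hard-capped loop, with illustrative names (only forge.max_iterations and complete_handoff come from this page):

```python
MAX_ITERATIONS = 35  # forge.max_iterations default

class IterationBudgetExceeded(RuntimeError):
    """Structured failure raised when the loop budget is exhausted."""

def run_forge_loop(step):
    """Call step(i) until it signals completion; fail with a structured
    error instead of looping forever."""
    for i in range(1, MAX_ITERATIONS + 1):
        if step(i) == "complete_handoff":
            return i
    raise IterationBudgetExceeded(
        f"no complete_handoff within {MAX_ITERATIONS} turns"
    )

# A toy step that completes on turn 4:
turns = run_forge_loop(lambda i: "complete_handoff" if i == 4 else "tool_call")
print(turns)  # 4
```

Because the loop count is capped and each iteration's spend is bounded by the model's context and output limits, the worst-case cost per story is a known product, not an open-ended risk.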

PROGRESS.md tracks the cost of every attempt including retries and explicitly records “wasted on retries,” so you can see exactly how much budget was consumed by failed attempts versus successful ones. That visibility makes retries a measurable cost line, not a hidden variable.

Prompt caching — system prompt reuse

Anthropic’s prompt cache stores the key-value state of a content block server-side for ~5 minutes. When the next call sends an identical prefix, the KV state is loaded from cache instead of being recomputed — at 10% of the normal input-token price.

PACE applies cache_control: ephemeral to the system prompt in both complete() and chat(). The system prompt is the largest and most stable part of any call — it carries the product description, source directory rules, tool definitions, and calibration instructions. It is identical across every iteration of the same agent loop.
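The request shape follows Anthropic's documented prompt-caching format: the system prompt is sent as a content block tagged with cache_control so repeat calls read it from cache. The sketch below builds that payload as a plain dict; the prompt text is illustrative, and PACE's internal wrapper functions are not shown.

```python
# Illustrative system prompt; in PACE this carries the product
# description, source directory rules, tool definitions, and
# calibration instructions, and is identical across loop iterations.
SYSTEM_PROMPT = "You are FORGE. Implement the story via TDD..."

def build_request(messages):
    """Assemble a Messages API payload with a cacheable system block."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 4096,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks this block as cacheable (~5 minute TTL).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }

req = build_request([{"role": "user", "content": "Implement story-3."}])
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Because the cached prefix must be byte-identical, keeping the system prompt stable across iterations is what makes the cache hit rate high in the agent loops.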

| Call pattern | Cache benefit |
| --- | --- |
| FORGE tool loop (up to 35 iterations) | Iterations 2–35 read the system prompt from cache — ~90% discount on those tokens |
| SCRIBE tool loop (up to 30 iterations) | Same — iterations 2–30 hit cache |
| PLANNER (N stories × complete()) | Stories 2–N read the same system prompt from cache |
| GATE, SENTINEL, CONDUIT, PRIME | Single-turn; benefit applies only on the compact-retry path |

Cache pricing:

| Token type | Billed at | Notes |
| --- | --- | --- |
| input_tokens | 100% of input rate | Non-cached new tokens |
| cache_read_input_tokens | 10% of input rate | KV state served from cache |
| cache_creation_input_tokens | 125% of input rate | One-time write premium |

Break-even is 2 calls with the same system prompt. The write premium of the first call is recovered on the second.
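The break-even arithmetic, in percent of one uncached send of the system prompt:

```python
# Cache pricing from the table above, as percent of the base input rate.
UNCACHED_PCT = 100  # input_tokens
WRITE_PCT = 125     # cache_creation_input_tokens (one-time premium)
READ_PCT = 10       # cache_read_input_tokens

def cost_pct(calls: int, cached: bool) -> int:
    """Cost of sending the same system prompt `calls` times."""
    if not cached:
        return calls * UNCACHED_PCT
    # First call writes the cache; subsequent calls read it.
    return WRITE_PCT + (calls - 1) * READ_PCT

print(cost_pct(2, cached=False))  # 200
print(cost_pct(2, cached=True))   # 135 -> caching wins from the second call
```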

PACE’s spend tracker separates all three token types and applies the correct pricing to each. The per-run summary() output shows cache columns when cache tokens are present:

[PACE] API usage this run:
claude-sonnet-4-6: 8,120 in + 31,200 out + 112,500 cache_read + 14,200 cache_create = $0.6401
Run total: $0.6401
Cache savings this run: $0.3038 vs uncached

The cache_savings_usd figure is the difference between what those tokens would have cost at full input price versus the actual cache-read price — a concrete measure of what caching is saving per run.

Streaming — real-time output

Both complete() and chat() use the Anthropic streaming API (messages.stream()). Tokens are printed to stdout as they are generated rather than after the entire response is assembled.

Cost impact: zero. Token count and billing are identical to non-streaming calls. The value is operational:

  • Local runs: FORGE output (code, tool calls, reasoning) is visible immediately — no 30–90 second silent wait
  • CI logs: GitHub Actions, GitLab, and Jenkins all show streaming stdout in real time, making it easy to spot a stuck iteration before it burns through the remaining budget
  • Early error detection: a hallucinated file path or wrong tool call is visible after the first few tokens, not after the full iteration completes

FORGE context management

While PACE’s stateless architecture eliminates session-level context accumulation, FORGE itself runs a multi-turn tool-use loop within a single story. Without management, that per-story context grows monotonically — stale file reads, repeated test outputs, and write echoes compound across iterations. From v3.1.0, PACE applies four progressive stages to bound this growth.

The three growth drivers

A representative mid-sprint trace showed ~69,000 tokens in FORGE’s history at implementation start, of which ~90% was noise:

| Driver | Tokens | Share |
| --- | --- | --- |
| Stale file reads (files already rewritten) | ~31,000 | 45% |
| Repeated run_bash outputs (same command, multiple runs) | ~16,000 | 23% |
| write_file echo (full file content in tool result) | ~15,000 | 22% |
| Signal content (acceptance criteria, live results) | ~7,000 | 10% |

Savings by stage

| Stage | Mechanism | Tokens saved | Cumulative reduction |
| --- | --- | --- | --- |
| 1 | Eviction + dedup + suppression (always on) | ~47,000 | ~68% |
| 2 | Haiku compression after RED phase (compression_model) | ~20,000 | ~97% |
| 3 | Pre-seeded file hints (file_hints_enabled) | reduces exploration phase | fewer tokens before GREEN |
| 4 | Forked subcontext (fork_enabled, Phase A) | ~30,000 (implementation phase) | fresh implementation baseline |

Stage 1 requires no configuration. Stages 2–4 are opt-in via forge: keys in pace.config.yaml.
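An illustrative pace.config.yaml fragment enabling the opt-in stages. The key names are those used on this page; the values shown are examples, not documented defaults (apart from max_iterations: 35).

```yaml
forge:
  max_iterations: 35          # hard stop on the tool-call loop
  compression_model: haiku    # Stage 2: compress history after the RED phase
  file_hints_enabled: true    # Stage 3: pre-seed file hints
  fork_enabled: true          # Stage 4: forked subcontext (Phase A)
```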

See FORGE Context Efficiency for configuration, monitoring, and per-stage details.


Summary: PACE vs a monolithic session

| Property | Monolithic AI session | PACE pipeline |
| --- | --- | --- |
| Context growth | Grows with every turn (unbounded) | Bounded per agent call |
| Context reset | Never (within session) | Every agent starts fresh |
| Codebase re-reading | Every session, every turn | Day 1 only; SCRIBE compression from Day 2 |
| Retry cost | Compounds (prior turns remain in context) | Isolated — retry starts fresh |
| Cost per feature | Unpredictable ($2–20+) | Predictable ($1.50–3.50 typical) |
| Cost visibility | None (session-level at best) | Per-agent, per-story, with retry breakdown |
| Worst-case spend | Unbounded (no escape valve) | Bounded by max_iterations × model cost |
| Prompt caching | Not applicable | System prompt cached in all agent loops; ~90% discount on repeated tokens |
| Streaming output | Varies by tool | All agents stream tokens in real time to stdout and CI logs |
| FORGE message-history management | No (unbounded growth within session) | Stage 1: automatic; Stages 2–4: configurable via forge: keys |