Cut the cost of long agent runs

Externalize agent state to disk so a slow or full worker respawns from a few KB, and keep planning on a separate cyclable chat.

The operating layer is Cursor. The orchestrator runs in a separate chat. This page is the reading version of the event slide deck.

The cost problem

A single long-lived agent re-sends its whole growing context every turn, so cumulative input tokens, which dominate the bill, grow faster than linearly over a session: if each turn adds a similar amount of context, the total grows roughly with the square of the turn count. Orchestration makes it worse, because planning and reconciliation are token-heavy reasoning that never touches code yet still burns premium in-editor turns. Central Casting attacks both by putting the durable state on disk and moving the planning off the metered surface.
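A back-of-envelope model makes the growth concrete. The sketch below assumes every turn adds a fixed number of tokens to the context and that the whole context is re-sent as input each turn. The function names and all input figures (2,000 tokens per turn, 1,500 tokens of hydrated state, a respawn every 10 turns) are made up for illustration, and prompt caching is ignored.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when the full, growing context is re-sent every turn."""
    # Turn k re-sends k * tokens_per_turn tokens, so the total is an arithmetic series.
    return tokens_per_turn * turns * (turns + 1) // 2


def bounded_input_tokens(turns: int, tokens_per_turn: int,
                         respawn_every: int, hydrate_tokens: int = 0) -> int:
    """Same workload, but the worker is respawned every `respawn_every` turns.

    Each fresh worker carries `hydrate_tokens` of state read from disk on every
    turn, plus whatever context it has accumulated since its own respawn.
    """
    full_cycles, remainder = divmod(turns, respawn_every)
    per_cycle = (respawn_every * hydrate_tokens
                 + cumulative_input_tokens(respawn_every, tokens_per_turn))
    tail = remainder * hydrate_tokens + cumulative_input_tokens(remainder, tokens_per_turn)
    return full_cycles * per_cycle + tail


if __name__ == "__main__":
    long_lived = cumulative_input_tokens(turns=60, tokens_per_turn=2_000)
    bounded = bounded_input_tokens(turns=60, tokens_per_turn=2_000,
                                   respawn_every=10, hydrate_tokens=1_500)
    print(f"one long thread:        {long_lived:,} input tokens")
    print(f"respawn every 10 turns: {bounded:,} input tokens")
    print(f"reduction:              {1 - bounded / long_lived:.0%}")
```

With these invented inputs the long thread bills 3,660,000 input tokens and the respawning worker 750,000. The point is the shape of the curve rather than the specific figures: resetting the context turns one large quadratic term into several small ones.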

The .cca folder: state on disk

The durable truth lives in a .cca folder, one lane per work area. The chat is disposable; the folder is the source of truth.

.cca/
  O1 .. O16/                  a lane, the local orchestrator's home
    current_step.yaml         lane step state
    MMDD_task/                a dated task home
      A1 .. An/               a worker actor in the lane
        drafts/               work the worker produced
        current_state.yaml    the worker's externalized state

A worker's state is its current_state.yaml plus its drafts/. Nothing irreplaceable lives in the chat. That single property is what makes a clean kill and respawn possible.
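To make the layout concrete, here is a minimal sketch that creates one lane, one task home and one worker on disk. The exact nesting, the example names (`O1`, `0412_auth`, `A1`) and the YAML fields (`active_task`, `task`, `step`, `next_action`) are assumptions for illustration; the source names the files and folders but not their contents.

```python
from pathlib import Path
import tempfile


def scaffold_worker(root: Path, lane: str, task_home: str, actor: str) -> Path:
    """Create .cca/<lane>/<task_home>/<actor>/ with a drafts/ folder and state files."""
    lane_dir = root / ".cca" / lane
    worker_dir = lane_dir / task_home / actor
    (worker_dir / "drafts").mkdir(parents=True, exist_ok=True)

    # Lane-level step state. The field name is hypothetical.
    (lane_dir / "current_step.yaml").write_text(
        f"active_task: {task_home}\n", encoding="utf-8"
    )
    # Worker-level externalized state. The field names are hypothetical.
    (worker_dir / "current_state.yaml").write_text(
        "task: add login rate limit\nstep: 3\nnext_action: write tests\n",
        encoding="utf-8",
    )
    return worker_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scaffold_worker(Path(tmp), lane="O1", task_home="0412_auth", actor="A1")
        # Print the resulting tree, relative to the temporary root.
        for path in sorted(Path(tmp).rglob("*")):
            print(path.relative_to(tmp))
```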

Hydration: the kill and respawn

  1. Worker running slow, or context full.
  2. Checkpoint to disk: current_state.yaml plus drafts.
  3. Kill: context goes to zero.
  4. Fresh worker: empty context.
  5. Hydrate and resume: read current_state.yaml and the drafts it needs.

The same loop, with a fresh agent.

The fresh worker reads a few KB of state, not the prior thread. There is no replay tax and no stale or cross-task context carried in, which is what "without leaking context" means in practice.
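The checkpoint, kill and hydrate steps can be sketched in a few lines. This is a toy under stated assumptions: the worker is a plain Python object, state is a flat set of `key: value` strings written by hand (a real setup would more likely use a YAML library), drafts are left out, and `Worker`, `checkpoint` and `hydrate` are hypothetical names rather than part of Cursor or any published API.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Worker:
    state: dict[str, str] = field(default_factory=dict)   # small, durable facts
    transcript: list[str] = field(default_factory=list)   # the growing chat context


def checkpoint(worker: Worker, home: Path) -> int:
    """Write only the durable state to current_state.yaml; return bytes written."""
    home.mkdir(parents=True, exist_ok=True)
    # Single-line string values only; this is not a general YAML writer.
    text = "".join(f"{key}: {value}\n" for key, value in worker.state.items())
    (home / "current_state.yaml").write_text(text, encoding="utf-8")
    return len(text.encode("utf-8"))


def hydrate(home: Path) -> Worker:
    """Build a fresh worker from disk. The old transcript is never read."""
    state: dict[str, str] = {}
    for line in (home / "current_state.yaml").read_text(encoding="utf-8").splitlines():
        key, _, value = line.partition(": ")
        state[key] = value
    return Worker(state=state)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        home = Path(tmp) / ".cca" / "O1" / "0412_auth" / "A1"
        old = Worker(state={"step": "3", "next_action": "write tests"},
                     transcript=["...long thread..."] * 500)
        size = checkpoint(old, home)
        del old                   # kill: the context goes to zero
        fresh = hydrate(home)     # respawn: read a few bytes, not the thread
        print(f"{size} bytes of state, {len(fresh.transcript)} transcript entries carried over")
```

The design choice worth noticing is that `hydrate` has no access to the old transcript at all, so stale context cannot leak in even by accident.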

The external orchestrator, and cycling it

Planning, routing and state reconciliation run in a separate chat called O0, outside the editor. Because O0 treats the .cca workspace as the source of truth and reconciles from it every turn, the chat itself holds nothing irreplaceable. When a chat slows or fills, you start a fresh one, reload the portable instruction layer and reconcile from the workspace.

  1. O0 chat slows, or its context is full.
  2. Fresh O0 chat: load the v5 instruction layer.
  3. Reconcile from .cca: the workspace is the truth.
  4. Resume routing.

What the v5 O0 rules mean, in plain terms

The cycling works because of four rules the orchestrator follows. They are written tersely in the system, so here they are for a human reader.

  1. The workspace is the truth, not the chat. The .cca bundle is the primary state surface. O0 reads it every turn and treats it as authoritative over anything it remembers.
  2. Reconcile before acting. Before O0 interprets any worker return or operator request, it compares that input against the latest workspace state. If they disagree, it stops and writes a reconciliation surface, then proceeds once they agree.
  3. No stale memory. O0 refuses to rely on inferred continuity or prior-phase assumptions. This is the rule that makes the chat disposable, because nothing important is allowed to live only in the chat.
  4. The instruction layer is portable. The rules load into any fresh chat as a single block, so a new O0 is the same O0 after one hydration turn.

Put together, a slow or full O0 chat is not a loss. You open a new one, paste the instruction layer, point it at the workspace and it picks up exactly where the last one was.
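Rule 2, reconcile before acting, has the most mechanical shape of the four, so here is a minimal sketch of it. Everything named here is hypothetical: the `reconcile` function, the dict-shaped state and the `reconciliation.txt` file standing in for the "reconciliation surface". The source does not specify what that surface looks like.

```python
from pathlib import Path
import tempfile


def reconcile(workspace_state: dict[str, str],
              incoming: dict[str, str],
              lane_dir: Path) -> bool:
    """Compare an incoming worker return against the workspace state.

    Returns True when O0 may proceed. On disagreement, writes the conflicting
    keys to a reconciliation file in the lane and returns False.
    """
    # The workspace is authoritative: only keys it already holds can conflict.
    conflicts = {
        key: (workspace_state[key], value)
        for key, value in incoming.items()
        if key in workspace_state and workspace_state[key] != value
    }
    if not conflicts:
        return True

    lines = [
        f"{key}: workspace={ws!r} incoming={inc!r}"
        for key, (ws, inc) in sorted(conflicts.items())
    ]
    lane_dir.mkdir(parents=True, exist_ok=True)
    (lane_dir / "reconciliation.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lane = Path(tmp) / ".cca" / "O1"
        on_disk = {"step": "3", "next_action": "write tests"}
        print(reconcile(on_disk, {"step": "3"}, lane))   # agrees with the workspace
        print(reconcile(on_disk, {"step": "5"}, lane))   # disagrees, so O0 stops
        print((lane / "reconciliation.txt").read_text(encoding="utf-8"), end="")
```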

What it costs, on a real program

These numbers come from one real five-week run of this system, read from its own step log.

7 active lanes
14 task homes
175 checkpoints
~12 O0 chats cycled

Roughly 70 to 85 percent fewer input tokens on the metered surface.

Two levers drive it. The durable claim is the token reduction; the dollar figures are only an illustration.

| Lever | What changes | Estimated effect |
| --- | --- | --- |
| External orchestrator | About 12 context-filling O0 chats ran on a flat-fee chat subscription, off the metered in-editor turns. A slowing chat was cycled in one hydration turn. | On the order of 12 to 24M input tokens moved off the metered surface, near $40 to $70 at $3 per 1M, replaced by about $20 a month flat. |
| Bounded workers and respawn | Each worker ran scoped to one task home, near 10 to 25K tokens of context, where one monolithic thread grows to 80 to 150K. | About 70 to 80 percent fewer input tokens per task, and a slow worker costs one hydration turn to replace. |

Estimates depend on the model, the pricing mode and prompt caching. Caching lowers the absolute dollars, and the structural win, bounded context plus respawn from disk, still cuts both cached and uncached load.
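The dollar range quoted for the external orchestrator is simple arithmetic on the token estimate. The sketch below reproduces it from the figures given above (12 to 24M input tokens at $3 per 1M); the function name is illustrative, and the flat rate with no caching discount is a simplifying assumption.

```python
def metered_cost_usd(input_tokens: int, usd_per_million: float) -> float:
    """Cost of input tokens at a flat per-million rate, with no caching discount."""
    return input_tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    for tokens in (12_000_000, 24_000_000):
        print(f"{tokens:,} input tokens -> ${metered_cost_usd(tokens, 3.0):.2f}")
```

That works out to $36 and $72, which the page rounds to "near $40 to $70".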

Demo walkthrough

  1. Show the cost. One long agent re-bills a growing context every turn, and planning burns premium turns that never touch code.
  2. Open the .cca folder. State is externalized to disk per lane and per worker, so the chat is disposable.
  3. Kill and respawn. Checkpoint a worker, kill it, spawn a fresh one and watch it hydrate from a few KB of state.
  4. Cycle the orchestrator. Replace a slow O0 chat with a fresh one that reconciles from the workspace in a single turn.
  5. Show the number. The token reduction, anchored to the real five-week run above.
