Gil Raitses
Measurement, evaluation and feedback loops
How I build the infrastructure that records what happened, judges it against an explicit standard and feeds corrections back, and how that makes a core agent reliably better over time.
Two programs show the pattern on two different subjects. Central Casting applies it to agent behavior, turning a run into an inspectable record with quality gates. pax applies it to a research objective, learning a metric from data and validating the result against ground truth. The same loop is what lets a coding agent improve under measurement, so each change rests on evidence.
Central Casting: measurement and audit of agent behavior
Central Casting runs agent work through a structured system. Every real state change becomes a checkpoint, so a run reads as a labeled event stream an evaluator can query. An alignment check compares the structured record against the human-readable views and flags divergence, which acts as a quality gate. Corrections land as superseding records on top of earlier ones, so the history stays auditable and a fix is recorded where the next session reads it.
Where it maps: this is evaluation infrastructure for agents. Checkpoints are the event log, the alignment check is an automated gate and superseding records are the feedback that prevents a repeat. The same surface lets a team see where an agent lost state, exceeded its role or diverged from intent.
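A minimal sketch of that record, for illustration only. The `Checkpoint` schema, the `supersedes` field and both function names are assumptions made here, not Central Casting's actual data model. It shows an append-only log where a correction supersedes an earlier checkpoint without deleting it, and an alignment check that reports where the structured record and a readable view disagree.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    """One recorded state change in an agent run (hypothetical schema)."""
    id: int
    key: str                          # which piece of state changed
    value: str                        # the new value
    supersedes: Optional[int] = None  # id of the checkpoint this one corrects


def current_state(log: list[Checkpoint]) -> dict[str, str]:
    """Resolve the log into the live view, skipping superseded checkpoints."""
    superseded = {c.supersedes for c in log if c.supersedes is not None}
    state: dict[str, str] = {}
    for c in log:  # the log is append-only, so later entries win
        if c.id not in superseded:
            state[c.key] = c.value
    return state


def alignment_check(log: list[Checkpoint], readable_view: dict[str, str]) -> list[str]:
    """Return the keys where the structured record and the readable view disagree."""
    structured = current_state(log)
    keys = structured.keys() | readable_view.keys()
    return sorted(k for k in keys if structured.get(k) != readable_view.get(k))


log = [
    Checkpoint(1, "branch", "feature/login"),
    Checkpoint(2, "tests", "passing"),
    Checkpoint(3, "tests", "failing", supersedes=2),  # correction; entry 2 stays in the log
]
readable_view = {"branch": "feature/login", "tests": "passing"}  # stale summary

divergent = alignment_check(log, readable_view)  # keys a gate would flag
```

Because the superseded checkpoint is never removed, an evaluator can still ask what the run believed before the correction, which is what keeps the history auditable.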
pax: a measurement-to-evaluation loop on real data
pax is a bi-objective routing system for pedestrians. It learns a perceptual-stress signal from New York City camera data through a computer-vision pipeline, then scores every candidate route on an explicit metric that balances geometric distance against that measured stress, with a single tunable weight. It validates the result against an independent movement estimate, so the metric is checked against ground truth before it is trusted.
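One way to picture a metric with a single tunable weight is a convex blend of the two objectives. The sketch below assumes that form, assumes both inputs are already normalized to [0, 1], and uses hypothetical names (`Route`, `route_cost`, `best_route`); pax's actual scoring function may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    distance: float  # geometric length, assumed normalized to [0, 1]
    stress: float    # learned perceptual stress, assumed normalized to [0, 1]


def route_cost(route: Route, weight: float) -> float:
    """Blend distance and stress; weight=0 ignores stress, weight=1 ignores distance."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * route.distance + weight * route.stress


def best_route(routes: list[Route], weight: float) -> Route:
    """Pick the candidate with the lowest blended cost."""
    return min(routes, key=lambda r: route_cost(r, weight))


candidates = [
    Route("direct", distance=0.40, stress=0.90),
    Route("detour", distance=0.60, stress=0.20),
]

shortest = best_route(candidates, weight=0.0)  # distance only
balanced = best_route(candidates, weight=0.5)  # equal say for both objectives
```

With these made-up numbers the distance-only setting prefers the direct route, while an even blend prefers the longer but calmer detour, which is the trade-off the weight exists to expose.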
Where it maps: this is the full evaluation loop in miniature. A signal is measured from raw data, a metric scores candidates, an external source validates the score and the weight is tuned from evidence. That is the shape of an offline evaluation harness for agent outputs.
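Tuning the weight from evidence can be sketched as a small grid search scored against the independent source. Everything here is an assumption for illustration: the blended cost form, the agreement measure, the function names and the toy cases. It is not pax's validation procedure.

```python
def cost(distance: float, stress: float, weight: float) -> float:
    """Assumed convex blend of the two objectives."""
    return (1.0 - weight) * distance + weight * stress


def agreement(cases, weight: float) -> float:
    """Fraction of cases where the lowest-cost candidate is the observed one."""
    hits = 0
    for candidates, observed in cases:
        predicted = min(candidates, key=lambda name: cost(*candidates[name], weight))
        hits += predicted == observed
    return hits / len(cases)


def tune_weight(cases, grid):
    """Pick the grid weight with the highest agreement (first wins on ties)."""
    return max(grid, key=lambda w: agreement(cases, w))


# Each case: {route name: (distance, stress)} plus the route the
# independent estimate favors. All values are invented for the example.
cases = [
    ({"a": (0.40, 0.90), "b": (0.60, 0.20)}, "b"),
    ({"a": (0.30, 0.50), "b": (0.70, 0.40)}, "a"),
    ({"a": (0.50, 0.80), "b": (0.55, 0.30)}, "b"),
]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

best = tune_weight(cases, grid)
```

In a real harness the weight would be chosen on one set of cases and reported on a held-out set, so the validation stays independent of the tuning.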
Applied to the Cursor ecosystem
Put together, the two patterns make the core agent reliably better over time in four steps.
- Instrument every agent action as a structured event. A run becomes an inspectable record of state changes that an evaluator can query.
- Define quality gates that hard-fail on divergence. The alignment check flags any mismatch between the structured record and the readable views, the same way the revision-gate linter I run on every public artifact hard-fails on a prose violation before the artifact ships.
- Evaluate against a measured objective and validate against ground truth. The pax pattern of a learned metric, candidate scoring and an independent check transfers directly to scoring agent outputs against quality signals.
- Close the loop with corrections that stick. A regression is caught, recorded as a superseding decision and prevented next time, and a weight is retuned from new evidence.
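The hard-fail gate in the list above can be sketched as a regression check between a baseline run and a candidate run. The metric names, the tolerance and the `GateFailure` exception are hypothetical, chosen only to show a gate that raises instead of warning.

```python
from __future__ import annotations


class GateFailure(Exception):
    """Raised when a quality gate fails, so the pipeline stops instead of warning."""


def regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                    tolerance: float = 0.01) -> None:
    """Hard-fail if any tracked score is missing or drops by more than `tolerance`."""
    failures = []
    for name, old in baseline.items():
        new = candidate.get(name)
        if new is None:
            failures.append(f"{name}: missing from candidate run")
        elif new < old - tolerance:
            failures.append(f"{name}: {old:.2f} -> {new:.2f}")
    if failures:
        raise GateFailure("; ".join(failures))


# Invented scores for illustration only.
baseline = {"task_success": 0.82, "state_retained": 0.95}
improved = {"task_success": 0.85, "state_retained": 0.95}
regressed = {"task_success": 0.85, "state_retained": 0.80}

regression_gate(baseline, improved)  # passes silently

blocked = ""
try:
    regression_gate(baseline, regressed)
except GateFailure as err:
    blocked = str(err)  # the message a superseding record could carry forward
```

Raising an exception means a change that improves one score while regressing another cannot ship unnoticed, and the failure message names the exact signal that dropped.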