Skills Without Evals Are Just Optimism
Tribal knowledge encoded as an AI skill is still just text until you evaluate it. Ablation baselines, routing regression tests, trajectory autoraters, and the gotchas flywheel are the evaluation infrastructure that keeps encoded knowledge from rotting.
ai-safety agents evaluation skills