A small eval loop for the humanizer skill
A case study in using Caliper to evaluate blader/humanizer, tighten voice calibration, and turn the improvement into an upstream contribution with regression coverage.
Archive
Essays, notes, and working models for AI safety, agentic systems, and trustworthy infrastructure.
Tribal knowledge encoded as an AI skill is still just text until you evaluate it. Ablation baselines, routing regression tests, trajectory autoraters, and the gotchas flywheel keep encoded knowledge from rotting.
Everyone is worried about AI reading things it shouldn't. That's the wrong threat model. The problem starts after the agent reads.
Skills bundle instructions, scripts, and MCP servers into a single installable package. That convenience is also the attack surface.
A few words on what this space is about.