What I got wrong about skill evals

Notes from a second pass after Berkeley and Anthropic.

Agent benchmarks and skill evals are different problems with different consumers, and most published writing is about the first. Benchmarks rank models on shared tasks (SWE-bench, WebArena, GAIA, Terminal-Bench, the rest of the leaderboard set) so research teams can compare across papers. A skill eval is the other thing: checking that a given skill, running against a given model and source, keeps producing the right behavior on a cadence. The team that needs the second one is whoever has glue running in production.

I built one for browser-skill evaluation. Tasks live in strict YAML. Each run scores three independent axes: outcome, behavior, and cost. A free static lint phase runs first, then a paid behavioral phase, both against synthetic Flask fixtures rather than real targets. The runner uses a cheap model by default and falls back to a stronger one when results disagree.

After reading Berkeley's exploit work from April and Anthropic's published methodology, I think parts of what I built are wrong. What follows is what I'd change.

The first surprise: outcome alone is incomplete

I started by treating skill evals as small private versions of OpenAI evals or Inspect AI: feed in the skill, feed in the model, check whether the output matches. That framing breaks down quickly once you watch real runs.

A skill in the agent-glue sense is a behavior, not just an answer. The model reads structured instructions, picks tools, sequences them, and produces an outcome. A skill that produces the right answer through fifty wrong tool calls is broken: expensive, slow, and a strong predictor of brittleness when the source drifts. A skill that produces a slightly wrong answer through three correct ones is fixable. An outcome-only grader treats both as the same failure, and that's not the diagnostic you want.

So I added two more axes: cost (total tool calls, tokens, wall-clock) and behavior (patterns the run did or didn't exhibit). Each scored separately. That part has held up.
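
A minimal sketch of what scoring the three axes separately could look like. The record fields, the per-tool call limits, and every tool name other than page.snap are hypothetical, chosen only to illustrate the shape; none of this is the framework's actual code.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """What one run produced. Field names are illustrative assumptions."""
    output: str
    tool_calls: list[str]
    tokens: int
    wall_clock_s: float


@dataclass
class Score:
    outcome: bool                    # did the final answer match?
    behavior_violations: list[str]   # patterns the run should not have shown
    cost: dict[str, float]           # raw cost numbers, never folded into a grade


def score_run(run: RunRecord, expected: str,
              max_calls_per_tool: dict[str, int]) -> Score:
    """Score each axis on its own; nothing is collapsed into one number."""
    outcome = run.output.strip() == expected.strip()
    violations = [
        f"{tool} called {run.tool_calls.count(tool)} times (limit {limit})"
        for tool, limit in max_calls_per_tool.items()
        if run.tool_calls.count(tool) > limit
    ]
    cost = {
        "tool_calls": len(run.tool_calls),
        "tokens": run.tokens,
        "wall_clock_s": run.wall_clock_s,
    }
    return Score(outcome, violations, cost)


# Right answer through fifty brute-force calls vs. a near miss in three calls.
wasteful = RunRecord("42", ["page.snap"] * 50, tokens=90_000, wall_clock_s=120.0)
near_miss = RunRecord("41", ["page.goto", "page.click", "page.extract"],
                      tokens=2_500, wall_clock_s=4.0)
limits = {"page.snap": 2}
print(score_run(wasteful, "42", limits))
print(score_run(near_miss, "42", limits))
```

An outcome-only grader would pass the first run and fail the second. With the axes kept apart, the first run shows up as a behavior and cost problem and the second as a small outcome problem, which is the distinction the paragraph above is after.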

The second surprise: the unit isn't the model

When something regresses, three things could have moved. The skill document might have rotted, the model might have shifted, or the source might have changed. Most frameworks score the joint and discard which factor moved, leaving you to reverse-engineer the cause from logs.

The triple (skill, model, environment), where the environment is the source the skill runs against, is the actual unit of evaluation. Pin two factors, vary the third, and you can read regressions by category. I didn't build this in. I would now.
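
A rough sketch of reading a regression by category, assuming each run is tagged with a version string for each of the three factors. The Triple type, the version strings, and the function names are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Version identifiers for one run. The format of each string is an assumption."""
    skill: str
    model: str
    environment: str


def changed_factors(baseline: Triple, candidate: Triple) -> list[str]:
    """Names of the factors that differ between two runs."""
    return [name for name in Triple._fields
            if getattr(baseline, name) != getattr(candidate, name)]


def attribute_regression(baseline: Triple, candidate: Triple,
                         baseline_passed: bool, candidate_passed: bool) -> str:
    """Attribute a pass-to-fail change to a single factor when exactly one moved."""
    if not baseline_passed or candidate_passed:
        return "no regression"
    moved = changed_factors(baseline, candidate)
    if not moved:
        return "flaky: nothing moved"
    if len(moved) == 1:
        return moved[0]
    return "ambiguous: " + ", ".join(moved)


base = Triple(skill="checkout@v3", model="model-2025-01", environment="fixture@v7")
print(attribute_regression(base, base._replace(model="model-2025-04"), True, False))
```

The attribution is only clean when exactly one factor moved. When two moved at once the function says so rather than guessing, which is the argument for pinning in the first place.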

Whether it's worth the engineering cost depends on how much each factor moves in your setup. It's cheap if you already pin the model version, expensive if you treat the model as a moving target.

What Berkeley made me reconsider

Berkeley's work is on benchmarks rather than skill evals, but the failure mode generalizes. In April, a team published exploits against eight major agent benchmarks. The mechanisms varied. The underlying pattern was the same in each: the agent's code runs in the same environment the evaluator inspects, and that shared access is enough to game the score.

A 10-line conftest.py "resolves" every SWE-bench Verified instance by intercepting pytest output. Pointing Chromium at a file:// URL embedded in a WebArena task config returns the gold answer directly. A trojanized /usr/bin/curl produces fake passing output on Terminal-Bench. None of these benchmarks measured what they thought they were measuring; they measured the agent's ability to manipulate the environment around the scorer.

None of those specific exploits applied to my fixture suite. I'm not running anything at benchmark scale, and my tasks aren't shaped the same way. The principle still scales down. A scorer running in the agent's process trusts code I don't trust. A fixture that puts a gold-answer file at a path the agent can list will eventually have that path listed. A model-graded eval that accepts unsanitized agent output is one prompt-injection away from the wrong verdict.

The Berkeley checklist (isolate the scorer from the agent, never publish ground truth, sanitize judge inputs, run adversarial baselines) reads like a list of things I should have built in from the start. I hadn't. If you're running your own skill evals, the post is worth reading directly. The specific mechanisms are the most useful part. Half of them you'll recognize as latent in your own setup.
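
One item on that list, never publishing ground truth, is cheap to check mechanically. The sketch below scans a directory the agent can read for any expected answer appearing verbatim. It assumes answers are plain strings and fixtures are text files; the function name, file names, and the answer value are made up for illustration. It does nothing about scorer isolation or judge-input sanitization.

```python
import tempfile
from pathlib import Path


def find_gold_leaks(agent_visible_root: Path,
                    gold_answers: list[str]) -> list[tuple[Path, str]]:
    """Return (file, answer) pairs where an expected answer appears verbatim."""
    leaks = []
    for path in sorted(agent_visible_root.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for gold in gold_answers:
            if gold and gold in text:
                leaks.append((path, gold))
    return leaks


# Hypothetical fixture layout: one served page, one answer file left in reach.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "static").mkdir()
    (root / "static" / "index.html").write_text("<h1>Orders</h1>", encoding="utf-8")
    (root / "expected.json").write_text('{"answer": "ORDER-7741"}', encoding="utf-8")
    leaks = [(p.name, gold) for p, gold in find_gold_leaks(root, ["ORDER-7741"])]

print(leaks)  # [('expected.json', 'ORDER-7741')]
```

A check like this only catches verbatim leaks. An answer that is derivable from visible files, or reachable through a URL in the task config, needs a different kind of review.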

What Anthropic made me reconsider

My framework treated required and forbidden patterns symmetrically. Any task could specify either kind, and both contributed to the score equally.

Anthropic's methodology flagged the design as a mistake without naming it. The line that stuck with me: "It's often better to grade what the agent produced, not the path it took." Capable models find valid unanticipated paths. A required-sequence matcher of [click, wait, extract] produces a false negative when the model finds a one-call shortcut. The matcher is the broken thing, not the model.

Forbidden patterns avoid this. A rule like count: {method: page.snap, gt: 2} flags the regression, the model falling back to a brute mechanical loop, without prescribing the path it should take instead. Required patterns still belong somewhere, but only when the path itself is the contract: confirming the model uses a newly documented primitive over the legacy fallback you're trying to deprecate, for example, where the path is the thing you want to grade. That's a narrow case.
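
A sketch of how a count rule like the one above might be evaluated. It assumes the rule has already been parsed from YAML into a dict and that the trace is a list of tool-call records with a method field. Both shapes are assumptions for illustration, as is page.extract.

```python
def rule_fires(trace: list[dict], rule: dict) -> bool:
    """True when the forbidden pattern occurred more often than the rule allows."""
    spec = rule["count"]
    occurrences = sum(1 for call in trace if call.get("method") == spec["method"])
    return occurrences > spec["gt"]


# The rule from the text: more than two page.snap calls is a regression.
rule = {"count": {"method": "page.snap", "gt": 2}}

shortcut = [{"method": "page.extract"}]    # an unanticipated one-call path
snap_loop = [{"method": "page.snap"}] * 4  # the brute mechanical loop

print(rule_fires(shortcut, rule))   # False
print(rule_fires(snap_loop, rule))  # True
```

The one-call shortcut passes because the rule never says what the path should be. A required-sequence matcher would have rejected it.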

I'd flip my default. Whether that's the right call for everyone is an open question. There are domains where the path matters for compliance reasons, and a forbidden-only stance would miss real regressions there. For the skill evals I've been writing, the asymmetry seems right.

Tensions I'm still sitting with

LLMs are non-deterministic, so a single-run score doesn't tell you much. The standard technique is to run each task k times and report both pass@k (succeeded at least once) and pass^k (succeeded in all k runs). For skill regression detection, pass^k is the one that matters because it penalizes inconsistency. The same problem bleeds into other domains: a 2025 study attempting to replicate 18 software-engineering papers that relied on commercial LLMs found only five were complete enough to rerun, and none of those five fully reproduced. The cost question is whether you can afford k=5 on every run when behavioral evals already burn API spend. My current guess is yes for release gates, no for every-PR runs, but I'm not confident in either direction.
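
To make the gap between the two metrics concrete, here is a small sketch using the definitions above. The expected-value helpers assume each run succeeds independently with the same probability p, which real runs only approximate.

```python
def pass_at_k(results: list[bool]) -> bool:
    """pass@k over one task's k runs: succeeded at least once."""
    if not results:
        raise ValueError("need at least one run")
    return any(results)


def pass_hat_k(results: list[bool]) -> bool:
    """pass^k over one task's k runs: succeeded every time."""
    if not results:
        raise ValueError("need at least one run")
    return all(results)


def expected_pass_at_k(p: float, k: int) -> float:
    """Chance of at least one success in k independent runs."""
    return 1 - (1 - p) ** k


def expected_pass_hat_k(p: float, k: int) -> float:
    """Chance of k successes in k independent runs."""
    return p ** k


runs = [True, True, False, True, True]
print(pass_at_k(runs), pass_hat_k(runs))  # True False

# A skill that succeeds 80% of the time, run with k=5:
print(round(expected_pass_at_k(0.8, 5), 5))   # 0.99968
print(round(expected_pass_hat_k(0.8, 5), 5))  # 0.32768
```

Under that independence assumption, a skill that works four times out of five looks nearly perfect on pass@5 and fails pass^5 about two times in three, which is why the second number is the one that exposes inconsistency.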

Adversarial baselines come next. A null agent that returns nothing, a random agent that picks tools at random, an echo agent that returns its input. If any of them passes a task, the task is broken. Berkeley showed FieldWorkArena's validator only checked whether the last message came from the assistant; any response scored perfectly. I never ran null, random, or echo agents against my own suite, and Berkeley is a good reason to start. Running them on a large suite every CI cycle isn't trivial, so I'd want a smaller cadence. Maybe weekly, maybe before signing a release. I haven't decided.
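
A minimal sketch of that check. It assumes an agent is a function from task input to output text and a grader is a function from (input, output) to pass or fail. That is a simplification, since real agents emit tool calls rather than strings, and all names here are hypothetical.

```python
import random
from typing import Callable

Agent = Callable[[str], str]
Grader = Callable[[str, str], bool]  # (task_input, agent_output) -> passed


def null_agent(task_input: str) -> str:
    """Returns nothing."""
    return ""


def echo_agent(task_input: str) -> str:
    """Returns its input unchanged."""
    return task_input


def make_random_agent(tool_names: list[str], seed: int = 0) -> Agent:
    """Emits five tool names picked at random, with a fixed seed for repeatability."""
    rng = random.Random(seed)

    def agent(task_input: str) -> str:
        return " ".join(rng.choice(tool_names) for _ in range(5))

    return agent


def broken_tasks(tasks: dict[str, tuple[str, Grader]],
                 baselines: dict[str, Agent]) -> list[tuple[str, str]]:
    """Every (task, baseline) pair where a baseline agent passed the task."""
    return [
        (task_name, baseline_name)
        for task_name, (task_input, grader) in tasks.items()
        for baseline_name, agent in baselines.items()
        if grader(task_input, agent(task_input))
    ]


baselines = {
    "null": null_agent,
    "random": make_random_agent(["click", "wait", "extract"]),
    "echo": echo_agent,
}
tasks = {
    # A deliberately lax grader: any non-empty response passes.
    "lax": ("What is the order total?", lambda _inp, out: len(out) > 0),
    # A stricter grader that checks the actual answer.
    "strict": ("What is the order total?", lambda _inp, out: out.strip() == "42"),
}
print(broken_tasks(tasks, baselines))  # [('lax', 'random'), ('lax', 'echo')]
```

An empty result is the goal. Any pair in the list points at a grader to fix, not at an agent to celebrate.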

Synthetic fixtures and real targets each cover something the other can't. Sealed local fixtures give reproducibility. Real-target smoke tests catch source drift. The trap I want to avoid is letting the fixture suite drift away from real-world behavior, where a skill passes against a Flask app that no longer resembles the actual source. Probably both, on different cadences. I don't have a clean rule for when fixtures stop being faithful enough.

The last open question I've been turning over is code-graders versus model-graders. Anthropic's published ordering, code first, model second, human for calibration, is right. Position bias, length bias, and agreeableness bias all show up systematically in LLM-as-judge implementations, and surveys quantify a dozen-plus biases beyond those. The biases are structural. For free-form outputs where code can't decide, model-graders are still the only option. The question is how aggressively to calibrate, and how often. I lean toward "more than I'm currently doing," but I haven't put numbers on it.

What I'm more confident about

The three-axis structure (outcome, behavior, cost, scored separately) matches how I want to read regressions. Collapsing them to a single number throws away signal you can't recover.

A static lint phase before any behavioral run is worth the engineering cost. Most of what breaks in a skill is the document itself rotting (renamed tools, removed primitives, examples that no longer parse), and lint catches that for free. Save the API spend for what lint can't see.
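
One lint rule of this kind, sketched under two assumptions: the skill document references tools as dotted names in backticks, and there is a set of currently valid tool names to check against. The convention, the regex, and the tool names other than page.snap are hypothetical.

```python
import re

# Matches backticked dotted names such as `page.click`. The convention is assumed.
TOOL_REF = re.compile(r"`([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+)`")


def lint_skill_doc(doc: str, known_tools: set[str]) -> list[str]:
    """Report every backticked tool reference that is not in the known set."""
    findings = []
    for lineno, line in enumerate(doc.splitlines(), start=1):
        for name in TOOL_REF.findall(line):
            if name not in known_tools:
                findings.append(f"line {lineno}: unknown tool `{name}`")
    return findings


doc = "Use `page.click` then `page.snapshot`.\nFall back to `page.snap`."
print(lint_skill_doc(doc, known_tools={"page.click", "page.snap"}))
# ['line 1: unknown tool `page.snapshot`']
```

No model call is involved, so a rule like this can run on every commit. It catches renamed and removed tools but says nothing about whether the instructions still make sense, which is what the behavioral phase is for.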

Store traces on day one. The first time a regression takes hours to reproduce because the trace was gone, you'll wish you had.

Wire CI in the moment a skill is in production. Opt-in local works while the framework is exploratory. Once a silent regression can flow downstream, an eval suite that lives on someone's laptop stops being a safety net.

Where this leaves me

I built a v0 that solved the problem I knew about. Most of the design held up. The gaps were real, and weren't surprises in retrospect. Every gap was something I'd read about in the abstract before the framework existed. I didn't internalize the lessons until the suite was running and the first regression took hours instead of minutes.

If you're working on the same problem, the most useful thing I can point you at isn't my framework. It's the source material. The Berkeley exploit post is short and specific. Anthropic's methodology is denser but worth the time. OpenAI's skill-eval guide and Minko Gechev's Skill Eval posts are the closest published parallels to what this post describes; start there if you want concrete CI shapes.

One operational note. Cloud-hosted trace tools, OpenAI's Trace Grading being the most visible, don't work for internal-only deployments where traces can't leave the network. The trace layer is small to build in-house. What I'd build is a recorder that logs every model call, tool call, and intermediate state into a structured trace, keyed on the (skill, model, environment) triple, and graders as plain Python functions over that trace shape. JSONL per run is enough storage. A few hundred lines of recorder plus one grader function per check, and you have the equivalent.
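
A stripped-down sketch of that recorder and one grader, assuming one JSONL file per run. The event schema, class name, and grader are hypothetical, and a real recorder would also capture intermediate state and handle failures.

```python
import json
import tempfile
import time
from pathlib import Path


class TraceRecorder:
    """Appends one JSON object per event to a per-run JSONL file."""

    def __init__(self, path: Path, skill: str, model: str, environment: str):
        self.path = path
        self.key = {"skill": skill, "model": model, "environment": environment}

    def record(self, kind: str, **data) -> None:
        """Log one event, tagged with the (skill, model, environment) key."""
        event = {"ts": time.time(), "kind": kind, "key": self.key, "data": data}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")


def load_trace(path: Path) -> list[dict]:
    """Read a run's events back in the order they were written."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def grade_max_tool_calls(trace: list[dict], limit: int) -> bool:
    """A grader is a plain function over the trace: here, a tool-call budget."""
    return sum(1 for event in trace if event["kind"] == "tool_call") <= limit


with tempfile.TemporaryDirectory() as tmp:
    run_path = Path(tmp) / "run-0001.jsonl"
    rec = TraceRecorder(run_path, skill="checkout@v3",
                        model="model-2025-01", environment="fixture@v7")
    rec.record("model_call", prompt_tokens=812)
    rec.record("tool_call", method="page.snap")
    rec.record("tool_call", method="page.extract")
    trace = load_trace(run_path)
    within_budget = grade_max_tool_calls(trace, limit=5)

print(len(trace), within_budget)  # 3 True
```

Because graders only see the trace, they can be rerun against stored runs after a grader changes, without paying for the model calls again.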

The field is moving fast. Berkeley's exploit work is weeks old, Anthropic's methodology has had a few months, and none of it has had time to calcify into best practice. The right shape for this layer in a year will probably look different from anything I've written here.
