The cost of doing evals has gone down substantially...

Evals have been extremely important for building agentic products. This is a known fact, but the main problem has always been doing better or decent evals.

The hard part of evals is not just building the harness. It is more of:

Creating the dataset
Running it with different configurations
Figuring out which one compares, how one run compares to the other
And if the scores are low, why is the score low, digging deep
In two different data points, different questions, seeing if there are failure points, improving it, and so on and so forth

Basically, the hardest part is looking at the data and digging deep into different points. There are so many failure points. It's extremely exhausting, taking not just a good hunch but real understanding and connecting the dots as to why things are going wrong.

But a lot of this has changed over the last few months, especially with coding agents becoming so good and the harness around the coding agents improving at an exponential rate. I'll share one of my workflows for building evals and how I do it with Cursor.

The basic idea is you can use coding agents for these 3 things heavily:

1. Helping you build a dataset

For example, I've been creating an MCP dataset for MCP calls. I just got Composer to do a bunch of tool calls with a lot of different parameters, and then use that to build the data source, which will act as the result of the mock tool calls.

While this is one of the most critical paths, you can lean heavily on these and the new models to do a pretty good job at it, especially when the answers can be a bit deterministic.

2. Running the evals with different configurations

The second part is running the evals with different configurations, tinkering around with different configs and accumulating the results. Agents are great at this. Just give them a bunch of configs which are easy to configure and tell them to run different thinking modes, models, and whatever concerns your evals, and just let them rip. They'll run all of them, consolidate the data, and keep a good overall summary for you ready.

3. Looking at your eval runs

The third and the most important part is basically looking at your eval runs. You can lean on agents to do really good work here as well, sort of narrowing down which questions had the biggest delta and then going through the trajectory. Figuring out what went wrong, why the numbers are different, where the numbers are different the most, and all those things can be offloaded to the agent. What remains for you is to make the final judgment or just give this agent a direction in which it should compare multiple results.

The path of auto research

The next step is obviously the path of auto research. You can now just run this in a loop, tell the agent to keep tweaking things and running the evals till it gets it right. The only thing I'm hesitant to do a bit is evals where prompt tuning is involved. I don't think agents do a great job at it. They are a bit too verbose. They'll start p-hacking the prompts to make the questions pass, but still I think it's a pretty good way to get to Phase 1, even with prompt hacking and whatever the agent does, just to see if it ends up working well or not. Then you can extract their learnings from whatever the agent has done and restart the auto research pipeline in a way you would want to.