We track performance vs. the all-in cost of completing real engineering tasks, r...

digdugdirk · 2026-05-08T14:36:11 1778250971

Interesting! I've been thinking about how to create a similar type of evaluation system for myself. How do you handle tweaks to agentic tasks? Say that a model gets pretty close to what you want, so you just need a quick follow up prompt to the original response?

languid-photic · 2026-05-08T15:10:58 1778253058

Yes! It depends on the extent of changes needed.

If the changes needed are small, I'll apply the best implementation as a foundation and then just iterate directly.

If the changes needed are drastic, it usually signals that something was wrong or ambiguous in the spec (or that the ensemble was too weak, which is rarely the case). In cases like this, I improve the spec and then rerun.

If it's in the middle, I'll usually apply the best and write a follow on spec.

digdugdirk · 2026-05-08T15:59:29 1778255969

How does that get integrated into the scoring system? I'm imagining a scenario where a cheaper model may get close, but only needs a small follow up to get the desired result. How would this score in comparison to a larger model that got it right the first time - even if it may have been much more expensive overall?

languid-photic · 2026-05-08T16:43:53 1778258633

We also use a secondary signal from blinded multi-verifier reviews. Each verifier ranks the candidates, and those verification outcomes serve as an additional quality signal. It's somewhat similar to consensus labeling.

Btw, this also helps manage scale. Eg you have 15 diffs to review. Run a few verifiers to get a short list, then review directly and apply the best.
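
A minimal sketch of the shortlisting step, assuming each verifier returns a best-first ranking of candidate diffs and that rankings are combined with a simple Borda count. The aggregation rule, the function name `borda_shortlist`, and the `diff_*` ids are illustrative assumptions, not necessarily how the rankings are actually combined:

```python
from collections import defaultdict


def borda_shortlist(rankings, k=3):
    """Combine best-first rankings from several verifiers into a shortlist.

    Assumption: a Borda count, where a candidate ranked at position p
    (0 = best) out of n earns n - 1 - p points from that verifier.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    # Highest total first; ties are broken by candidate id for determinism.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:k]


# Hypothetical rankings from three blinded verifiers over four diffs.
rankings = [
    ["diff_a", "diff_b", "diff_c", "diff_d"],
    ["diff_b", "diff_a", "diff_d", "diff_c"],
    ["diff_a", "diff_c", "diff_b", "diff_d"],
]
shortlist = borda_shortlist(rankings, k=2)
print(shortlist)
```

The shortlist only narrows down what gets a direct human review; the final pick is still made by reading the diffs.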

BugsJustFindMe · 2026-05-08T16:07:26 1778256446

It feels pretty weird that your ratings have:

gpt-5-4-high > gpt-5-4-xhigh

gpt-5-4-high > gpt-5-5-high

gpt-5-4 > gpt-5-5

gpt-5-2-high > gpt-5-2-xhigh

No other ratings I've seen show that.

languid-photic · 2026-05-08T16:37:14 1778258234

Yes, the signal we are measuring is quite different from most evals.

We are measuring something much closer to: when multiple agents compete on the same spec, which one produces the patch that holds up best in code review?

Most evals are static / synthetic, and for code, generally stop at tests. Test evals are weak proxies for quality since it's difficult to encode qualities like scope creep/churn, codebase fit, maintainability etc in tests. [1]

Almost every agent in a given run can pass tests at this point, but there is large separation during review.

[1] https://voratiq.com/blog/your-workflow-is-the-eval

BugsJustFindMe · 2026-05-08T19:10:47 1778267447

Ok, but my point is that the claims you make about more reasoning performing worse seem kinda suspicious and I haven't seen any analysis exploring why that would happen.

languid-photic · 2026-05-08T20:51:15 1778273475

My point is that more reasoning often leads to worse "scope creep/churn, codebase fit, maintainability".

BugsJustFindMe · 2026-05-08T21:12:00 1778274720

I get it, but that is a significant claim. And the claim could be right, but it could also be wrong, and I see no analysis, not even a blog post on your website saying "wow, look at this weird thing we found". To me that makes the claim suspicious because it signals that nobody thought to investigate what's going on. Investigating weird results is how we demonstrate that what we're doing is right.

languid-photic · 2026-05-08T21:33:04 1778275984

It’s mostly a bandwidth thing. We’ve seen the pattern consistently, but haven’t had time yet to write up the analysis carefully.

We are not the only ones to see the reasoning inversion: https://arxiv.org/abs/2510.11977, https://arxiv.org/abs/2502.08235, https://arxiv.org/abs/2507.14417

lukewarm707 · 2026-05-08T15:27:24 1778254044

would be interesting to see some other labs:

- deepseek v4 pro

- glm 5.1

- kimi k2.6

- qwen 3.6 max

- xiaomi 2.5 pro

- minimax 2.7

- grok

languid-photic · 2026-05-08T15:49:32 1778255372

I agree!

So far we have been native harnessmaxxing, which simplifies things a lot.

The configuration space around open models is much larger. Eg which models, capability heterogeneity, which harness, networking, data egress / privacy, etc.

If anyone is getting very good production code out of open models, I'd love to do a user interview to better understand your setup. Email is in my bio.

thepasch · 2026-05-08T16:08:09 1778256489

With how much vendor harnesses are now actively steering the agent with their own instructions on top of user prompts, I think it’d be super interesting to see a comparison of one of the already tested models - so Opus 4.7 or GPT-5.5 - across a range of different harnesses that aren’t their native. OpenCode, Pi, Hermes, Kilo Code. The most popular coding-focused harnesses, basically.

languid-photic · 2026-05-08T16:38:59 1778258339

Agreed. Harness is really important. Especially since many labs are now post-training agents directly in their native harness.

(Which is why my prior is that third party harnesses would not perform as well. But I haven't actually measured this.)

cyberpunk · 2026-05-08T19:27:10 1778268430

OpenCode seems to give me better results than codex-cli, I’d be interested in seeing this too!

motbus3 · 2026-05-08T17:46:09 1778262369

But in what situation does it seem good to enable xhigh?