We track performance vs. the all-in cost of completing real engineering tasks, rather than cost per token. [1]
Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside - This is also why TPS isn't a great metric).
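To make the per-token vs. per-task point concrete, here is a minimal sketch. The prices and token counts are made up for illustration and are not measurements of any model; `task_cost` is a hypothetical helper.

```python
# Hypothetical prices and token counts, purely for illustration.
def task_cost(price_per_mtok: float, tokens_used: int) -> float:
    """All-in dollar cost of one task at a flat price per million tokens."""
    return price_per_mtok * tokens_used / 1_000_000


# A model that is cheaper per token but burns more tokens on the task...
cheap_per_token = task_cost(price_per_mtok=5.0, tokens_used=900_000)
# ...versus one that costs twice as much per token but uses a third of the tokens.
pricey_per_token = task_cost(price_per_mtok=10.0, tokens_used=300_000)

print(cheap_per_token, pricey_per_token)
```

With these assumed numbers, the model with the lower per-token price ends up costing more per completed task.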
We found that 5.5 is about 1.5-2x more expensive overall. On a "Pareto" basis, we only find 5.5 xhigh worth it. At the lower reasoning levels, 5.4 still edges it out on cost/perf.
We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.
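A rough sketch of what "Pareto basis" means here, assuming each configuration is summarized by one all-in cost and one quality score. The names, costs, and scores below are invented for illustration and are not leaderboard data.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost: float   # all-in cost per task, lower is better
    score: float  # quality score, higher is better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep the runs that no other run matches or beats on both cost and score."""
    def dominated(r: Run) -> bool:
        return any(
            o.cost <= r.cost and o.score >= r.score
            and (o.cost < r.cost or o.score > r.score)
            for o in runs
        )
    return sorted((r for r in runs if not dominated(r)), key=lambda r: r.cost)


# Made-up configurations: two models ("A", "B") at two reasoning levels each.
runs = [
    Run("A-low", 1.0, 0.60),
    Run("A-high", 2.0, 0.70),
    Run("B-low", 1.5, 0.58),
    Run("B-high", 3.5, 0.80),
]

for run in pareto_frontier(runs):
    print(run.name, run.cost, run.score)
```

In this toy data, "B-low" drops out because "A-low" is both cheaper and better, while "B-high" stays on the frontier because nothing else reaches its score. That is the shape of the claim above: a pricier model can be worth it only at its top setting.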
Interesting! I've been thinking about how to create a similar type of evaluation system for myself. How do you handle tweaks to agentic tasks? Say a model gets pretty close to what you want, so you just need a quick follow-up prompt to the original response?
If the changes needed are small, I'll apply the best implementation as a foundation and then just iterate directly.
If the changes needed are drastic, it usually signals that something in the spec was wrong or ambiguous (or that the ensemble was too weak, which is rarely the case). In cases like this, I improve the spec and then rerun.
If it's in the middle, I'll usually apply the best and write a follow-on spec.
How does that get integrated into the scoring system? I'm imagining a scenario where a cheaper model gets close and only needs a small follow-up to reach the desired result. How would this score in comparison to a larger model that got it right the first time, even if it may have been much more expensive overall?
We also use a secondary signal from blinded multi-verifier reviews. Each verifier ranks the candidates, and those verification outcomes serve as an additional quality signal. It's somewhat similar to consensus labeling.
Btw, this also helps manage scale. Eg you have 15 diffs to review. Run a few verifiers to get a short list, then review directly and apply the best.
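As a sketch of how several verifier rankings could be collapsed into a shortlist: the thread does not say how rankings are aggregated, so the Borda count below is an assumption, and the function name and diff ids are hypothetical.

```python
from collections import defaultdict


def borda_shortlist(rankings: list[list[str]], k: int) -> list[str]:
    """Each ranking lists candidate ids best-first; return the top k by Borda points."""
    points: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Best place earns n - 1 points, last place earns 0.
            points[candidate] += n - 1 - position
    # Sort by points descending, then by id for a deterministic tie-break.
    return sorted(points, key=lambda c: (-points[c], c))[:k]


# Three blinded verifiers each rank the same four candidate diffs (made-up ids).
rankings = [
    ["d3", "d1", "d2", "d4"],
    ["d1", "d3", "d4", "d2"],
    ["d3", "d2", "d1", "d4"],
]

shortlist = borda_shortlist(rankings, k=2)
print(shortlist)
```

The shortlist is then what gets reviewed by hand, and the best diff is applied.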
Yes, the signal we are measuring is quite different from most evals.
We are measuring something much closer to: when multiple agents compete on the same spec, which one produces the patch that holds up best in code review?
Most evals are static / synthetic, and for code, generally stop at tests. Test evals are weak proxies for quality since it's difficult to encode qualities like scope creep/churn, codebase fit, maintainability etc in tests. [1]
Almost every agent in a given run can pass tests at this point, but there is large separation during review.
Ok, but my point is that the claims you make about more reasoning performing worse seem kinda suspicious, and I haven't seen any analysis exploring why that would happen.
I get it, but that is a significant claim. And the claim could be right, but it could also be wrong, and I see no analysis, not even a blog post on your website saying "wow, look at this weird thing we found". To me that makes the claim suspicious because it signals that nobody thought to investigate what's going on. Investigating weird results is how we demonstrate that what we're doing is right.
So far we have been native harnessmaxxing, which simplifies things a lot.
The configuration space around open models is much larger. Eg which models, capability heterogeneity, which harness, networking, data egress / privacy, etc.
If anyone is getting very good production code out of open models, I'd love to do a user interview to better understand your setup. Email is in my bio.
With how much vendor harnesses are now actively steering the agent with their own instructions on top of user prompts, I think it’d be super interesting to see a comparison of one of the already tested models, so Opus 4.7 or GPT-5.5, across a range of different harnesses that aren’t its native one: OpenCode, Pi, Hermes, Kilo Code. The most popular coding-focused harnesses, basically.
[1] https://voratiq.com/leaderboard?x=cost