You need a harness, yes, and you need quality gates the agent can't mess with, a...

irthomasthomas · 2026-03-04T12:43:56 1772628236

Here is an example where the prompt was only a few hundred tokens and the output reasoning chain was correct, but the actual function call was wrong: https://x.com/xundecidability/status/2005647216741105962?s=2...

vidarh · 2026-03-06T15:28:38 1772810918

Your point being? A proper harness will mostly catch things like that. Even a low-end model can be employed to write test plans and do consistency checks that mostly weed out stuff like that. Hence: you need a harness, or you'll spend your time worrying about dumb stuff like this.
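
To make "harness" a bit more concrete, here is a rough sketch of one gate of that kind: a check that runs on the model's proposed function call before anything executes it. Everything here is hypothetical (the `TOOLS` registry, the tool names, `check_call`), not taken from any particular agent framework. It only catches structural mistakes such as an unknown tool or a wrong argument name; a call that is well-formed but semantically wrong still needs tests or a second-pass consistency check.

```python
from dataclasses import dataclass, field

# Hypothetical tool registry: tool name -> required argument names and types.
# The agent cannot edit this; it lives in the harness, outside the agent's reach.
TOOLS = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}


@dataclass
class GateResult:
    ok: bool
    errors: list = field(default_factory=list)


def check_call(name, args):
    """Validate a proposed tool call before the harness executes it."""
    spec = TOOLS.get(name)
    if spec is None:
        return GateResult(False, [f"unknown tool: {name}"])

    errors = []
    # Every required argument must be present and of the expected type.
    for key, typ in spec.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif not isinstance(args[key], typ):
            errors.append(f"{key} should be {typ.__name__}")
    # Reject arguments the tool does not define (often a hallucinated name).
    for key in args:
        if key not in spec:
            errors.append(f"unexpected argument: {key}")
    return GateResult(not errors, errors)
```

On a failed check the harness would feed `errors` back to the model and ask for a retry instead of running the call.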