Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

  | I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

  ● Drive. The car needs to be at the car wash.
Wonder if this is just randomness because its an LLM, or if you have different settings than me?


My settings are pretty standard:

% claude Claude Code v2.1.111 Opus 4.7 (1M context) with xhigh effort · Claude Max ~/... Welcome to Opus 4.7 xhigh! · /effort to tune speed vs. intelligence

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. 50 meters is shorter than most parking lots — you'd spend more time starting the car and parking than walking there. Plus, driving to a car wash you're about to use defeats the purpose if traffic or weather dirties it en route.


To me Claude Opus 4.6 seems even more confused.

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.


Just asked Claude Code with Opus-4.6. The answer was short "Drive. You need a car at the car wash".

No surprises, works as expected.


Yeah, it was probably patched. It could reason novel problems only of you ask it to pay attention to some particular detail a.k.a. handholding..

Same would happen with the the sheep and the wolf and the cabbage puzzle. If you l formulated similarly, there is a wolf and a cabbage without mentioning the sheep, it would summon up the sheep into existence at a random step. It was patched shortly after.


I’m not sure ‘patched’ is the right word here. Are you suggesting they edited the LLM weights to fix cabbage transportation and car wash question answering?


Absolutely not my area of expertise but giving it a few examples of what should be the expected answer in a fine-tuning step seems like a reasonable thing and I would expect it would "fix" it as in less likely to fall into the trap.

At the same time, I wouldn't be surprised if some of these would be "patched" via simply prompt rewrite, e.g. for the strawberry one they might just recognize the question and add some clarifying sentence to your prompt (or the system prompt) before letting it go to the inference step?

But I'm just thinking out loud, don't take it too seriously.


Used patched for lack of a better word. Not sure how they fix the edge cases for these types of fixes/patches or whatever they’re specifically called


They might have further trained the model with these edgecases in the dataset


Whatever it was, that’s not real thinking, we can possibly patch all knowledge and even if we did, it would become crystallize somehow.


What if it’s raining though? Car wash wouldn’t be open though it would waste gas


There is a certain amount of it which is the randomness of an LLM. You really want to ask most questions like this several times.

That said, I have several local models I run on my laptop that I've asked this question to 10-20 times while testing out different parameters that have answered this consistently correctly.


--------> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — driving there and back takes longer than walking, and you'd just be dirtying the car you're about to clean.

---------> haha but how can i wash my car then?

Walk to the car wash first to check what they offer — many have self-service bays where you bring your car, or full-service where they wash it for you. If it's self-service, then drive over with the dirty car. If full-service, drive over and they handle it.

Either way, the car needs to end up at the wash — the walk was just to scout it out. Or just drive straight there; 50 meters of extra dirt won't matter.

----------> you, you sound like a politician. Are you sure your first answer was really reasoned like your second answer?

You're right, my first answer was nonsense. I pattern-matched on "50 meters = walk" without thinking about the actual task. You can't wash a car without the car being there. Just drive.


I've tried these with Claude various times and never get the wrong answer. I don't know why, but I am leaning they have stuff like "memory" turned on and possibly reusing sessions for everything? Only thing I think explains it to me.

If your always messing with the AI it might be making memories and expectations are being set. Or its the randomness. But I turned memories off, I don't like cross chats infecting my conversations context and I at worse it suggested "walk over and see if it is busy, then grab the car when line isn't busy".


Even Gemini with no memory does hilarious things. Like, if you ask it how heavy the average man is, you usually get the right answer but occasionally you get a table that says:

- 20-29: 190 pounds

- 30-39: 375 pounds

- 40-49: 750 pounds

- 50-59: 4900 pounds

Yet somehow people believe LLMs are on the cusp of replacing mathematicians, traders, lawyers and what not. At least for code you can write tests, but even then, how are you gonna trust something that can casually make such obvious mistakes?


> how are you gonna trust something that can casually make such obvious mistakes?

In many cases, a human can review the content generated, and still save a huge amount of time. LLMs are incredibly good at generating contracts, random business emails, and doing pointless homework for students.


And humans are incredibly bad at "skimming through this long text to check for errors", so this is not a happy pairing.

As for the homework, there is obviously a huge category that is pointless. But it should not be that way, and the fundamental idea behind homework is sound and the only way something can be properly learnt is by doing exercises and thinking through it yourself.


Yeah, ChatGPT's paid version is wildly inaccurate on very important and very basic things. I never got onboard with AI to begin with but nowadays I don't even load it unless I'm really stuck on something programming related.


So what? That might happen one out of 100 times. Even if it’s 1 in 10 who cares? Math is verifiable. You’ve just saved yourself weeks or months of work.


You don't think these errors compound? Generated code has 100's of little decisions. Yes, it "usually" works.


LLM’s: sometimes wrong but never in doubt.


Not in my experience. With a proper TDD framework it does better than most programmers at a company who anecdotally have a bug every 2-3 tasks.


The kind of mistakes it makes are usually strange and inhuman though. Like getting hard parts correct while also getting something fundamental about the same problem wrong. And not in the “easy to miss or type wrong” way.

I wish I had an example for you saved, but happens to me pretty frequently. Not only that but it also usually does testing incorrectly at a fundamental level, or builds tests around incorrect assumptions.


I've seen LLMs implement "creative" workarounds. Example: Sonnet 4.5 couldn't figure out how to authenticate a web socket request using whatever framework I was experimenting with, so it decided to just not bother. Instead, it passed the username as part of the web socket request and blindly trusted that user was actually authenticated.

The application looked like it worked. Tests did pass. But if you did a cursory examination of the code, it was all smoke and mirrors.


Yeah recently it had an issue getting OIDC working and decided to implement its own, throwing in a few thousand extra lines. I'm sure there were no security holes created in there at all. /s


Well, the tests passed, right?


yes i wished i had safes some of my best examples too. One i had was super weird in chatgpt pro. It told me that after 30 years my interest would become negative and i would start loosing money. Didnt want to accept the error.


Errors compounding is a meme. In iterated as well as verifiable domains, errors dilute instead of compounding because the llm has repeated chances to notice its failure.


Yes, just use random results. You’ve just saved yourself weeks or months of work of gathering actual results.


Claude Opus 4.7 responds with walk for me with and without adaptive thinking, but neither the basic model used when you Google search or GPT 5.4 do.


Or, the first time a mistake is detected, a correction is automatically applied.


Idk but ironically, I had to re-read the first part of GP's comment three times, wondering WTF they're implying a mistake, before I noticed it's the car wash, not the car, that's 50 meters away.

I'd say it's a very human mistake to make.


> I'd say it's a very human mistake to make.

>> It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

It talks about starting, driving, and parking the car, clearly reasoning about traveling that distance in the car not to the car. It did not make the same mistake you did.


We truly do not need to lower the bar to the floor whenever an LLM makes an embarrassing logical error, particularly when the excuses don't line up at all with the reasoning in its explanation.


I don't want my computer to make human mistakes.


It may be inescapable for problems where we need to interpret human language?


then throw away the turing test


then don't train it on human data


LLMs do not have trouble reading, it didn't make the mistake you made and it wouldn't. You missed a word, LLMs cannot miss words. It's not even remotely a human mistake.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: