You can still run larger MoE models by off-loading the expert weights to the CPU for token generation. They are by and large usable: I get ~50 tok/s with Kimi Linear 48B (3B active) on a potato PC plus a 3090.
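For the curious, here's roughly what that looks like in llama.cpp (a sketch, assuming a recent build with the tensor-override flag; the model filename is just a hypothetical quantized GGUF):

```shell
# Off-load all layers to the GPU first, then use a tensor override to pin
# the per-expert FFN tensors (the bulk of an MoE model's weights) to CPU RAM.
# Attention and dense/shared weights stay on the 3090.
llama-server \
  -m kimi-linear-48b-a3b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU"
```

Recent builds also have a `--n-cpu-moe N` shorthand for keeping the expert tensors of the first N layers on CPU, if I remember right.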
I agree with the previous post: there's hope for a convergence point in the not-too-distant future where consumer hardware can run powerful models.
At the moment, the 397B Qwen3.5 model (which I assume is what you're referring to) is still out of reach for most consumers to run locally: the only relatively straightforward path (i.e. discounting custom Threadripper builds) to running it would be a 512GB Mac Studio.
However, in a generation or two (of hardware and models) maybe we'll see convergence: more hardware available with 300–400GB of memory for more approachable money (a tough sell right now, I accept, with memory prices as they are) and models offering great performance in that size range.
One often overlooked aspect is that ggml, the tensor library underlying llama.cpp, is not based on PyTorch, just plain C++. In a world where PyTorch dominates, it shows that alternatives are possible and worth pursuing.