You can still run larger MoE models by off-loading the expert weights to the CPU for token generation. They are by and large usable: I get ~50 tok/s with Kimi Linear 48B (3B active) on a potato PC plus a 3090.
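For the curious, here's roughly what that looks like in llama.cpp (a sketch, assuming a recent build with the tensor-override flag; the model filename is just a hypothetical quantized GGUF):

```shell
# Off-load all layers to the GPU first, then use a tensor override to pin
# the per-expert FFN tensors (the bulk of an MoE model's weights) to CPU RAM.
# Attention and dense/shared weights stay on the 3090.
llama-server \
  -m kimi-linear-48b-a3b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU"
```

Recent builds also have a `--n-cpu-moe N` shorthand for keeping the expert tensors of the first N layers on CPU, if I remember right.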
I agree with the previous post: there's hope for a convergence point in the not-too-distant future where consumer hardware can run powerful models.
At the moment, the 397B Qwen3.5 model (which I assume is what you're referring to) is still out of reach for most consumers to run locally: the only relatively straightforward path (i.e. discounting custom Threadripper builds) to running it would be a 512GB Mac Studio.
However, in a generation or two (of hardware and models) maybe we'll see convergence: more hardware available with 300–400GB of memory for more approachable money (a tough sell right now, I accept, with memory prices as they are) and models offering great performance in that size range.
One often overlooked aspect is that ggml, the tensor library underlying llama.cpp, is not based on PyTorch, just plain C++. In a world where PyTorch dominates, it shows that alternatives are possible and worth pursuing.