Hacker News
Building an Inference Engine in 1,800 Lines of C++ (linuxtoaster.com)
1 point by dirk94018 11 days ago | 1 comment



Author here. This started because our C inference engine was slower than Python, which was annoying.

We got it to 400 tok/s prefill and 100 tok/s generation in 1,800 lines of C++, with no dependencies beyond MLX. Simply not redoing work accounted for a 125x improvement.

Favorite moment: the model suggested enabling MetalFX to speed up inference. That's Apple's game graphics upscaler. It makes explosions look better.

AMA about any of it. We are working on the Qwen3.5 models. Local AI is going to get a lot better.




