Hacker News
Building an Inference Engine in 1,800 Lines of C++ (linuxtoaster.com)
1 point by dirk94018 11 days ago | 1 comment



Author here. This started because our C inference engine was slower than Python, which was annoying.

We got it to 400 tok/s prefill and 100 tok/s generation in 1,800 lines of C++, with no dependencies beyond MLX. Simply not redoing work accounted for a 125x improvement.

Favorite moment: the model suggested enabling MetalFX to speed up inference. That's Apple's game graphics upscaler. It makes explosions look better.

AMA about any of it. We are working on the Qwen3.5 models. Local AI is going to get a lot better.




