
I am not really an ML dev, so I don't understand most of it. It sounds almost ridiculous that it would even work. Brilliant work and a great article; I enjoyed reading it.

This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?




No worries, happy to discuss anyway :)

MoE (mixture of experts) is an architecture that forces sparsity (not all 'neurons' are active during the forward pass).
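To make the sparsity point concrete, here's a toy sketch of the top-k routing that MoE layers typically use. All names and shapes are made up for illustration; real MoE implementations add load balancing, batching, etc.:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# A learned router scores each expert for the incoming token vector.
W_router = rng.standard_normal((d_model, n_experts))
# Each expert is stood in for by a single weight matrix.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ W_router
    # Only the top-k experts run; the others stay inactive (the sparsity).
    chosen = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[chosen])
    weights /= weights.sum()
    # Weighted combination of just the chosen experts' outputs.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.standard_normal(d_model)
y = moe_forward(x)
```

So per token, only `top_k / n_experts` of the expert parameters do any work, which is the sparsity being forced.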

This is pretty much orthogonal to that; it works with dense and MoE models, by repeating 'vertical' sections of the transformer stack.
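A toy sketch of what repeating 'vertical' sections looks like, to contrast with the MoE example above. This is my own illustrative stand-in (simple residual blocks, made-up names), not the article's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# Two blocks standing in for a 'vertical' section of transformer layers.
blocks = [rng.standard_normal((d, d)) * 0.1 for _ in range(2)]

def forward(x, repeats=3):
    # Loop the same section 'repeats' times, reusing its weights,
    # so effective depth grows without adding any parameters.
    for _ in range(repeats):
        for W in blocks:
            x = x + np.tanh(x @ W)  # residual block stand-in
    return x

x = rng.standard_normal(d)
y = forward(x, repeats=3)
```

Note the orthogonality: every block's weights are used on every pass (it is dense), whereas MoE skips most experts per token. The two tricks can therefore be combined.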


>forces sparsity

That's branching and then coalescing, right? It selects a path that is weighted as being most beneficial to the input?

Given you pointed out how even the vertical part of the architecture allows for skipping layers anyway, isn't that essentially the same thing?




