
I'm convinced the path to ubiquity (such as embedded in smartphones) is quantization.

I had to quantize a Llama model to int4 to get it to run properly on my 3060.

I'm curious, how much resolution / significant digits do we actually need for most genAI work? If you can draw a circle with 3.14, maybe it's good enough for fast and ubiquitous usage.
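To make the resolution question concrete, here's a minimal sketch of symmetric int4 quantization (hypothetical toy code, not any particular library's quantizer): 4 bits give only 16 representable levels per tensor, so every weight gets snapped to the nearest one.

```python
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Reconstruct approximate floats from the int4 codes."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.07]
q, scale = quantize_int4(weights)
approx = dequantize_int4(q, scale)
# Each reconstructed weight lands within half a quantization step
# (scale / 2) of the original - that's all the resolution you keep.
```

The surprise in practice is how well models tolerate exactly this kind of rounding, at least at inference time.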



Earlier this year there was a paper from Microsoft where they trained a 1.58-bit LLM (every parameter being ternary) that matched the performance of 16-bit models. There's also other research showing you can prune up to 50% of layers with minimal loss of performance. Our current training methods are just incredibly crude, and we will probably look back on them in the future and wonder how this ever worked at all.
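For anyone curious what "every parameter being ternary" looks like, here's a hedged sketch of the absmean-style quantization the BitNet b1.58 paper describes (simplified, assuming a flat list of weights): each weight becomes -1, 0, or +1, which is log2(3) ≈ 1.58 bits of information.

```python
def ternarize(weights):
    """Snap weights to {-1, 0, 1}, scaled by the mean absolute value."""
    scale = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

w = [0.9, -0.05, -1.2, 0.4]
t, scale = ternarize(w)
# t contains only values from {-1, 0, 1}; at inference time the matmul
# reduces to additions, subtractions, and skips - no multiplies needed.
```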


None of those papers actually use quantized training, they are all about quantized inference.

Which is rather unfortunate as it means that the difference between what you can train locally and what you can run locally is growing ever larger.


Indeed. I think the "AI gold rush" sucks anyone with skills in this area into it with relatively good pay, so there are almost no people outside of big tech and startups to counterbalance the direction it moves in. And as a side note, big tech is and always was putting its own agenda first when developing any tech or standard, and that usually means milking investments as long as possible, not necessarily moving things forward.


There's more to it than that.

If you could train models faster, you’d be able to build larger, more powerful models that outperform the competition.

The fact that Llama 3 was trained significantly beyond what was considered ideal even three years ago shows there's a strong appetite for efficient training. The lack of progress isn't due to a lack of effort. No one has managed to do this yet because no one has figured out how.

I built 1-trit quantized models as a side project nearly a decade ago. Back then, no one cared because models weren't yet using all available memory, and on devices where memory was fully utilized, compute power was the limiting factor. I spent much longer trying to figure out how to get 1-trit training to work, and I never could. Of all the papers I've read and people in the field I've talked to, no one else has either.


People did care back then. This paper jumpstarted the whole model compression field (which had been a hot area of research in the early 90s): https://arxiv.org/abs/1511.00363

Before that, in 2012, AlexNet had to be partially split into two submodels running on two GPUs (using a form of interlayer grouped convolutions) because it could not fit in the 3GB of a single 580 card.

Ternary networks appeared in 2016. Unless you mean you actually tried to train in ternary precision - clearly not possible with any gradient-based optimization method.


> I spent much longer trying to figure out how to get 1-trit training to work and I never could.

What did you try? What were the research directions at the time?


This is a big question that needs a research paper worth of explanation. Feel free to email me if you care enough to have a more in-depth discussion.


Sorry, I understand it was a bit intrusively direct. For context: I toyed a little with neural networks a few years ago and wondered myself about this topic of training a so-called quantized network (I wanted to write a small multilayer-perceptron library parameterized by the coefficient type - floating point or integers of different precision), but didn't implement it. Since you mentioned your own work in that area, it piqued my interest, but I don't want to waste your time unnecessarily.


Someone posted a paper that I didn't know about, but goes through pretty much all the work I did in the space: https://news.ycombinator.com/item?id=42095999

It's missing the colourful commentary that I'd usually give, but alas, we can't have it all.


thank you, that looks awesome.


That's wrong. I don't know where you got that information from, because it is literally the opposite of what is shown in the Microsoft paper mentioned above. They explicitly introduced this extreme quantization during training from scratch and show how it can be made stable.


I got it from section 2.2

> The number of model parameters is slightly higher in the BitLinear setting, as we both have 1.58-bit weights as well as the 16-bit shadow weights. However, this fact does not change the number of trainable/optimized parameters in practice.

https://arxiv.org/html/2407.09527v1


Exactly as xnornet was doing way back in 2016 - shadow 32bit weights, quantized to 1 bit during the forward pass.

https://arxiv.org/abs/1603.05279
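The shared mechanism can be sketched in a few lines (a simplified scalar example of the shadow-weight / straight-through-estimator idea, not either paper's actual code): the forward pass sees a binarized weight, but the optimizer update lands on a full-precision latent copy.

```python
def binarize(w):
    """1-bit forward-pass weight: just the sign."""
    return 1.0 if w >= 0 else -1.0

latent = 0.3   # full-precision "shadow" weight the optimizer keeps
lr = 0.2
for grad in [1.0, 1.0, 1.0]:   # pretend gradients from three steps
    # The forward pass would use binarize(latent); the straight-through
    # estimator passes the gradient back to the latent weight unchanged,
    # so small updates can accumulate until the sign eventually flips.
    latent -= lr * grad
# Only binarize(latent) is shipped for inference; the latent copy is
# why training still needs high-precision storage.
```

This is why "1-bit training" in these papers still means high-precision training memory: the 1-bit weights only exist on the fly during the forward pass.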

I personally have a pretty negative opinion of the bitnet paper.


Thanks for the citation, I did my work in the area around 2014 and never looked back. That's a very good summary of the state of the field as I remember it.


What? That's the wrong paper. It is not even from Microsoft. This is it: https://www.microsoft.com/en-us/research/publication/bitnet-...

>we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch


Section 2.2 from your paper, with less clarity and more obfuscation:

>While the weights and the activations are quantized to low precision, the gradients and the optimizer states are stored in high precision to ensure training stability and accuracy. Following the previous work [ LSL+21 ], we maintain a latent weight in a high-precision format for the learnable parameters to accumulate the parameter updates. The latent weights are binarized on the fly during the forward pass and never used for the inference process.

https://arxiv.org/pdf/2310.11453

The other paper had a much nicer and clearer introduction to BitLinear than the original Microsoft paper, which is why I used it. Uncharitably, you might say that at least they aren't burying the lede 10 paragraphs in.


They are not hiding anything, because this is standard behaviour for all current optimisers. You still get a massive memory improvement from lower bit model weights during training.


I'm sorry but you just don't understand what the paper is saying.


Do you want a cookie for joining the overwhelming majority?

Necessary precision depends on, unsurprisingly, what you're truncating. Flux drops off around q6. Text generation around q4.

The LLMs Apple is putting in iPhones are q4 3B models.



