Diffusion models require a lot more computation to get results than autoregressive transformers. As a naive data point: when I'm running a transformer locally I get about 30% GPU utilization; when I'm running a diffusion model I'm getting 100%.
This means that the only speed saving you get from quantizing a diffusion model is being able to do more effective FLOPs because the numbers are smaller, e.g. instead of doing one 32-bit multiplication, you're doing eight 4-bit ones.
By comparison, for transformers you gain not only the FLOP increase but also the reduction in memory traffic, e.g. it also takes you 8x less time to load the weights from VRAM into on-chip working memory.
The above is a vast oversimplification and in practice has more asterisks than you can shake a stick at.
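To make the compute-bound vs. memory-bound distinction concrete, here's a back-of-envelope arithmetic-intensity sketch. All the numbers are illustrative assumptions, not measurements: a hypothetical 7B-parameter model in fp16, and a GPU with ~1000 GB/s of bandwidth and ~100 TFLOP/s of fp16 compute.

```python
# Back-of-envelope arithmetic intensity: FLOPs per byte of weights moved.
# Assumed, illustrative numbers: 7B parameters, fp16 weights, and a GPU
# with ~1000 GB/s memory bandwidth and ~100 TFLOP/s of fp16 compute.
params = 7e9
bytes_per_weight = 2  # fp16

def arithmetic_intensity(tokens_per_pass):
    flops = 2 * params * tokens_per_pass     # one multiply-add per weight per token
    bytes_moved = params * bytes_per_weight  # weights read once per forward pass
    return flops / bytes_moved

# Autoregressive decode touches every weight to produce ONE token:
print(arithmetic_intensity(1))     # 1.0 FLOP/byte -> memory bound
# A diffusion step pushes a whole batch of latent "tokens" per weight load:
print(arithmetic_intensity(4096))  # 4096.0 FLOPs/byte -> compute bound

# The machine's balance point: workloads below it are memory bound,
# workloads above it are compute bound.
machine_balance = 100e12 / 1000e9  # 100 FLOPs/byte
```

Under these made-up numbers, decode sits far below the balance point (so quantization's bandwidth savings help a lot), while a batched diffusion step sits far above it (so they don't).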
GPU workloads are either compute bound (limited by floating-point operations) or memory bound (limited by bytes transferred across the memory hierarchy).
Quantizing in general helps with the memory bottleneck but does not reduce computational cost, so it's not as useful for improving the performance of diffusion models; that's what it's saying.
Exactly. The smaller bit widths from quantization might marginally decrease the compute required for each operation, but they don't reduce the overall number of operations. So the effect of quantization is generally more impactful on memory use than on compute.
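A toy accounting of the same point, again assuming a hypothetical 7B-parameter model (the figure is illustrative): quantization shrinks the bytes you move, while the multiply-accumulate count stays fixed.

```python
# Quantization shrinks the bytes moved, not the number of operations.
# Illustrative assumption: a hypothetical 7B-parameter model.
params = 7e9

def weight_traffic_bytes(bits_per_weight):
    return params * bits_per_weight / 8

macs_per_token = 2 * params              # multiply-accumulates: unchanged by precision
fp32_traffic = weight_traffic_bytes(32)  # 28 GB of weights read per forward pass
int4_traffic = weight_traffic_bytes(4)   # 3.5 GB read per forward pass

print(fp32_traffic / int4_traffic)  # 8.0x less memory traffic, same op count
```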
> To achieve measured speedups, both weights and activations must be quantized to the same bit width; otherwise, the lower precision is upcast during computation, negating any performance benefits.
tries to explain that.
What it means though is that if you only store the weights in lower precision but still upcast to, say, bf16 or fp32 to perform the operation, you're not getting any computational speedup. In fact, you're paying extra for upconverting and then downconverting afterwards.
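A minimal numpy sketch of that weight-only setup, assuming a simple symmetric per-tensor int8 scheme (the quantization recipe here is illustrative, not any particular library's): the weights shrink 4x in storage, but the matmul still runs in fp32, so you pay for the dequantize step instead of gaining faster arithmetic.

```python
import numpy as np

# Hypothetical weight matrix and activation vector.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # "weights"
x = rng.standard_normal(256).astype(np.float32)         # "activations"

# Store weights as int8 with one symmetric per-tensor scale: 4x less memory.
scale = np.abs(w).max() / 127
w_q = np.round(w / scale).astype(np.int8)

# The matmul still runs in fp32, so the int8 weights are upcast
# (dequantized) first -- this is the upcast the quote refers to.
# No compute saving, only a storage/bandwidth saving.
w_deq = w_q.astype(np.float32) * scale
y = w_deq @ x

# Close to the full-precision result; the cost of the round trip is the
# convert, not a faster multiply.
y_ref = w @ x
```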