1. It's not 100 TFLOPS - you need fp32 to accumulate the dot products, at which point you get much less. But even with fp16 accumulate (16/16), there's no way you'll get near the roof of that roofline model.
2. Each V100 is $7K
3. 4x V100s not only cost as much as a decent car, but require a specialized chassis and a specialized PSU: they draw 300W _sustained_ each (substantially more at momentary peaks), and need a powerful external fan to cool them properly.
I want 400 TFLOPS of bfloat16 dot product / convolution throughput under my desk, in a reasonably quiet, sub-1kW power envelope.
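For scale, here's a hedged back-of-envelope on what a 4x V100 box implies, using the ~$7K and 300W figures cited above plus NVIDIA's quoted ~125 TFLOPS fp16 tensor-core peak per card (all spec-sheet/list-price assumptions, not measurements):

```python
# Back-of-envelope for a 4x V100 workstation.
# All values are spec-sheet / list-price assumptions, not measurements.
num_gpus = 4
price_per_gpu_usd = 7_000       # approximate list price per V100
sustained_watts_per_gpu = 300   # sustained draw; momentary peaks run higher
peak_fp16_tflops_per_gpu = 125  # NVIDIA's quoted tensor-core peak

print(f"Cost:  ${num_gpus * price_per_gpu_usd:,}")                  # $28,000
print(f"Power: {num_gpus * sustained_watts_per_gpu} W sustained")   # 1200 W
print(f"Peak:  {num_gpus * peak_fp16_tflops_per_gpu} TFLOPS fp16")  # 500
```

Note that the GPUs alone already blow past the sub-1kW envelope before counting CPU, fans, and PSU losses.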
1) Not true. I have one and hit 100 TFLOPS with large enough batches in fp32-accumulate mode. Other benchmarks agree, so I'm not sure what numbers you're referring to.
https://arxiv.org/pdf/1803.04014.pdf - no matter what they did, they could not get beyond 83 TFLOPS in fp16. And that's just matrix multiply. Any real deep learning workload is going to be a lot slower than that.
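A rough sanity check on those figures, assuming the published V100 specs (640 tensor cores, ~1530 MHz boost clock, each core doing a 4x4x4 FMA per clock = 64 multiply-accumulates = 128 flops):

```python
# Back-of-envelope peak tensor-core throughput for a V100,
# from published specs (assumptions, not measurements).
tensor_cores = 640
boost_clock_hz = 1530e6
flops_per_core_per_clock = 4 * 4 * 4 * 2  # 64 MACs = 128 flops

peak_tflops = tensor_cores * boost_clock_hz * flops_per_core_per_clock / 1e12
print(f"V100 peak tensor throughput: {peak_tflops:.0f} TFLOPS")  # 125

# The 83 TFLOPS fp16 matmul figure from the linked paper, as a fraction of peak:
print(f"Fraction of peak achieved: {83 / peak_tflops:.0%}")  # 66%
```

So even the paper's best pure-matmul result sits at roughly two-thirds of the roofline, which is consistent with the claim that real workloads won't get near the quoted peak.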