
Because:

1. It's not really 100 TFLOPS - you need fp32 accumulation for the dot products, at which point you get much less. And even if you run fp16-in/fp16-accumulate, there's no way you'll really get near the roof of that roofline model.

2. Each V100 is $7K

3. 4x V100s not only cost as much as a decent car, they also require a specialized chassis and a specialized PSU: they draw 300W _sustained_ each (substantially more at momentary peaks) and need a powerful external fan to cool them properly.

I want 400 TFLOPS of bfloat16 dot-product/convolution throughput under my desk, in a reasonably quiet, sub-1kW power envelope.
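For what it's worth, that wish is roughly in line with where the hardware already is on an efficiency basis. A quick back-of-envelope check, assuming NVIDIA's published V100 figures (~125 TFLOPS peak tensor throughput at a 300W TDP; these numbers are my assumption, not from the thread):

```python
# Perf-per-watt comparison: a V100 vs. the wished-for 400 TFLOPS / 1 kW box.
# Assumed specs: V100 peak tensor throughput ~125 TFLOPS, 300 W TDP.
v100_tflops = 125
v100_watts = 300
v100_eff = v100_tflops / v100_watts * 1000  # GFLOPS per watt, ~417

wish_tflops = 400
wish_watts = 1000
wish_eff = wish_tflops / wish_watts * 1000  # 400 GFLOPS per watt

print(f"V100: ~{v100_eff:.0f} GFLOPS/W; wished-for box: {wish_eff:.0f} GFLOPS/W")
```

So 400 TFLOPS in 1kW asks for slightly *worse* efficiency than a V100's peak spec - the hard parts are the price, the chassis, and actually sustaining that throughput, not the GFLOPS/W.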



1) Not true. I have one and hit 100 TFLOPS with large enough batches in fp32-accumulate mode. Other benchmarks agree, so I'm not sure where your numbers come from.

4) maybe get a cluster of Jetsons?


https://arxiv.org/pdf/1803.04014.pdf - no matter what they tried, they could not get beyond 83 TFLOPS in fp16. And that's for a bare matrix multiply; any real deep learning workload is going to be a lot slower than that.
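For context, here is how that 83 TFLOPS figure compares to the theoretical tensor-core peak. The per-core numbers below are my assumption from NVIDIA's GV100 specs (640 tensor cores, each performing a 4x4x4 matrix FMA = 128 FLOPs per clock), not something stated in the thread:

```python
# Theoretical GV100 tensor-core peak vs. the ~83 TFLOPS measured in the paper.
# Assumed specs: 640 tensor cores, 128 FLOPs per core per clock
# (a 4x4x4 FMA is 64 multiply-adds = 128 FLOPs), ~1.53 GHz boost clock.
tensor_cores = 640
flops_per_core_per_clock = 128
boost_ghz = 1.53

peak_tflops = tensor_cores * flops_per_core_per_clock * boost_ghz / 1000
measured_tflops = 83
print(f"theoretical peak ~{peak_tflops:.0f} TFLOPS, measured {measured_tflops} "
      f"TFLOPS ({measured_tflops / peak_tflops:.0%} of peak)")
```

So even a heavily tuned pure matmul lands around two-thirds of the datasheet number, which is the usual gap between marketing peak and achievable throughput.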


I have results showing otherwise. Maybe an old CUDA version?

Here is AnandTech showing almost 100 TFLOPS, and they didn't even try hard to tune it:

https://www.anandtech.com/show/12170/nvidia-titan-v-preview-...



