1. It's not 100 TFLOPS - you need fp32 to accumulate the dot products, at which point you get much less. But even with fp16 accumulate (16/16), there's no way you'll get near the roof of that roofline model.
2. Each V100 is $7K
3. 4x V100s not only cost as much as a decent car, but require a specialized chassis and a specialized PSU: they draw 300W _sustained_ each (substantially more at momentary peaks), and need a powerful external fan to cool them properly.
I want 400 TFLOPS of bfloat16 dot product / convolution throughput under my desk, in a reasonably quiet, sub-1kW power envelope.
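For scale, here's a hedged back-of-envelope on what a 4x V100 box implies, using the ~$7K and 300W figures cited above plus NVIDIA's quoted ~125 TFLOPS fp16 tensor-core peak per card (all spec-sheet/list-price assumptions, not measurements):

```python
# Back-of-envelope for a 4x V100 workstation.
# All values are spec-sheet / list-price assumptions, not measurements.
num_gpus = 4
price_per_gpu_usd = 7_000       # approximate list price per V100
sustained_watts_per_gpu = 300   # sustained draw; momentary peaks run higher
peak_fp16_tflops_per_gpu = 125  # NVIDIA's quoted tensor-core peak

print(f"Cost:  ${num_gpus * price_per_gpu_usd:,}")                  # $28,000
print(f"Power: {num_gpus * sustained_watts_per_gpu} W sustained")   # 1200 W
print(f"Peak:  {num_gpus * peak_fp16_tflops_per_gpu} TFLOPS fp16")  # 500
```

Note that the GPUs alone already blow past the sub-1kW envelope before counting CPU, fans, and PSU losses.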
1) Not true. I have one and hit 100 TFLOPS with large enough batches in fp32-accumulate mode. Other benchmarks agree, so I'm not sure what numbers you're referring to.
https://arxiv.org/pdf/1803.04014.pdf - no matter what they did, they could not get beyond 83 TFLOPS in fp16. And that's just matrix multiply. Any real deep learning workload is going to be a lot slower than that.
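A rough sanity check on those figures, assuming the published V100 specs (640 tensor cores, ~1530 MHz boost clock, each core doing a 4x4x4 FMA per clock = 64 multiply-accumulates = 128 flops):

```python
# Back-of-envelope peak tensor-core throughput for a V100,
# from published specs (assumptions, not measurements).
tensor_cores = 640
boost_clock_hz = 1530e6
flops_per_core_per_clock = 4 * 4 * 4 * 2  # 64 MACs = 128 flops

peak_tflops = tensor_cores * boost_clock_hz * flops_per_core_per_clock / 1e12
print(f"V100 peak tensor throughput: {peak_tflops:.0f} TFLOPS")  # 125

# The 83 TFLOPS fp16 matmul figure from the linked paper, as a fraction of peak:
print(f"Fraction of peak achieved: {83 / peak_tflops:.0%}")  # 66%
```

So even the paper's best pure-matmul result sits at roughly two-thirds of the roofline, which is consistent with the claim that real workloads won't get near the quoted peak.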