Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Which one of these algos does Intel/Nvidia/Google use in their AVX/Cuda/TPU implementations?


Simple block partitioning, usually 4x4, with the full ordinary algorithm per block. This is the most numerically-stable and parallel.

https://developer.nvidia.com/blog/cutlass-linear-algebra-cud...


However, on recent CPUs 4x4 is small for the innermost block size of the non-trivial hierarchy you need. You can see examples under https://github.com/flame/blis/tree/master/config with an a priori procedure for determining them in https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analyti... (but compare with what's actually used for SKX, in particular). OpenBLAS will normally be similar, though it may come out somewhat faster, but it's easier to see in BLIS.


and the most cache-friendly


Strassen's method is the only one practical on modern hardware


If it's of interest, there's an account of fairly recent work in the "Strassen reloaded" paper https://dl.acm.org/doi/10.5555/3014904.3014983 or otherwise in https://www.cs.utexas.edu/users/flame/pubs/FLAWN79.pdf


And only in fixed precision.


What makes you say that? I've had good speedups with Strassen multiplication in variable precision (or with floating-point, in case you meant fixed-point arithmetic).


Strassen method is faster than native multiplication, but it is not stable: you will get much larger rounding errors when implementing it using floating point numbers, compared to native multipllication. And fine precision is required for a lot of algorithm implemented using matrix multiplication, such as matrix inversion or gradient descent, so this is often a problem.


do we know if the rounding errors are a big deal for numerical methods that can tolerate inaccuracy (like gradient descent for machine learning)?


It depends. I have seen some algorithms (the example that comes to mind was a clustering) become worse solely due to numerical error.

When that happens, if you are not equiped to measure the numerical error or at least trained to suspect it, you might think that it is just the algorithm that is not working.


Rounding errors, underflows and instability in floats are very well known problems, a big deal if you do anything but graphics.


You want to converge on a local minimum, don't you? We can't guarantee that with unstable algorithms.


Speed is fine. The issue is numerical stability.


Yes, indeed, and Strassen is a bad default algorithm because of this. There are specialized situations where the numerical stability isn't an issue though.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: