However, on recent CPUs 4x4 is small for the innermost block size of the non-trivial blocking hierarchy you need. You can see examples under https://github.com/flame/blis/tree/master/config, with an a priori procedure for determining them in https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analyti... (but compare with what's actually used for SKX, in particular). OpenBLAS will normally be similar, though it may come out somewhat faster; the structure is just easier to see in BLIS.
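To make the hierarchy concrete, here is a minimal cache-blocking sketch in Python/NumPy with purely illustrative block sizes (`mb`, `nb`, `kb` are my placeholders, not values from BLIS). The real BLIS design adds packing and a register-level micro-kernel whose shape is derived per microarchitecture, which is what the linked config files encode.

```python
import numpy as np

def blocked_matmul(A, B, mb=64, nb=64, kb=64):
    # Cache-blocking sketch: three outer loops over blocks, with the
    # innermost block product delegated to NumPy. A real BLAS adds more
    # levels (packing into contiguous buffers, then a register-tiled
    # micro-kernel much larger than 4x4 on recent CPUs).
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(0, m, mb):
        for j in range(0, n, nb):
            for p in range(0, k, kb):
                C[i:i+mb, j:j+nb] += A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb]
    return C
```

The slicing handles ragged edges automatically, so the matrix dimensions need not be multiples of the block sizes.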
What makes you say that? I've had good speedups with Strassen multiplication in variable precision (or with floating-point, in case you meant fixed-point arithmetic).
Strassen's method is faster than naive multiplication, but it is not as stable: you will get much larger rounding errors when implementing it with floating-point numbers, compared to naive multiplication. And high precision is required for a lot of algorithms implemented using matrix multiplication, such as matrix inversion or gradient descent, so this is often a problem.
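For reference, one level of the standard Strassen step looks like this (a sketch on 2x2 blocks; the worse error bound comes from the extra additions and subtractions, whose cancellation can amplify rounding error as the recursion deepens):

```python
import numpy as np

def strassen_2x2(A, B):
    # One level of Strassen: 7 multiplications instead of 8, at the
    # cost of 18 additions/subtractions. Entries may themselves be
    # sub-blocks in the recursive version.
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])
```

In exact arithmetic this agrees with the naive product; in floating point, sums like `a + d` and differences like `m1 - m2` are where the extra rounding enters.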
It depends. I have seen some algorithms (the example that comes to mind was a clustering algorithm) become worse solely due to numerical error.
When that happens, if you are not equipped to measure the numerical error, or at least trained to suspect it, you might think that it is just the algorithm that is not working.
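One simple way to get equipped: compare the result against a higher-precision reference computed from the same inputs. A sketch, where `matmul` stands for whatever implementation you want to test:

```python
import numpy as np

def relative_error(A, B, matmul):
    # Max relative error of a float32 matmul against a float64
    # reference computed from the same inputs.
    ref = A.astype(np.float64) @ B.astype(np.float64)
    got = np.asarray(matmul(A.astype(np.float32), B.astype(np.float32)),
                     dtype=np.float64)
    return np.abs(got - ref).max() / np.abs(ref).max()

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 200))
print(relative_error(A, B, np.matmul))  # small but nonzero for float32
```

Plugging a Strassen implementation in as `matmul` and comparing the two numbers makes the stability difference measurable rather than a vague suspicion.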
Yes, indeed, and Strassen is a bad default algorithm because of this. There are specialized situations where the numerical stability isn't an issue though.