However, on recent CPUs 4x4 is small for the innermost block size of the non-trivial blocking hierarchy you need. You can see examples under https://github.com/flame/blis/tree/master/config, with an a priori procedure for determining them in https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analyti... (but compare with what's actually used for SKX, in particular). OpenBLAS will normally be similar, though it may come out somewhat faster; the structure is just easier to see in BLIS.
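To make the hierarchy concrete, here is a minimal cache-blocking sketch in Python/NumPy with purely illustrative block sizes (`mb`, `nb`, `kb` are my placeholders, not values from BLIS). The real BLIS design adds packing and a register-level micro-kernel whose shape is derived per microarchitecture, which is what the linked config files encode.

```python
import numpy as np

def blocked_matmul(A, B, mb=64, nb=64, kb=64):
    # Cache-blocking sketch: three outer loops over blocks, with the
    # innermost block product delegated to NumPy. A real BLAS adds more
    # levels (packing into contiguous buffers, then a register-tiled
    # micro-kernel much larger than 4x4 on recent CPUs).
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(0, m, mb):
        for j in range(0, n, nb):
            for p in range(0, k, kb):
                C[i:i+mb, j:j+nb] += A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb]
    return C
```

The slicing handles ragged edges automatically, so the matrix dimensions need not be multiples of the block sizes.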
What makes you say that? I've had good speedups with Strassen multiplication in variable precision (or with floating-point, in case you meant fixed-point arithmetic).
Strassen's method is faster than naive multiplication, but it is not as stable: you will get much larger rounding errors when implementing it with floating-point numbers, compared to naive multiplication. And high precision is required for a lot of algorithms implemented using matrix multiplication, such as matrix inversion or gradient descent, so this is often a problem.
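For reference, one level of the standard Strassen step looks like this (a sketch on 2x2 blocks; the worse error bound comes from the extra additions and subtractions, whose cancellation can amplify rounding error as the recursion deepens):

```python
import numpy as np

def strassen_2x2(A, B):
    # One level of Strassen: 7 multiplications instead of 8, at the
    # cost of 18 additions/subtractions. Entries may themselves be
    # sub-blocks in the recursive version.
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])
```

In exact arithmetic this agrees with the naive product; in floating point, sums like `a + d` and differences like `m1 - m2` are where the extra rounding enters.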
It depends. I have seen some algorithms (the example that comes to mind was a clustering algorithm) become worse solely due to numerical error.
When that happens, if you are not equipped to measure the numerical error, or at least trained to suspect it, you might think that it is just the algorithm that is not working.
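One simple way to get equipped: compare the result against a higher-precision reference computed from the same inputs. A sketch, where `matmul` stands for whatever implementation you want to test:

```python
import numpy as np

def relative_error(A, B, matmul):
    # Max relative error of a float32 matmul against a float64
    # reference computed from the same inputs.
    ref = A.astype(np.float64) @ B.astype(np.float64)
    got = np.asarray(matmul(A.astype(np.float32), B.astype(np.float32)),
                     dtype=np.float64)
    return np.abs(got - ref).max() / np.abs(ref).max()

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 200))
print(relative_error(A, B, np.matmul))  # small but nonzero for float32
```

Plugging a Strassen implementation in as `matmul` and comparing the two numbers makes the stability difference measurable rather than a vague suspicion.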
Yes, indeed, and Strassen is a bad default algorithm because of this. There are specialized situations where the numerical stability isn't an issue though.