Unfortunately, we can't change the past, and seemingly in the past it wasn't worth it to have a fast One True memcpy (and perhaps to a decent extent still isn't). I'm still typing this on a Haswell CPU, which doesn't have FSRM (a rep movsb of 16 bytes in a loop takes ~10ns, i.e. ~36 cycles, per iteration on average).
But yeah, it does seem that the 128-byte figure from my quick search was wrong. (Though gcc & clang with `-march=alderlake` both never generate `rep movsb` at `-O3`; at `-Os` gcc starts emitting a rep movsb for ≥65B, and clang still never does.)
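
For anyone wanting to reproduce the kind of rep movsb number quoted above, here is a rough sketch of such a loop. It is not the exact code behind that measurement: `copy_rep_movsb` and `ns_per_copy16` are names made up for illustration, the inline asm assumes GCC or Clang on x86-64 (falling back to memcpy elsewhere), and `clock()` is a coarse timer, so only averages over many iterations mean anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy n bytes with `rep movsb` on x86-64
   (GCC/Clang inline asm); fall back to memcpy on other targets
   so the sketch still compiles. */
static void copy_rep_movsb(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* rdi = destination, rsi = source, rcx = byte count.
       The ABI guarantees the direction flag is clear here. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Hypothetical helper: average nanoseconds per 16-byte copy,
   measured over `iters` back-to-back copies. */
static double ns_per_copy16(long iters) {
    unsigned char src[16], dst[16];
    for (int i = 0; i < 16; i++) {
        src[i] = (unsigned char)i;
        dst[i] = 0;
    }

    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        copy_rep_movsb(dst, src, sizeof src);
    clock_t t1 = clock();

    /* Sanity check that the copy actually happened. */
    assert(memcmp(dst, src, sizeof src) == 0);

    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Calling `ns_per_copy16(100000000L)` and multiplying the result by the core clock in GHz gives cycles per iteration (~10ns = 36 cycles corresponds to about 3.6 GHz). Results will vary a lot by microarchitecture, which is rather the point.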