My pet peeve: using table lookups so a benchmark shows up as faster, when in reality your L1 cache is going to get stomped on. Not only will you be waiting for the L1 cache to repopulate, but you also evict all the useful data.
I've seen this elsewhere too. For a relatively mundane task (I think it was Morton code conversion), a giant lookup table was constructed, and it gave a nice 2x-4x performance improvement.
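To make the trade-off concrete, here's a minimal sketch (my own illustration, not the code from that project) of the two approaches for Morton encoding: a branch-free bitwise spread versus a precomputed per-byte table. Both produce the same result; the table version just pays for its speed in cache footprint.

```python
def part1by1(x: int) -> int:
    """Spread the low 16 bits of x so a zero bit sits between each bit."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def morton_bitwise(x: int, y: int) -> int:
    """Interleave the bits of 16-bit x and y using only registers."""
    return part1by1(x) | (part1by1(y) << 1)

# The "fast" variant: a 256-entry table of pre-spread bytes. In a tight
# benchmark loop it stays hot and wins; in a real workload those cache
# lines compete with the application's own data.
SPREAD = [part1by1(i) for i in range(256)]

def morton_table(x: int, y: int) -> int:
    """Same interleave, one table lookup per input byte."""
    return (SPREAD[x & 0xFF]
            | (SPREAD[(x >> 8) & 0xFF] << 16)
            | (SPREAD[y & 0xFF] << 1)
            | (SPREAD[(y >> 8) & 0xFF] << 17))
```

A 256-entry table of ints is small, but real-world versions of this trick often use 64 KiB (16-bit index) tables, which is larger than most L1 data caches all by itself.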
But no application will ever be doing only string despacing or Morton codes, so the "fast" lookup table algorithm will make everything else slower by evicting good cache lines. And once something else runs and evicts the lookup tables, the next run will be slow again.
https://raw.githubusercontent.com/lemire/despacer/master/inc...