Someone file a compiler bug? A 14x difference between the readable code and the optimized code is a lot. The first version is extremely straightforward; you shouldn't have to deal with that SIMD mess manually.
Failure to optimize is never a bug, by definition. Automatic vectorization of sequential code is a very difficult problem in general, especially when the code depends on testing single bytes of the input, which can only be vectorized with some very clever bit twiddling.
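To give a flavor of the bit twiddling involved, here is a minimal sketch of a well-known SWAR trick that tests eight bytes at once using ordinary 64-bit arithmetic. It is not taken from the article; the function names (`has_zero_byte`, `has_byte`, `contains_byte`) are made up for illustration, and real SIMD code would use wider registers and intrinsics.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one of the 8 bytes packed in v is 0x00. */
static uint64_t has_zero_byte(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Nonzero iff at least one byte of v equals c.
 * XOR with c repeated in every byte turns matching bytes into 0x00. */
static uint64_t has_byte(uint64_t v, uint8_t c) {
    return has_zero_byte(v ^ (0x0101010101010101ULL * c));
}

/* Returns 1 if buf[0..len) contains c, else 0 (illustrative helper). */
static int contains_byte(const unsigned char *buf, size_t len, uint8_t c) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);          /* unaligned-safe load */
        if (has_byte(w, c)) return 1;
    }
    for (; i < len; i++)                 /* scalar tail, fewer than 8 bytes */
        if (buf[i] == c) return 1;
    return 0;
}
```

Even this only answers "is there a match somewhere in these eight bytes". Getting from a one-byte-at-a-time `if` to something like this is a rewrite of the algorithm, not a loop transformation.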
Can you come up with a vectorization optimizer pass that could do what the OP did in the article?
This isn't a compiler bug or a shortcoming; it's a non-trivial optimization. The aliasing between the input and output (i.e. it's an in-place algorithm) makes it especially difficult.
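To illustrate the aliasing point, here is a hypothetical in-place byte filter. It is not the article's code, just a sketch of the general shape, and `remove_spaces_in_place` is an invented name.

```c
#include <stddef.h>

/* Hypothetical example: drop every ' ' from buf in place and
 * return the new length. Reads and writes share one buffer. */
static size_t remove_spaces_in_place(char *buf, size_t len) {
    size_t out = 0;                      /* write cursor, always <= in */
    for (size_t in = 0; in < len; in++) {
        if (buf[in] != ' ')
            buf[out++] = buf[in];        /* store address depends on the data so far */
    }
    return out;
}
```

Where each byte gets written depends on how many earlier bytes were kept, and the destination overlaps the source. A vectorizer would have to prove the stores never clobber bytes that are still to be read, and then invent a compaction step on its own.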