movdqu is 2 issue cycles, not 4. What he may be alluding to is the huge cost of loading across a cacheline boundary, which is ~14 issue cycles (equivalent to an L1 cache miss). This doesn't apply to loads which are split exactly evenly across a cacheline boundary, i.e. 8 bytes on each side. You should be able to use this information to figure out how unaligned loads are implemented on all Intel chips.
It may be worthwhile for Core 2 to do a bunch of aligned loads and palignr them together, but I didn't feel like testing, as it would certainly have been slower on my i7. Patches welcome ;)
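For anyone trying to picture the cacheline cases mentioned above, here is a minimal sketch in C that classifies a 16-byte load by address. It assumes 64-byte cachelines, which is not stated in this thread. The names classify_load, CACHELINE and LOAD_SIZE are made up for illustration. It only sorts addresses into the three cases; it measures nothing.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines (not stated in the thread). */
#define CACHELINE 64
/* A movdqu load reads 16 bytes. */
#define LOAD_SIZE 16

enum load_kind {
    LOAD_WITHIN_LINE,  /* all 16 bytes sit in one cacheline */
    LOAD_SPLIT_EVEN,   /* 8 bytes on each side of the boundary */
    LOAD_SPLIT_UNEVEN  /* any other split: the expensive case */
};

/* Classify a 16-byte load starting at addr (hypothetical helper). */
static enum load_kind classify_load(uintptr_t addr)
{
    size_t offset = (size_t)(addr % CACHELINE);

    if (offset + LOAD_SIZE <= CACHELINE)
        return LOAD_WITHIN_LINE;

    /* Number of bytes that fall in the first cacheline. */
    size_t first = CACHELINE - offset;
    return first == LOAD_SIZE / 2 ? LOAD_SPLIT_EVEN : LOAD_SPLIT_UNEVEN;
}
```

With these assumptions, of the 64 possible offsets within a line, 49 stay inside one line, 1 (offset 56) is the even 8/8 split, and the remaining 14 are uneven splits.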
I'm a n00b wrt SIMD, but is it really better to do unaligned loads instead of aligned loads + shift or shuffle?
According to latency tables on Core 2 loading 128 bits from memory to an SSE register
Also, movdqu seems to have 9 µops vs. only 1 µop for movdqa. Wouldn't there be bad throughput on those chunks of 4 movdqu? [All this is only according to the Intel manuals and Agner's tables, i.e. no testing.]
How do you guys test implementation performance?
Thanks!