
The fact that Intel and AMD apparently don't prioritize integer division could suggest that their profiling suggests it's not worth it, but with Apple's transistor budget at the moment they can afford it.

Also keep in mind that this Xeon might not really be made for number crunching (not really sure)?



> The fact that Intel and AMD apparently don't prioritize integer division could suggest that their profiling suggests it's not worth it, but with Apple's transistor budget at the moment they can afford it.

Another possibility is that Apple has a very different profiling base, e.g. iOS applications, whereas Intel and AMD rely on more artificial workloads, or are bound by workloads/profiles from scientific computing, video games, or the like.


Intel greatly improved their divider implementation between Skylake and Ice Lake. The measurements in the OP are on Skylake-SP, prior to these improvements.


Apple's chip designers have the advantage, I assume, of being able to wander down a hallway and ask what the telemetry from iOS and macOS devices is telling them about real-world use.


They also have the advantage of controlling the compiler used by effectively the entire development ecosystem. And they have groomed that ecosystem through carrots and sticks to upgrade their compilers and rebuild their applications regularly (or use bitcode!).

A compiler code generator that knows about a hypothetical two divide units (or just a much more efficient single unit) could be much more effective at statically scheduling around them.

I’d guess that the bulk of the software running on the highest margin Intel Xeons was compiled some years ago and tuned for microarchitectures even older.
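For the constant-divisor case the code generator already sidesteps the divider entirely: both GCC and LLVM replace division by a compile-time constant with a multiply-and-shift. A minimal sketch of the trick for dividing a 32-bit value by 10 (the magic constant is the standard one from Hacker's Delight, not pulled from either compiler's source):

```c
#include <stdint.h>

// n / 10 without a divide instruction: multiply by the fixed-point
// reciprocal 0xCCCCCCCD ~= 2^35 / 10, then shift right by 35.
// Exact for every uint32_t input.
static uint32_t div10(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDULL) >> 35);
}
```

Which is why divider latency mostly matters for divisions whose divisor is only known at run time; anything the compiler can see becomes a multiply regardless of scheduling.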


>A compiler code generator that knows about a hypothetical two divide units (or just a much more efficient single unit) could be much more effective at statically scheduling around them.

I'm still completely blind to how they are actually used but GCC and LLVM both have pretty good internal representations of the microarchitecture they are compiling for. If I ever work it out I'll write a blog post about it, but this is an area where GCC and LLVM are both equally impenetrable.



I meant the actual scheduling algorithm - from what I can tell, GCC seems to basically use an SM-based in-order scheduler with the aim of not stalling the decoder. Currently, I'm mostly interested in basic block scheduling rather than trace scheduling or anything of that order.


Most of Intel's volume is probably shipped to customers who either don't care or buy a lot of CPUs in one go, so the advantage of this probably isn't quite as apparent as you'd imagine.

What can definitely play a role (I don't think it's as much of a problem these days, but it definitely has been in the past) is the standard "benchmark" suites that chipmakers beat each other over the head with. E.g. I think it was Itanium that had a bunch of integer functional units mainly for the purpose of getting better SPEC numbers, rather than working on the things that actually make programs fast (MEMORY) - I was maybe 1 or 2 when this chip came out, so this is nth-hand gossip, however.


The Itanium did care about memory; it's one of the reasons it had massive caches compared to other archs of the era.


Would it be fair to characterize the M1 as being made for number crunching?


Not any more than an Intel/AMD/etc. CPU is. That Xeon is going to crunch more numbers, just due to having more cores.


If it was intended to be used in the cloud, for example, it's going to be doing more work overall, but it was probably designed around memory-bound loads rather than integer throughput.


Refining "number crunching" to mean single-threaded performance, I would say yes, or at least definitely more so than the Intel chip.



