This "debunking" is itself mostly plausible-sounding bunk.
It gets a lot of details simply wrong. For example, the 68030 wasn't "around 100000 transistors", it was 273000 [1]. The 80386 was very similar at 275000 [2]. By comparison, the ARM1 was around 25000 transistors [3], and yet delivered comparable or better performance. That's a factor of 10! So RISC wasn't just a slight re-allocation of available resources, it was a massive leap.
Furthermore, the problem with the complex addressing modes in CISC machines wasn't just a matter of a tradeoff vs. other things this machinery could be used for, the problem was that compilers weren't using these addressing modes at all. And since the vast majority of software was written in high-level language and thus via compilers, the chip area and instruction space dedicated to those complex instructions was simply wasted. And one of the reasons that compilers used sequences of simple instructions instead of one complex instruction was that even on CISCs, the sequence of simple instructions was often faster than the single complex instruction.
Calling the seminal book by Turing award winners Patterson and Hennessy "horrible" without any discernible justification is ... well it's an opinion, and everybody is entitled to their opinion, I guess. However, when claiming that "Everything you know about RISC is wrong", you might want to actually provide some evidence for your opinions...
Or this one: "These 32-bit Unix systems from the early 1980s still lagged behind DEC's VAX in performance." What "early 1980s" 32-bit Unix systems were these? The Mac came out in 1984, and it had the 16-bit 68000 CPU. The 68020 was only launched in 1984; I doubt many 32-bit designs based on it made it out the door in the "early 1980s". The first 32-bit Sun, the 68020-based Sun-3, was launched in September of 1985, so in the second half of the 1980s. I don't think that qualifies as "early". And of course the Sun-3 was faster than the VAX-11. The VAX 8600 and later models were introduced around the same time as the Sun-3.
Or "it's the thing that nobody talks about: horizontal microcode". Hmm...actually everybody talked about the RISC CPUs not having microcode, at least at the time. So I guess it's technically true that "nobody" talked about horizontal microcode...
He seems to completely miss one of the major simplifying benefits of a load/store architecture: simplified page fault handling. When you have a complex instruction with possibly multiple references to memory, each of those references can cause a fault, so you need complex logic to back out of and restart those instructions at different stages. With a load/store architecture, the instruction that faults is a load. Or a store. And that's all it does.
It also isn't true that it was the Pentium and OoO that beat the competing RISCs. Intel was already doing that earlier, with the 386 and 486. What allowed Intel to beat superior architectures was that Intel was always at least one fab generation ahead. And being one fab generation ahead meant that they had more transistors to play with (Moore's Law) and those transistors were faster/used less power (Dennard scaling). Their money generated an advantage that sustained the money that sustained the advantage.
As stated above, the 386 had 10x the transistors of the ARM1. It also ran at a significantly faster clock speed (16-25 MHz vs. 8 MHz). With comparable performance. But comparable performance was more than good enough when you had the entire software ecosystem behind you, efficiency be damned. Advantage Wintel.
Now that Dennard scaling has been dead and buried for a while, Moore's Law is slowing, and Intel is no longer one fab generation ahead, x86 is behind ARM, and not by a little either. Superior architecture can finally show its superiority in general-purpose computing, and not just in extremely power-sensitive applications. (Well, part of the reason is that power consumption has a way of dominating even general-purpose computing.)
That doesn't mean that everything he writes is wrong; it certainly is true that a complex OoO Pentium and a complex OoO PowerPC were very similar, and only a small percentage of the overall logic was decode.
But I don't think his overall conclusion is warranted, and with so much of what he writes being simply wrong, the rest, which is more hand-wavy, doesn't convince. Just because instruction decode is not a big part of the chip doesn't mean it can't be important for performance. For example, it is claimed that one of the reasons the M1 is comparatively faster than x86 designs is that it has more instruction decode units. And the reason for that is not so much that a decode unit takes so much less space, but that the units can operate independently, whereas with a variable-length instruction stream you need all sorts of interconnects between the decode units, and these interconnects add significant complexity and latency.
Right now, RISC, in the form of ARM in general and Apple's MX CPUs in particular, is eating x86's lunch, and no, it's not a coincidence.
I just returned my Intel MacBook to my former employer, and good riddance. My M1 is sooooo much better in just about every respect that it's not even funny.
I strongly agree with the point
> the problem was that compilers weren't using these addressing modes at all
At least in the '80s, microcomputer compilers were very primitive compared to what we have now, which maintained a strong need for assembly. Dev tools used to be very expensive and proprietary, too.
GCC slowly started to change that, beginning in 1987.
So there was a time when most software was compiled from high-level languages, but with stupid compilers, and CPU designers had to live with that.
>whereas with a variable length instruction stream you need all sorts of interconnects between the decode units, and these interconnects add significant complexity and latency.
I find it worth noting that this is not always the case.
E.g. the RISC-V C extension provides variable-length instructions, but they're still either 16 or 32 bits.
Special care has been put into making the decoding overhead of dealing with this situation negligible, and it is indeed so. Transistor-budget-wise, there's a benefit the moment there's any on-die cache or on-die ROM. Any chip that's smaller than that is going to be very specialized and can simply omit C; in any chip that's larger, C is a net benefit.
As a practical example, the RISC-V-based Ascalon from Jim Keller's team is an 8-wide (like the M1), 10-issue CPU.
However, you're absolutely right that the wild sort of variable instruction length seen in CISC architectures like x86 is a huge issue that massively complicates implementations and outright imposes a practical limit on decoder width.
OTOH, in AArch64 the adoption of a fixed instruction size, which tanked code density, was unenlightened to the point of brain-dead; we see the cache sizes the M1/M2 need just to deal with this. I'm afraid ARM will be gone for other reasons (non-technical, to do with mismanagement) before they have a chance to correct course and re-introduce compressed instructions.
As for the rest of the article, I generally agree with you that it presents outright wrong information as facts and then tries to push the wrong conclusion. It is utter bull, practically nothing of value can be found in there. I'm not even surprised, as it is pretty much the norm in RISC opposition.
> e.g. RISC-V C extension provides variable length instructions, but they're still either 16 or 32 bit.
It's more than that. In RISC-V, you only need the first two bits of each instruction to determine whether it's a 16 bit or 32 bit instruction; you don't need to decode an instruction to know its length.
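That length rule is simple enough to fit in one line. A minimal sketch (assuming only 16- and 32-bit encodings exist, i.e. the base ISA plus the C extension; the encoding patterns the spec reserves for longer formats are ignored here):

```python
def insn_len_bits(parcel: int) -> int:
    """Length in bits of a RISC-V instruction, determined from the low
    two bits of its first 16-bit parcel. Assumes only 16- and 32-bit
    encodings (base ISA + C extension)."""
    # low two bits == 0b11 -> 32-bit; any other value -> compressed 16-bit
    return 32 if (parcel & 0b11) == 0b11 else 16
```

Note that no opcode decoding happens at all; the length falls out of two bits.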
> [...] we see the cache sizes M1/M2 need just to deal with this, [...]
Do the M1/M2 need these cache sizes, or do they have these cache sizes because they can have these cache sizes, due to having a 4x larger page size by default? (Normally, page size wouldn't be that much of a problem for instruction caches, but for x86 it is because the x86 ISAs don't require explicit instruction cache invalidation on self-modifying code; x86 processors would likely have larger L1 instruction cache sizes if they could get away with it.)
> In RISC-V, you only need the first two bits of each instruction to determine whether it's a 16 bit or 32 bit instruction
Isn't it one bit in the beginning(?) of each 16-bit instruction? So a 32-bit instruction has this information duplicated in the same place in the latter 16-bit half, since a decoder has to be able to decide whether it's trying to decode a 16-bit instruction or whether it's in the middle of a 32-bit instruction.
The above assuming that the common strategy for implementing a parallel decoder for RVC is to start decoding at each 16-bit offset, and then throw away those cases where it turns out that it was in the middle of a 32-bit instruction, and that RVC has been designed with this implementation strategy in mind.
> Isn't it one bit in the beginning(?) of each 16-bit instruction?
No, it's the first two bits of every instruction (RISC-V is little-endian, so these are the least-significant bits). Two bits have four possible values, three of them are for 16-bit instructions, one of them is for 32-bit instructions.
> So a 32-bit instruction has this information duplicated in the same place in the latter 16-bit half, since a decoder has to be able to decide whether it's trying to decode a 16-bit instruction or whether it's in the middle of a 32-bit instruction.
No, that information is not duplicated. The decoder cannot know whether it's in the middle of a 32-bit instruction or not; it has to decode the length of all preceding instructions. That's why it's important that you can know the instruction length without decoding the instruction, so that simple logic can tell decoders other than the first whether they're in the start or in the middle of an instruction.
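To illustrate the boundary computation (a hedged sketch, not how real hardware is wired): given a buffer of 16-bit parcels, the instruction start positions fall out of a scan of just those two low bits per instruction, without decoding anything else.

```python
def insn_starts(parcels):
    """Return the indices of the 16-bit parcels that begin an instruction.
    Illustrative only: a wide hardware decoder computes the same boundary
    mask with a short carry chain across decoder slots, not a sequential
    software loop. Assumes only 16- and 32-bit encodings."""
    starts, i = [], 0
    while i < len(parcels):
        starts.append(i)
        # low two bits 0b11 -> 32-bit instruction spanning two parcels
        i += 2 if (parcels[i] & 0b11) == 0b11 else 1
    return starts
```

E.g. for the parcel stream `[0x0001, 0xFFFF, 0x1234, 0x4003, 0x0000]`, the starts are indices 0, 1, and 3: the parcels at indices 2 and 4 are the second halves of 32-bit instructions, which is exactly the information each decoder slot after the first needs to be told.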
Huh, that's surprising. I looked it up and indeed you're correct. Well, oof. Though to be fair, I don't know how much of an impediment that is for actually implementing very wide decoders in practice. Hopefully not too bad.
[1] https://en.wikipedia.org/wiki/Motorola_68030
[2] https://en.wikipedia.org/wiki/I386
[3] https://www.righto.com/2015/12/reverse-engineering-arm1-ance...