Intel updates mysterious ‘software-defined silicon’ code in the Linux kernel (theregister.com)
210 points by flipbrad on Dec 8, 2021 | 205 comments


I was just thinking this morning that Intel segments their market by the class of person that is their customer.

"You're a consumer, you don't need error correcting memory! Only the upper class, I mean... your superiors... err.. sorry, enterprise customers need that feature. It's reserved for them, it is not for the likes of you."

Meanwhile AMD basically sells silicon by unit area. It's like buying gelato at the ice cream shop. You can ask for one scoop, two scoops, or three. You get to decide how hungry you are, that's it. There isn't a flavour with broken glass in it[1] served only to the working class stiffs, and with the glass-free gelato reserved only for the gentry.

[1] This is pretty much what a CPU without ECC is. It randomly crashes and corrupts your data. Look. Not every bite of ice cream has glass in it! There's very little glass in the ice cream. It's super rare, and Intel Pty Ltd tells you that this is an acceptable amount of risk for you to take, because you aren't as important as other people.


> This is pretty much what a CPU without ECC is. It randomly crashes and corrupts your data.

I use ECC RAM in systems with error counters.

Error correction is an extremely rare event, if it happens at all. This idea that non-ECC computers are crashing all the time due to memory errors isn’t true. I've also had unstable systems with ECC RAM and no error correction events, so it's not a magic bullet.

You can run something like memtest86 for days on end. You shouldn’t see any errors at all, unless your memory is bad.

The reality is that the average consumer doesn’t really want to pay the extra amount for ECC RAM. As you said, ECC has been available on AMD consumer platforms for a long time and there’s barely any uptake outside of people building servers and workstations with consumer-grade AMD CPUs.

The AMD (Consumer) ECC story isn’t perfect, either. It’s not officially supported so there’s a lot of debate around whether or not it’s actually working on certain consumer motherboards. It’s definitely not equivalent to the official ECC support of their higher-end parts.

Intel, on the other hand, actually did roll out a lot of i3 processors with full, official ECC support. These are (or were) a favorite for low-cost servers for this reason. It worked well and the support was great.


Error Correction isn't the important part of ECC: it's Error Checking. There was a very long thread recently about this in the context of row hammer over at the forums on Real World Tech. While ECC can't fix row hammer, it can let you know that something is wrong and the system operator needs to intervene. The same thing is needed on desktops. Without ECC you won't know that your system is failing and silently corrupting your data. Especially when DRAM vendors are shipping products that don't reliably meet the defined operating parameters in their data sheets. "Trust, but verify".
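The correct-one/detect-two behavior described above can be sketched with a toy SECDED (single-error-correct, double-error-detect) code: Hamming(7,4) plus an overall parity bit over a 4-bit nibble. Real ECC DIMMs apply the same idea to 64-bit words with 8 check bits, so treat this as an illustration of the principle, not actual DRAM controller logic:

```python
def encode(nibble):
    """Return an 8-bit SECDED codeword for a 4-bit value."""
    d = [(nibble >> i) & 1 for i in range(4)]          # data bits d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                            # Hamming(7,4) check bits
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]        # codeword positions 1..7
    overall = 0
    for b in bits:
        overall ^= b                                   # extra parity bit -> SECDED
    return bits + [overall]

def decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = bits[:]                                     # don't mutate the caller's list
    syndrome = 0
    for pos in range(1, 8):                            # XOR of positions holding a 1
        if bits[pos - 1]:
            syndrome ^= pos
    overall = 0
    for b in bits:
        overall ^= b                                   # parity over the whole codeword
    if syndrome == 0 and overall == 0:
        status = 'ok'
    elif overall == 1:                                 # odd number of flips: fix one bit
        bits[syndrome - 1 if syndrome else 7] ^= 1     # syndrome==0 means the parity bit itself
        status = 'corrected'
    else:                                              # two flips: detectable, not fixable
        return None, 'uncorrectable'
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, status
```

The 'corrected' path is what shows up in a server's ECC event counters; the 'uncorrectable' path is exactly the case where a machine without ECC would silently keep running on bad data.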

The extra cost for ECC is basically noise in the cost of a desktop system. I buy unbuffered ECC DIMMs for my AMD systems and they're basically maybe 5-10% more than a non-ECC DIMM. The SSD premium is more than that. Registered DIMMs have a higher premium mostly because they need to have additional ICs on the DIMM to handle the higher memory capacities that servers tend to support.

I really wish CPU vendors would just flat out require ECC on every system shipped. Speaking as someone who has encountered DIMM failures in otherwise stable systems, knowing that main memory is failing would save so much time. Days of debugging what turns out to be a hardware issue is no fun.


Same thought from a slightly different perspective. Decades ago I built a networking stack from scratch. It wasn't wonderfully reliable. Years later I figured out why. I was wet behind the ears back then, and I put my faith in ARCnet's 16 bit CRC. Had I put aside that blind faith for just a few seconds and looked at how many packets per second were being sent versus the number that would get through... well, I didn't.

I'm going to claim it wasn't entirely my fault. ARCnet did automatic resends for you, so it fixed what the CRC detected and silently let through what it didn't. I had no idea the errors were happening.

ECC is nice, but as you say I don't need fancy hardware to correct a malfunctioning RAM chip. I can do that perfectly well with a hammer. What I need is the hardware to tell when and where to deploy the hammer. Things were actually better 30 years ago - back then PC RAM had parity. It's cheap, fast and effective enough. I have no idea what went wrong.
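The old parity scheme mentioned above is simpler still: one extra bit per byte, enough to detect a single flipped bit (old PCs would raise an NMI and halt on a parity error) but not to locate or fix it. A toy sketch of the idea, not real hardware behavior:

```python
# Old-school PC parity RAM in miniature: 9 bits stored per 8-bit byte.
def parity_bit(byte):
    return bin(byte).count("1") & 1          # even parity over the 8 data bits

def store(byte):
    return (byte, parity_bit(byte))          # what a 9-bit-wide SIMM would hold

def load(byte, p):
    if parity_bit(byte) != p:
        raise RuntimeError("PARITY ERROR")   # detected, but which bit? no idea
    return byte
```

Detection alone is enough to "tell you when and where to deploy the hammer", which is the point being made here.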


> Without ECC you won't know that your system is failing and silently corrupting your data.

The likelihood that a RAM error is corrupting data silently is slim, given the same corruption could affect the RAM storing the OS kernel or your application stack.

If you have bad non-ECC RAM, you will notice issues that warrant further investigation (i.e. memtest). But that assumes you're savvy enough to understand what failing memory looks like.


It's not about "likelihood", it's about being able to trust the hardware. The CPU running in your system probably has ECC protected caches, and many RAS features that happen to be designed in because most CPU vendors design the features into their cores that are shared between consumer gear and enterprise servers. These features are a necessary part of building hardware that can be trusted with atomic scale features.

Furthermore, your comment reads like someone that has never spent weeks debugging issues that only happen under extreme load stress tests. A single marginal bit is not so simple to track down when memory allocation patterns are non-deterministic. I have no desire to go through that again if it can be avoided by spending an extra $20 per DIMM. My time is worth more than that. Hardware is not perfect, and anyone who trusts hardware blindly should really take a look at what's going on behind the curtain.

Fwiw, the nature of DRAM errors is different now compared to the studies published decades ago when cosmic rays would only disturb a single cell. These days minute physical manufacturing defects and electrical disturbances are the dominant failure modes. Educate yourself: read the papers published about row hammer. Hammering a row with reads can flip bits in adjacent rows, and even DRAMs that have supposed mitigation features can suffer from disturbances a row or two further away. It has already been proven that the DRAM vendors' hardware cannot be trusted! What more proof do you need that this is a real world requirement in 2021?


The real pain is that you have no way to sort out the reason for a crash. Without ECC, you don't know whether the driver you installed yesterday caused the crash, or the memory itself, or whether that driver just happened to land in the broken region of memory. Who knows? And memtest won't come close to reproducing the problem reliably. If the error happens continually, memtest will probably catch it. But for something that crashes maybe once a month? No, unless you're lucky. And that happens all the time if you overclock. That is basically why DDR5 mandates on-die ECC: at those high frequencies, data corruption is more likely to happen.


The most painful case I had was actually bad memory in a FIFO in an ethernet switch. A few months before one of the Red Hat releases, IT had installed a brand new shiny enterpricey Ethernet switch with lots of line cards and many ports. This was one of the early VLAN capable switches. During the release validation cycle, QA found that one of our stress test workloads was occasionally failing with single bit flips in filesystem data from time to time.

Being kernel developers, we were majorly concerned. There were lots of changes to filesystem, VM and other parts of the kernel, and we couldn't rule out that we had a bug where code was using a stray pointer or some other problem. The stress tests had successfully passed last release...

Ultimately we tracked it down to network data being corrupted. The shiny new ethernet switch was rarely flipping a bit in packets, then happily fixing up the packet's CRC and IP checksum so that the computers on either end of the network link had no idea that the data was mangled in the network fabric. Oh the hours of brain wracking pain that caused!
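To make that failure mode concrete: if a device corrupts a payload and then regenerates the frame check sequence over the corrupted bytes, every downstream CRC check passes. A small sketch using Python's zlib CRC-32 (the same polynomial as the Ethernet FCS) as a stand-in:

```python
import zlib

payload = bytearray(b"filesystem block data")
fcs = zlib.crc32(payload)              # sender computes and appends the FCS

payload[3] ^= 0x04                     # the switch flips one bit mid-path...
fcs = zlib.crc32(payload)              # ...and helpfully regenerates a valid FCS

# Receiver: the CRC matches, so the mangled data is accepted as good.
assert zlib.crc32(payload) == fcs
```

This is why end-to-end integrity checks (application-level checksums, ZFS-style block hashes) matter even on networks where every hop has its own CRC.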


> The shiny new ethernet switch was rarely flipping a bit in packets, then happily fixing up the packet's CRC and IP checksum so that the computers on either end of the network link had no idea that the data was mangled in the network fabric.

shudder


> I buy unbuffered ECC DIMMs for my AMD systems and they're basically maybe 5-10% more than a non-ECC DIMM

Where? Are you talking DDR4-3200? These are 40-50% higher than non-ECC UDIMMs. It usually gets worse as capacity increases. Additionally, there are only one or two models of 1x32GB UDIMM in production. These sticks are quite rare. You don't just go to Newegg or Amazon unless you want to pay out the arse for them. And in addition to that, ECC has worse timings which are important for Ryzen.

> Registered DIMMs have a higher premium

You have this backwards. RDIMM are common and cheaper because that's what servers generally use. UDIMM ECC cost more because, well, they barely exist as a thing.


UDIMMs don't have worse timings. The only bad thing is that they don't offer binned dies with XMP profiles on their SPD. My Kingston Server Premier 32G sticks with Micron Rev. E 16Gbit dies (KSM32ES8/16ME) run these timings very reliably at 1.4V:

BankGroupSwap: Disabled  BankGroupSwapAlt: Enabled  Memory Clock: 1800 MHz  GDM: Enabled  CR: 1T
Tcl: 18  Tras: 39  Trcdrd: 22  Trcdwr: 22  Trc: 83  Trp: 22
Trrds: 5  Trrdl: 9  Trtp: 14  Tfaw: 38  Tcwl: 18
Twtrs: 5  Twtrl: 14  Twr: 26
Trdrddd: 4  Trdrdsd: 5  Trdrdsc: 1  Trdrdscl: 5
Twrwrdd: 6  Twrwrsd: 7  Twrwrsc: 1  Twrwrscl: 5
Twrrd: 3  Trdwr: 9  Tcke: 0
Trfc: 630  Trfc2: 468  Trfc4: 288

Yes, they are somehow really hard to get in the US for decent pricing, but in the EU they are quite affordable (according to geizhals.eu ).


OC'ing ECC is generally a terrible idea, if that's what you're doing. Also, no one on Ryzen is considering CAS CL22 timings "good" on DDR4-3200 sticks. I think you need to go do research on non-ECC RAM. It really feels like everyone on this thread is so far outside the normal PC market for RAM that they are talking out of their ass here.


> OC'ing ECC is generally a terrible idea, if that's what you're doing.

How so? For example, I can see when higher summer temperatures become an issue and loosen my timings, then tighten them again as it gets colder. No more random crashes where I wonder if it's my RAM or something else.

Currently doing 3333MT/s CL14 on 4x16GB of 2666MT/s ECC. I use CL16 in the summer. Since the memory sticks are barely affected by the heat, and ECC errors occur on the same slot of the motherboard regardless of which stick is in there, I assume it's the memory controller on the CPU, in combination with four sticks of memory, that's the limiting factor here.


I even got my 2 32G sticks in the same channel to 3600, but it wouldn't train with anything better than CL28 or thereabouts. I heard that's likely due to the imbalance and/or quad-rank making the memory controller unhappy, as the PHY should be agnostic to the CL setting.

I mostly just did that test to check if I'd have to give up the 3600 speed if I upgraded from 64GB to 128GB on this 5950X; I'm well aware that one shouldn't be running it single-channel, especially not if that would mean quad-rank operation.


The notable part is that I operate it above 1.2V, which increases power consumption in a way that enterprise generally doesn't deem worthwhile. I run DDR4-3600 CL18.

I do question though how OC'ing non-ECC is a better idea than OC'ing ECC. Also, just for the record, I based my timings off of an XMP profile for a non-ECC version of the same die/capacity configuration (32G 2Rx8 w/ Micron Rev.E 16Gbit).

I also haven't seen 32G sticks that do better than CL22 at 3200 with just 1.2V.


Nemix is what most vendors use: https://nemixram.com/64gb-4x16gb-ddr4-3200-pc4-25600-ecc-udi.... Pricing seems on par…


DDR4-3200 is an expensive boutique product. Drop the speed requirement a little and you'll find the pricing is in line with my comment.


DDR4-3200 is already fairly slow for a new system -- DDR4-3600, for example, aligns with the 1800 MHz Fabric Clock on Ryzen and makes a not insubstantial difference in performance, and (non-ECC-wise) is not expensive.

I have a feeling the real cost is more than just a 5-10% price difference.


Amazon currently lists the 32GB Nemix DDR4-2666 UDIMM for $169.99 while a non-ECC UDIMM is $129.99. So you're right, at $40 it is closer to a 30% premium right now. But that one-time cost of $40 per DIMM is worth it if it even saves me 10-15 minutes of time debugging. And I'm willing to pay that small cost adder on every new machine I get, because I have the battle scars that have proven it's a worthwhile tradeoff. It's not like I'm dealing with hundreds of systems; this is across maybe 2 or 3 dozen machines over the past 30 years.

The battle scars are as follows:

- The 2MB expansion card on my Amiga 500 turned out to have a couple of bad bits, and because I was a green techie back then, I just attributed it to general software instability. Years later a RAM test showed there was a problem. Sadly there were no memory test utilities shipped with Amigas back then.

- My first 486 system had a bad bank in the SRAM on the motherboard. It took years to figure out, as it only really hit when heavy DMA occurred from a VLB SCSI card while the CPU was heavily loaded. Swapped the SRAM out and it was stable for a few more years of service.

- One of my K6 systems had a DIMM go bad. At least it failed badly enough that the system couldn't boot.

- I've had an Intel Xeon in a colocated server start throwing ECC errors in the L3 cache. Replaced and RMAed the CPU before it caused an outage.

- One system started throwing ECC errors because the power supply was marginal.

Now, please, show me how the extra cost of ECC is worthless to you when things like this happen in the Real World. Is your time spent debugging hardware failures really not worth the cost of ECC?


Thanks for the downvotes, folks. Really.

> DDR4-3200 is an expensive boutique product.

You have to be fucking kidding me. On a desktop Ryzen?? Most desktop PC builders are gamers. That's just a fact. They drive the market. That's why every motherboard looks like a gamery stealth bomber. Outside of Asrock Rack and one or two Asus Pro boards, there is simply nothing out there that matches what the server world is seeing with SuperMicro.

I have 1x32GB DDR4-3200 UDIMM ECC on my Ryzen. But I have no delusions about the cost. It was expensive as fuck compared to non-ECC RAM.


(And there's no way to get ECC memory with decent timings because DDR4 is binned to death, so even if you found some hypothetical Samsung B-die ECC SKUs they'd be like 2333 or 2666 MHz and you'll never get them to work at 3200 or 3600 with timings better than 80-80-80-10000 or so. Meanwhile 3200CL14 or 3600CL16 is pretty usual without ECC.)

edit: On a second look, Mushkin will actually sell you 3200 CL14 ECC memory... 14-18-18-38 that is. At almost 10 quid on the gigabyte. Meanwhile you can get real CL14 non-ECC memory at around half that.


I think the worry is less about crashes and more about silent corruption. Rare, sure, but not that rare:

"Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month"

https://en.wikipedia.org/wiki/Cosmic_ray#Effect_on_electroni...
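For scale, here is the arithmetic that figure implies for a modern desktop, assuming (generously) that the 1990s rate still applied -- modern DRAM failure modes are different, so this is purely illustrative:

```python
# Back-of-the-envelope math for the IBM-era figure quoted above:
# one cosmic-ray-induced error per 256 MB of RAM per month.
errors_per_mb_per_month = 1 / 256
ram_mb = 16 * 1024                      # a typical 16 GB desktop
per_month = ram_mb * errors_per_mb_per_month
per_year = 12 * per_month
print(per_month, per_year)              # 64.0 errors/month, 768.0/year at the old rate
```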


More recent, larger-scale studies with modern hardware showed about a 3% chance of a DIMM having an error in a given year: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

But these weren't just cosmic-ray-induced errors. The DIMMs with errors were far more likely to have more errors in the future, pointing to faulty memory cells.


Where are you getting the 3 percent chance? Is that per DIMM, per machine? How does that align with these passages?

"About a third of all machines in the fleet experience at least one memory error per year"

"The median number of errors per year for those machines that experience at least one error ranges from 25 to 611."


"Across the entire fleet, 1.3% of machines are affected by uncorrectable errors per year, with some platforms seeing as many as 2-4% affected" is enough to worry me.


Ah, "uncorrectable". That leaves "correctable", which only gets counted on ECC machines, correct?


I prefer more ECC than less, but as a friend of mine noted, most of the data are JPEGs (or video for that matter) and bit flips there usually go unnoticed because they simply adjust the pixel color slightly. So the chance that 1-in-2048-million bits is actually an important metadata bit that corrupts data structures is pretty low.


If only they did crash. It is the undetected corruptions that are far more dangerous than the crashes.


As someone who has been aware of ECC RAM for a long time: could you briefly explain a concrete case where it is useful? Like, in all my time using computers I don't think I've ever encountered an error caused by RAM. Or... did I?


You probably didn't even realise it was a memory error. A lot of the instability of Windows can be attributed to the use of non-ECC memory in typical home PCs. Linux tends to run on servers with ECC memory, so at least some of the perspective of it being "more stable" comes from this anecdotal evidence.

The parent comment saying that they never see ECC errors in the wild is missing a few things:

- Server memory tends to be clocked lower than consumer memory, so errors are less frequent to begin with.

- The errors are not evenly distributed. Some memory sticks have a high error rate, others are virtually zero. There's batch-to-batch variations.

- I've done my own tests on hundreds of servers. We run burn-in tests for about 24-48 hours. About 95% have zero errors of any kind, but 5% have a high enough rate that putting them into production would be a mistake. ECC allows us to catch those bit errors instead of silently accepting them and allowing data corruption to creep in.

- I've personally had 3 different personal computers experience high memory error rates, to the point of multiple BSODs per day and data corruption. They all started off "good" and slowly turned "bad". The only reason I knew to look for memory corruption as the root cause is because of my extensive industry experience. A grandma using the same PC would have just blamed Windows for being unstable.

- Vendors like Microsoft simply ignore all crash error reports sent back by telemetry with only 1 or 2 samples, because those are virtually guaranteed to be caused by memory corruption, not programmer error. I've done similar memory dump collection and found that easily 30% of all crashes were unique in this way, suggesting that ECC memory could improve PC stability significantly.

- Suggesting that ECC memory is not needed because "good" memory doesn't need it and only "bad" memory is a problem is missing the point. All memory is bad, it's just that the bit error rates are different!


I recently had an interesting experience with bad non-ECC RAM. The only effect I saw for many months was a frequently crashing web browser. I was very annoyed and downgraded to the Firefox LTS release to try to solve the issue. The problem went away for a while, then came back. Seemingly erratically, the problem could resurface only to disappear a while later.

Eventually I got a corruption in a large Git repository with photos. First I suspected a disk error but reruns of "git fsck" reported different bad objects, run to run. How odd, so I ran memtest86 and it reported a bad 32 MB memory region at offset 4 GB. Never saw any kernel issues or other instability. Booting up the computer does not use up 4 GB of memory (Linux) but starting a bloated web browser does.

The computer is stable after mapping away the bad memory region using GRUB_BADRAM. That was a new takeaway for me: a machine with ECC memory can do this bad-block mapping automatically. It's not just beneficial for correcting single-bit errors.
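For anyone wanting to do the same, the mapping lives in /etc/default/grub as an address,mask pair. The exact values depend on what memtest reports; this hypothetical entry matches a 32 MB region starting at the 4 GB mark, like the one described above:

```shell
# Hypothetical /etc/default/grub entry. The mask's set bits are the address
# bits that must match, so 0xfffffffffe000000 selects one aligned 32 MB
# (0x2000000-byte) window starting at 0x100000000 (the 4 GB mark).
GRUB_BADRAM="0x100000000,0xfffffffffe000000"

# Then regenerate the config so GRUB excludes the region at boot:
#   sudo update-grub                              # Debian/Ubuntu
#   sudo grub-mkconfig -o /boot/grub/grub.cfg     # elsewhere
```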

I would love to have ECC memory in my machine. I used it for my previous build in 2012 but I think the situation has gotten worse since then. Bigger price difference and ECC memory is not even available at the same clocks as non-ECC memory.


Practically speaking, ECC is extra insurance against faulty memory.

It's true that ECC will catch errors caused by cosmic rays, but in practice most memory errors are just from faulty memory. Large studies in Google datacenters showed that the errors were heavily concentrated in a small number of DIMMs: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

The catch is that you don't really know if you have one of these faulty memory sticks. I always run memtest86 overnight to check for obviously faulty memory, but some of these errors could take months to manifest.

I had a laptop with a RAM cell that failed. It would show up in memtest86 in a matter of minutes, yet surprisingly I didn't notice the issue for a very long time in day to day usage. I always wonder if there were random bit flips in anything I worked with during that period, but I'll never know.

With ECC, it's just one less thing to worry about.


One part of the equation is that a shit-ton of memory modules are, in fact, overclocked (e.g. do you think your DDR4-4800 runs at nominal speed?). That surely contributes to memory being faulty.


DDR4 memory these days rarely runs at its default frequency (2133/2666) unless you buy a really cheap kit. Most 3200/3600 memory is in fact overclocked (the factory just pre-configures the overclock setting with XMP), except for a few parts that are natively 3200 (some memory from Micron is natively 3200 at 1.2V).


Well there's a bunch of problems. Could be high energy particles flipping bits, but that's very rare. Could be a row hammer attack flipping bits. Could be a flaw in the memory chip, dimm, dimm socket, motherboard traces, CPU socket, or inside the CPU.

Without ECC any of the above will crash an app, or even the entire machine. With ECC you'll get a log message, and if it's a single bit error it will be automatically repaired. If it's not fixable and it's in userspace that application is killed (and the kernel logs the error), if it's in kernel space the kernel will panic.

So generally it makes your machine more reliable, and easier to debug. Generally repeated errors are a fault of some kind and you can easily tell which dimm it is. Without ECC you end up troubleshooting all causes of crashes, and even if you know it's memory, you can't be sure which dimm it is.

So I think it's well worth the minimal premium to make your machine more reliable, and if it's unreliable it's much easier to track and fix the problem.

Amusingly, without ECC, heavy CPU use with parallel gcc compiles causes a particular error. Not sure if it's in the FAQ, but it's well known that this particular error means you have a CPU/RAM problem, common with memory errors, CPUs that are too hot, or overclocked CPUs.


Anecdote: I once lost a lot of files over a period of ~two years on a desktop PC running ZFS with bad non-ECC RAM. I figured out what was going on after a bunch of audible corrupt blips kept appearing in all my favorite FLAC rips as they got mapped into the bad memory, checksummed against the contents still on disk, and written back "corrected" as far as ZFS was concerned. There's a better (i.e. not by me) write-up of it in the OP of this thread under "What happens when non-ECC RAM goes bad in a ZFS system?": https://www.truenas.com/community/threads/ecc-vs-non-ecc-ram...


Thank you, this is a perfect illustration and one of the scenarios I feared around keeping data long-term, though I was thinking about disk errors and such.

Sidenote: I seem to have experienced Cunningham's Law for the first time semi-consciously, as I was aware my claim was most probably wrong. Still, I wonder if data corruption is a more common scenario than crashing, because ever since I switched to Linux I don't recall experiencing many crashes I couldn't attribute to a driver issue or a known instability in the software I used, especially on compatible hardware. Windows days were another story... But maybe the memory modules were worse back then, and perhaps having less memory made it more likely to crash?


How would you ever know? If your computer or a program has ever crashed or acted weird, that's a symptom of memory errors. It's also a symptom of a million more common issues, but you would never know.

There are some bugs that get raised on linux once every few years that are so obscure they are suspected to be random memory errors.


You will find it useful when you crash once a month and realize that turning the memory frequency down a bit fixed it forever. (Technically I'm still guessing; without ECC, you can't get the answer.) Seriously, the hardware should be able to determine whether the configuration is safe instead of leaving you to guess randomly. Even if memory is labeled as working at 3200, that isn't necessarily the case when you actually use it.


> Error correction is an extremely rare event, if it happens at all.

your systems must all be running at sea level.


"I was just thinking this morning that Intel segments their market by the class of person that is their customer."

Just about every company in the world that sells things (including AMD) engages in market segmentation.

A classic - https://www.joelonsoftware.com/2004/12/15/camels-and-rubber-...

It has nothing to do with the "class of person" -- that whole notion seems pejorative in this context -- but maximizing how much you can extract in the aggregate.

In the case of Intel, for years (decades?) they were their own primary competitor. They wanted to be sure that for a given customer they could extract the maximum amount possible, and they did this by gating some features (ECC, AVX, etc) to try to avoid business customers deciding to get by on "lesser" processors. And the simple truth is that the overwhelming bulk of consumer products will never, ever have an ECC relevant error event, which is how businesses managed to get by without it.


> and Intel Pty Ltd tells you that this is an acceptable amount of risk for you to take

AMD does the same thing too. Ryzen APUs don't have (even unofficial) ECC support, except if you pony up for a Ryzen Pro one.

(and so do many other SoC providers...)


Slightly different in the AMD case as the Ryzen Pro are the same socket and not really much more expensive (eg, https://www.amazon.com/Ryzen-4750G-Processor-3-6Ghz-Threads/... ). So as a consumer you can make a fairly straightforward value prop tradeoff there. Well, you could except AMD throws in the complication that they no longer sell APUs directly to consumers, not officially anyway. These are all technically "OEM only" parts.


5600G and 5700G are sold directly to consumers, the Pro variants aren't. Only the 4000G series was completely OEM-only for some reason.


A lack of official support doesn't mean it doesn't work. Although you're right, AMD does do a bit of product segmentation here and there. The laptop space, while considerably simpler, is segmented not only by core count but also by threading and TDP.

I've been expecting them to just simplify to a core count/memory channel/socket matrix for a while because they are so close. There really isn't any reason these days to segment by frequency beyond maybe a couple golden parts where all the cores run at the peak frequency.


> A lack of official support doesn't mean it doesn't work

For Ryzen CPUs (ones without an iGPU), they don't fuse it off, and as such it can unofficially work.

For Ryzen APUs (parts with an iGPU), they fuse it off, and as such it will never work there.


Do you think every price should be set according to marginal cost? Since the marginal cost of a chip is extremely low after you've paid for the fab and design costs, should those be free as well?


In general you want to get more money from people who have more money. Everyone does this, including nonprofits and governments. There are lots of ways to wrap it up in things like value, status, or features and benefits, but ultimately that’s the basic goal.


Intel's cost is designing the circuits and the means to etch them, not buying the sand.


Seems like a pretty heavy analogy for offering a cheaper SKU with a cut feature.


>It randomly crashes and corrupts your data

My PC doesn't have ECC memory and it isn't crashing or corrupting my data. I think you are vastly exaggerating.


A long while ago, I worked on part of a memory system that we knew, during development, was corrupting RAM. The amazing thing is just how high the error rates could get before the machine would actually crash. That is largely because so much of RAM is code that doesn't get executed and data structures that are resistant to random corruption. Even actual data corruption, when it happens, is frequently subtle, because it flips a transaction timestamp by one digit in the midst of millions of transactions, or the data is transient: a video frame with a miscolored pixel, etc. It's really like playing Russian roulette with a revolver that has a few (b/m)illion chambers.

Generally by the time you see actual program crashes or notice data corruption your system is really good and screwed. That is part of why people in the know are so afraid of RAM corruption. It can be persisted and exist silently for a very long time before someone tries reading some file/transaction that was corrupted during a write years back, or the database suddenly starts crashing after it updates some index as part of a GC pass/whatever, while the actual bug/HW failure will never be reproduced.


I've had my own experience with a bad stick of RAM, but that just caused Firefox to crash.


If you luck out with reliable hardware, pure bit flips from high-energy particles are rather rare.

Problem is if you start having unreliable hardware it could be any component of your system. ECC memory helps track down a class of errors that could be in the ram chip, dimm, dimm socket, motherboard, CPU socket, or CPU.

My desktop cost another $100 or so because I got a E3-1230 xeon instead of the similar clocked i7. I bought it in 2015 and it's been fast, and reliable. Sure I might manage 6 month uptimes anyways, but I also might track down a dimm problem in hours instead of weeks.


Not that it happens frequently, but if it does and your application scenario is critical, how does one know whether the corrupted data is a minor color component in an image or frame of a movie, or a variable in a spreadsheet cell buried somewhere? Would anyone notice? If a bit flip happened in a location containing the target address of a jump, then the software or the OS would likely bomb immediately, telling you something is wrong somewhere, but (unfortunately) there are subtle errors that can go unnoticed until it's too late. ECC is meant to protect against those too.


This reminds me of a story told to me by a work colleague who began his career on Mainframes in the late 70s. One work place he was at paid several million for a major upgrade to their IBM mainframe. A guy in a white lab coat showed up, took the side panel off the mainframe, removed a circuit board and gave them a piece of paper to sign saying the job was complete.


It's not an ancient practice, I recall a case in the early 2000s where we got an expensive CPU upgrade to a mainframe which was not implemented by "a guy in a white lab coat" but as a one-line script (with a cryptographic key) to unlock more CPUs which were already installed on the system when it was first shipped.


There are Cisco Nexus switches today that require licensing to enable all the ports.


We had a Burroughs 6700. At some point (late 70s) we agreed to pay for what was called the "$50,000 screwdriver", which upped the clock speed. We wrote a contract where we would run a benchmark before and after, and if it wasn't X% faster we wouldn't have to pay. We ran the before benchmark, the field engineer went into the cabinets to up the speed, and came back looking sheepish: it seems he'd regularly been turning the clock speed up during preventative maintenance sessions to make sure the machine was performing OK, and at some point had forgotten to turn it back .... we didn't have to pay


Also, I designed a Mac graphics accelerator gate array back in the late 80s. We were worried that the promised VRAM performance would not be fast enough, so I designed in the ability to add extra clocks to RAS and CAS timings. It ran at full speed, which was great, but marketing wanted a lower price-point version of the graphics card, so we sold one with a ROM that turned on those extra RAS/CAS clocks.


Hahaha, that's great :}


It's a very common practice, not just in computers. Many digital oscilloscopes (and probably other instruments too) can be hacked by just editing a file or flashing new firmware. The purpose in scopes is making them behave as more expensive models with much higher bandwidth, which means the faster ADCs and circuitry are already there, but not used to their full potential so that the device can be placed in a lower market segment and sold cheaply.

https://nyctomachia.wordpress.com/2018/09/02/riglol-unlock-e...

https://www.makermatrix.com/blog/hacking-the-siglent-1104x-e...

..etc.


And if anything goes wrong, it will just halt during boot and that's a $4000 repair.


Make it at least an order of magnitude less; most of those models are in the $250-$500 range.


Probably true. I was thinking of a particular "major" brand. I do own a couple scopes of those lower priced brands.


Isn't this the same as SaaS thinking, albeit in the hardware era?

For the more common story of IBM enabling present but disabled memory, they couldn't magically teleport a new memory board into your mainframe. But they could ship your mainframe with extra, verified but disabled memory.

Customer UX for memory upgrades then changes from "Wait for it to ship, etc" to "We'll have it done tonight."


Yes indeed. These kinds of things afflict enterprise IT all the time.

I used to run b2c teams which dealt with big banks. One bank's IT department used to bill internally for data center costs using RAM as a proxy for energy usage. A big spark cluster was purchased for the project I worked on and the minimum RAM for the spec of machine that was "standard" was 128MB, so to "save costs" they had the data center guy take out half the memory in every machine before racking it up. :-(


Tesla sold 75kWh cars to people who bought 60kWh cars. Then, during the wildfires here in California, they remotely unlocked the extra 15 for people trying to get out of town. If you wanted to get access to what you had legal title to, well, that's an upcharge.


With batteries it's a bit more complicated as LiIon degrades faster at higher charge levels / voltages. So you're trading longevity for extra capacity in that case.
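A toy calculation makes the point (these are not Tesla's actual numbers, just the arithmetic): a software cap of 60 kWh on a 75 kWh pack keeps the cells at or below about 80% state of charge, the region where Li-ion calendar and cycle aging are considerably slower.

```python
# Toy calculation, not Tesla's actual numbers: the software cap expressed
# as a maximum state of charge for the underlying cells.
pack_kwh = 75
cap_kwh = 60
max_soc = cap_kwh / pack_kwh
print(max_soc)   # 0.8
```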


IIRC they did sell people a software unlock for the other capacity which goes against the idea that it was designed to be like that.


Not the same thing. In the event of a life- or property-threatening emergency, it's reasonable for the company to temporarily remove the charge cap.

The batteries are backed by a warranty, which in turn is part of the cost of the car. It makes perfect sense to derate the batteries in order to lower warranty costs. Of course, we can debate whether the consumer ever saw the benefit of those lower warranty costs, but the core idea is sound enough.

As for "legal title," you have "legal title" to whatever the sales contract promised. If access to full battery capacity at the expense of service life wasn't in the contract, then you're not entitled to it.


As for "legal title," you have "legal title" to whatever the sales contract promised. If access to full battery capacity at the expense of service life wasn't in the contract, then you're not entitled to it.

That's definitely not how physical property works.


Really? Educate me.

I mean, there are implicit guarantees that don't have to appear in the sales contract, such as the requirement that the car will meet the applicable pollution-control regulations and won't actively try to kill me, but that's obvious enough. Did you have something else in mind, particularly vis-a-vis the batteries?


(Not a lawyer) Generally, at least in the US, when you own a physical object, you can do whatever you want with it that isn't explicitly illegal. This is at least partially represented in the first sale doctrine. For a car, you have a literal entitlement document showing that you own it. What you're not entitled to is any help from the manufacturer to do so, except that repairs can't void a warranty if those repairs didn't cause the problem being warrantied. With regard to batteries, there's no reason someone who owns a Tesla shouldn't be able to replace or upgrade the battery they own on the car they own.


Sure, but what does that have to do with what happened here? No one tried to hack the battery management system. The company modified it as a goodwill gesture (and to ensure that it didn't cause a PR disaster by functioning as intended.)


> > If you wanted to get access to what you had legal title to, well, that's an upcharge.

> As for "legal title," you have "legal title" to whatever the sales contract promised.

You own the car. You own the battery. No sales contract changes that fact of ownership. They delivered a 75 kWh battery into your possession. Since it's your battery, you should be able to use your battery however you want without having to pay Tesla. People are rightfully upset by a PR move that proves that the manufacturer is treating a car that was sold and delivered as if it still belongs to the manufacturer.


Business practices should have their own book of law, which says what is and what isn't acceptable.


Why shouldn't it be acceptable?

You bought a 60 kWh car. If it performs as advertised, you are not entitled to more, no matter what hardware is inside. And if you find a way to exploit the full capacity of the battery, you can; it's yours. But don't count on Tesla to support it.

I have no problem with market segmentation. It allows more people to have access to a product while protecting the margins of the manufacturer. Because, without margins, there is no product.

And in fact, I am happy when I am traveling and the student next to me paid half of what I paid, I could afford the higher price ticket in exchange for more convenience, so no problem for me, but if the student had to pay the same price as I did, he wouldn't have been able to travel at all. I was like him years ago, and our children will probably be too.


The problem in this case is that you're dragging along 15kWh worth of battery weight, which is a waste of energy.


Either way, the capacity lowers as time goes on. Your 60 kWh battery might become 50 in a few years. The extra 15 is supposedly a buffer against that.

It’s the same thing with spinning hard drives and SSDs. The reported capacity is almost always lower than the actual capacity. That allows dead blocks to be “replaced” without the user even knowing (“remapped sectors”). Could you imagine the PR disaster if your SSD lost visible capacity as time went on?

You do have a point about the weight. The cost of shipping an SSD with 0% overprovisioning and one with 25% is negligible; the ICs weigh just a few grams. However, a car that has to haul an extra 15 kWh of unusable capacity does use extra energy. I’d be curious how much is needed to haul those extra cells.
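As a back-of-envelope answer to that curiosity (every figure below is a rough assumption, not a Tesla spec): at typical-ish Li-ion pack density, 15 kWh of locked-out cells is on the order of 100 kg, which costs only a handful of Wh per km to haul around.

```python
# Back-of-envelope: the energy cost of hauling 15 kWh of locked-out cells.
# Every figure below is a rough assumption, not a Tesla spec.

pack_density_wh_per_kg = 150                       # typical-ish Li-ion pack
extra_mass_kg = 15_000 / pack_density_wh_per_kg    # ~100 kg of dead weight

# Crude model: assume ~150 Wh/km at ~2000 kg total mass and consumption
# roughly linear in mass (an overestimate at highway speed, where aero
# drag rather than mass dominates).
base_mass_kg = 2000
base_wh_per_km = 150
extra_wh_per_km = base_wh_per_km * (extra_mass_kg / base_mass_kg)

print(round(extra_mass_kg))        # 100
print(round(extra_wh_per_km, 1))   # 7.5
```

So the penalty is on the order of 5% of consumption under these assumptions, which is real but small next to the warranty savings from derating.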


Sure, but using 60 kWh of a 75 kWh battery lasts longer than a 60 kWh battery would. Not to mention that Tesla's R&D to make a 60 kWh battery, an extra assembly line (or one flexible enough to produce both), extra inventory, testing, maintenance manuals, etc. all have costs as well.


Cars are often shipped with disabled seat heaters, and CPUs are often shipped with disabled cores. As a customer, I don't see why I should mind.

I understand that in the case of Tesla you don't like the idea of hauling those extra unused cells around. But you are buying a car model with a specific mass, acceleration, and range. It makes no difference how exactly the producer fit within those parameters.


Isn't that just contract law?


The problem is that the law is too abstract and not sufficiently exhaustive to address all the disingenuous schemes used by businesses.


I have a similar story. A company I worked for in the 90's had a voice mail system that ran out of storage space for messages. When we upgraded our contract, a technician came on-site to expand our storage. I was curious how it worked, so I watched over his shoulder: he physically removed a bolt from the hard drive. I asked about it a bit and he said that the bolt had prevented the drive from accessing the entire disk. My mind was blown.


IBM field engineers had a special dongle key that could unlock hardware features that were already present on the hardware, for a price


What did that circuit board do?


Might have added some capacitance or something to the system clock, so that it didn't cycle as quickly.


Hardware-based sleep()? :)


> We were sceptical of that assertion because making the effort to add code to the Linux kernel without planning for it to be used is thoroughly counter-intuitive

I mean... this seems like a very good point, but from the rest of the article, and the previous one, it sounds like the patch was approved?

Why was this allowed into the kernel? It looks like drivers/platform/x86/intel/* mostly. Does Intel just "own" that section of code? Who accepts these patches?

I'm not particularly familiar with this stuff so would appreciate someone filling me in on this.


> I mean... this seems like a very good point, but from the rest of the article, and the previous one, it sounds like the patch was approved?

Intel actually said, in a short snippet where The Register refers to them as "Chipzilla": "If we plan to implement these updates in future products we will provide a deeper explanation of how they are implemented at that time." One interpretation is that they haven't figured out whether they are going to implement this in anything and just added it for fun; I'd call that the "makes a good article" interpretation. The other interpretation is that they plan to use the feature but have not yet finished planning which products will use it to unlock what, and will talk about it in more detail once planned; I'd call that the "corporatespeak decoded" version. Even to a literal-interpretation bot, I'd say the corporatespeak version fits better than the article-filler version, hence why the extrapolations built on that version don't match up to what is happening.

> Why was this allowed into the kernel? looks like drivers/platform/x86/intel/* mostly. Does intel just "own" that section of code? who accepts these patches?

I didn't check whether it was actually accepted yet or just a proposed patchset, but generally Intel will be the maintainer for Intel code (though not always), and they are one of the larger contributors to the kernel overall. You can see more details about maintainers and reviewers (as well as maintainers for sections of code) here: https://www.kernel.org/doc/linux/MAINTAINERS#:~:text=One%20j....


Thanks for that maintainers link. That's exactly what I was looking for.

It seems that a lot of the changes in the patchset are in the Intel PMT driver, which is maintained by the author of the patch, so it makes sense that the maintainer included it.


I'm not an expert but: you can look at the log for any given file and see who authored what commit, who reviewed, who actually committed. For example, a different change by the same author [1]

And you can search the mailing list[2] by commit hash/author name/etc and look at relevant discussion trees to see lively patch discussion and criticism/final pull requests/etc.

I don't believe the changes in question have been merged.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[2] https://lore.kernel.org/lkml/?q=David+E.+Box
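For anyone who wants to try this, the git side looks roughly like the following, demonstrated on a throwaway repo so the snippet is self-contained ("Jane Dev" and the commit message are placeholders; against a real kernel checkout you'd point the same commands at e.g. drivers/platform/x86/intel/):

```shell
# Illustrative only: the same commands you would run against a real kernel
# checkout, shown on a throwaway repo. "Jane Dev" is a placeholder author.
set -e
cd "$(mktemp -d)"
git init -q demo
cd demo
git -c user.name="Jane Dev" -c user.email=jane@example.com \
    commit -q --allow-empty -m "intel/pmt: example change"

git log --format='%h %an %s'   # one line per commit: hash, author, subject
git shortlog -ns HEAD          # commit counts per author
```

On a kernel tree, the Signed-off-by/Reviewed-by trailers in each commit message are what tie a change back to the maintainer chain you see on the mailing list.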


I don't think this is really new for Intel, so they are very likely to use this functionality. They did something similar with Pentiums about 11 years ago: https://www.engadget.com/2010-09-18-intel-wants-to-charge-50...


There was at least one IBM machine that had a similar "feature".

It could run in slow-mode or fast-mode. Always-on fast-mode cost more per month than slow-mode. (Yes, these machines were leased, not sold.) However, you could get "fast-mode" on an as-needed/hourly-basis, for a "small" fee of course.


The precursor to cloud provisioning, I see.


Whatever happened to having chips that didn't have all this secretive scummy nonsense behind it? I feel like we should just roll back to the 68k era at this point.


ACPI is to blame because it specifically enabled laptop manufacturers to hide platform hardware details behind a shitty firmware interface instead of providing OS drivers and hardware details to driver developers. It was around that time that SMI became a thing on Intel CPUs, and platform firmware started using it for fan control instead of depending on the OS to do that.


ACPI is a huge part of the reason that Linux works at all on x86, and in many cases is as popular as it is. If x86 acted more like ARM boards, where pretty much nothing works right, no one would use it for general-purpose computing. There are literally hundreds and hundreds of drivers for PMICs, clock dividers, GPIO pin muxing, etc. for the hundred or so boards that sorta work, and each one needs customization that takes months or years every time something gets tweaked on the board.

ACPI allows random x86 board vendors to literally have hundreds of products themselves that not only boot linux, but a pile of other OS's because they aren't wasting their time writing drivers for every single board and voltage regulator.

So, call it shitty, but understand that it's the reason you can boot a single Linux image on everything from an Atomic Pi to an HP Superdome Flex, and on the thousands of desktops being customized by people in their bedrooms.

It's also largely the reason why Linux works on 20-year-old PCs that no one is actually testing on anymore.

The alternative is a monoculture where the HW is basically provided by a single vendor that doesn't change much and provides a half dozen supported configurations. I would point to the zseries here because that is effectively how it works, but it also has a huge hw/firmware abstract machine that allows IBM to change the underlying HW details without having to rewrite all this low value garbage.


> There are literally hundreds and hundreds of drivers for PMICs and voltage dividors, GPIO pin muxing/etc for the hundred or so boards that sorta work, and each one needs customization that takes months/years everytime something gets tweaked on the board/etc.

There's literally tens of thousands of devices for the PC platform and Linux has no problem supporting them if the hardware can be documented. Why can't the vendor document that instead of hiding this stuff behind opaque binary blobs that can have security vulnerabilities? Whether this is ARM or x86 doesn't matter.

> ACPI OTOH allows random x86 board vendors to literally have hundreds of products themselves that not only boot linux, but a pile of other OS's because they aren't wasting their time writing drivers for every single board and voltage regulator.

I'm parsing this as "ACPI OTOH allows random x86 board vendors to develop shitty hardware, cut corners on documenting it, and not care as long as Microsoft OS's boot on it."

> they aren't wasting their time writing drivers for every single board and voltage regulator.

Document the damn registers and hardware interfaces and then someone out there will write it for them.

> Its also largely the reason why linux works on 20 year old PC's that no one is actually testing on anymore.

Well Linux still supports hardware that predates ACPI, like floppy drives, so this is making a connection where there is none.

> The alternative is a monoculture where the HW is basically provided by a single vendor that doesn't change much and provides a half dozen supported configurations.

Wrong. Before ACPI, for example, there were many chipset vendors--Opti, VIA, etc. ACPI didn't kill these off but there wasn't a monoculture before ACPI.

> I would point to the zseries here

So ... do we want an IBM-mainframe like monoculture where you have to depend on IBM when the hardware changes?

> allows IBM to change the underlying HW details

The job of abstracting the hardware details belongs to the operating system and operating system drivers. Literally, 50% of the reason you have an OS in the first place is to abstract I/O into things like open(), read(), write(), or other interfaces that are portable because of underlying drivers. If the hardware is so different that existing I/O calls can't handle it, you need to develop new ones; that's what happened with Berkeley sockets, since NICs are not block or character devices.

If the drivers are open source then they are bug-correctable and useable even well after the hardware manufacturer goes away. Embedding that in closed-source platform-firmware makes you dependent on that platform manufacturer. A strong contributor to the Wintel monopoly.


  Wrong. Before ACPI, for example, there were many chipset vendors--Opti, VIA, etc. ACPI didn't kill these off but there wasn't a monoculture before ACPI.
That is where you're wrong: x86 PCs had BIOS and APM, which also provided minimal platform abstractions. But there was a monoculture: you either provided PC/AT HW and BIOS compatibility or your x86 didn't work. There were HW "standards" for everything, be that CGA/EGA/VGA or IDE controllers. Yes, you might make your own video card, but it absolutely supported those standards. Similarly with chipsets: there was closed-source early-boot firmware, but by the time the MBR was being loaded it looked like a 1980's PC/AT.

As for there being enough driver developers to fill the kernel with all these drivers: that betrays a fundamental misunderstanding of how complex even those ARM machines are behind the scenes. Even the ones that work likely have huge stacks of binary code running in places that aren't visible to Linux/DT. You need only look at some of the open or reverse-engineered ARM boards to see that. The RK3399 specs were published, what, 4 years ago at this point, and there are still RK3399 fixes landing. Should it take 5+ years from the release of a piece of hardware before Linux can boot and work reliably on it?

edit: See https://en.wikipedia.org/wiki/Option_ROM for where all that binary code used to hide in the 1990's when you plugged in random "vga" boards and storage controllers.


The original idea behind BIOS was to provide drivers (and a firmware interface to them) so CP/M could run and also read the first disk block into memory so CP/M could load.

This was awesome when your platform consisted of an 8-bit CPU, a serial port or two, a printer port or two, a disk drive or two, and a text-based video display.

ROM and BIOS is a poor place for drivers if

- your system has any notion of plug-and-play at all

- your system has expansion slots and arbitrary people you don't control might develop hardware for it.

and these two things above are desirable if you want a free-ish computing platform not monopolized by one company.

Hardware interfaces are not the same as the firmware gunk ACPI foists upon you. A hardware interface is simply a way for the CPU to talk to a device outside the system; it doesn't result in the CPU running unknown code behind your back. A CPU that's cordoned off behind a peripheral interface running closed-source code is fine--what's not fine is the same CPU that's running my OS kernel.

> Should it take 5+ years from the release of a piece of hardware before linux can boot and work reliably on it?

I mean if the hardware manufacturer won't document their devices it's something they bring upon themselves. Embracing the open source community here would have substantial benefits unless something like market segmentation is taking place.

Option ROMs? Yeah, those are BIOS extensions--your OS isn't dealing with those ROMs once the BIOS has booted the OS, unless it's a CP/M-era operating system like actual DOS. Except for the modesetting--but that's just as much bullshit as ACPI. Document the damn registers that do the modesetting so we don't have to thunk back into 16-bit mode just to change the screen resolution.


Uh, option ROMs are still a foundation of how PCIe etc. work. So unless you don't consider PCIe to be plug and play, option ROMs are very much a part of PnP. Ever plugged in a GPU? How do you think the BIOS/EFI/GRUB display things on the screen? How about net-booting off a random network adapter, or a storage controller that isn't AHCI/NVMe?

So, all that said, your mental model of how a modern machine works seems stuck on the idea that Linux is the center of the machine and can access all the hardware, and that the HW looks like a 1980's PC with "registers" that actually modify HW state. That is provably false, and will continue to be as long as people want inexpensive and/or high-performance machines.

The mainframe guys go on about channel processors, but that concept (using a small CPU to manage a piece of HW or a communication link) is fundamental to a very large percentage of HW produced in the past couple of decades. There is code buried in pretty much every single USB device to manage the bus, as is true for nearly every storage device, where microcontrollers manage everything from queue scheduling and signal processing on spinning media to the flash translation layers and error correction on SSDs. Then there is all the power-management code running to control internal bus power/frequency, cache power management, etc., etc. Even things you probably think are simple register-model HW devices (say XHCI or NVMe) have microcontrollers buried in them actually driving the bus and maintaining connections. Overwhelmingly, what you think of as HW registers are actually mailbox interfaces to microcontrollers running proprietary firmware.

So, as I mentioned, pretty much the entire HW docs for the RK3399 were released years ago, and that SoC, and the dozens of boards you can find it on, still in general won't work out of the box with a random upstream Linux kernel. And when it "works", the power management tends to be terrible. That is because it actually takes engineering time to make these things work well: someone has to write the drivers, device trees, etc. and go through the pain of getting them merged to mainline. And it's very obvious that while sometimes there are people willing to spend their holidays and weekends making it work, those people are few and far between. And that's just for a few pieces of HW; if there were hundreds of manufacturers making variations on, say, the Pinebook Pro, pretty much none of them would work outside of the hacked-up Debian/whatever that the manufacturer shipped on the device.

And then, say linux actually works well on it, what happens if you want to run netbsd?


> Ever plugged in a GPU? How do you think the BIOS/EFI/GRUB/etc display things on the screen?

That's because the BIOS is not an operating system--or at least it used to not be. And it goes back to the CP/M architecture, which is even older than the 1980's PC. Option ROMs are there for the BIOS and for single-tasking "operating systems" that use it CP/M-style. Modern operating systems don't need that layer.

I have a Guruplug (ARM platform) that boots Linux and what's in flash is U-Boot - a bootloader that loads Linux, the initrd, then it gets out of the way. The only reason why the PC platform can't work like this is because of ACPI.

> Overwhelmingly what you think of as HW registers are actually mailbox interfaces to micro controllers running proprietary firmware.

Oh I know. SATA is a communications interface (as were IDE, ATAPI, and SCSI). USB is a communications interface. NVMe is an elaborate tagged/queued communications interface. Et cetera.

I'm not sure why you are conflating CPU-facing hardware interface (registers) with anything that is physically peripheral side stuff except for this which I will address:

I know a CPU I don't control and don't know the code it's running is on the other side of those links. That's fine--because it can't directly access my OS's RAM unless the OS allows DMA - and it's probably going to cause a device-level issue instead of a machine-level issue if there's a bug in that firmware.

Not seeing the value add--other than saving some poor, poor Microsoft-aligned platform firmware developer a bit of time--of having to jump to a closed-source firmware routine to use those communications interfaces instead of letting the OS talk to them directly.

> So, as I mentioned pretty much the entire HW docs for the RK3399 were released years ago, ...

So what makes the PC platform avoid this mess is not the presence or absence of closed-source firmware running on the main CPUs, but simply the hardware platform itself being standardized, because people copied it from IBM. The BIOS was copied because DOS needed it, not because the hardware needed it, other than a CPU requiring ROM at its initial boot address. There were well-known addresses for each device: DMA channel 0/1, IDE channel 0/1, FDC 0/1, serial port 0/1/2/3, parallel port 0/1/2/3. I really want to know why we couldn't have a hardware standard for the laptop power-control interface instead of APM. The industry managed to settle on PCI in reaction to IBM's attempt to grab back the platform with MCA. PCI doesn't require firmware, and its registers for scanning the bus are well known and standardized in hardware. So why did everyone say it was OK to hide the power-controlling hardware behind APM? DOS of course needed something like that, but you had more than a couple of commercial operating systems on the platform besides it (Xenix I think, OS/2 likely still kicking a bit, NT).

Early 90's when APM started taking hold was also the time when Intel started to not publish certain things in its CPU manuals.

Device trees should be easily obtainable from device datasheets.

Overall, the real problem with the ARM boards is an economy that values time-to-market and treat-your-first-X-customers-as-beta-testers over quality. It's a separate issue that we really shouldn't accept, as an absolute requirement, unauditable code running on the same CPU my operating system is running on.
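For what it's worth, a device-tree node really is just datasheet facts transcribed. A hypothetical PMIC regulator description might look like the following (the device, node names, and values are invented for illustration; the regulator-* properties are standard DT bindings):

```dts
/* Hypothetical device-tree fragment: a PMIC regulator described purely
 * from datasheet facts (I2C address, voltage range). Device and values
 * are invented; the regulator-* properties are standard DT bindings. */
&i2c1 {
    pmic@34 {
        compatible = "vendor,example-pmic";
        reg = <0x34>;                  /* I2C slave address, per datasheet */

        vdd_cpu: buck1 {
            regulator-name = "vdd_cpu";
            regulator-min-microvolt = <712500>;
            regulator-max-microvolt = <1500000>;
            regulator-always-on;
        };
    };
};
```

Nothing in there requires the vendor's cooperation beyond publishing the datasheet; the driver for the "vendor,example-pmic" compatible string is the part someone still has to write.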


The option ROM is frequently required to init the hardware so the OS can see something normal. Modern PCs aren't using 16-bit int services; rather, UEFI GOP drivers sourced from option ROMs handle early display, without which you can't interface with the machine until the OS starts. If you want something like a phone, where you can't replace the OS, that is how one goes about it. Even U-Boot has noticed that standards matter, as it slowly transforms itself into yet another UEFI firmware.

I have a Guruplug too, or rather an OpenRD, because all the Guruplugs died. But that is a 1980's-PC level of simple HW: it's not SMP, and it doesn't support virtualization or any power management to speak of. Compared to a modern piece of HW, the list of things it can't do is longer than the list of things it can.

Modern PCs with ACPI aren't "standardized" outside of the few interfaces being used by the OS; that is my point. All the regulator driving, clock controlling, pin muxing, I2C'ing, and SPI'ing to manage the platform is whatever the vendor put there, in PC land (and ARM land for that matter). It's still there, only the OS doesn't have to worry about those details, because it simply asks to power something on, and it happens, and some other processor takes over picking perf profiles and idling links when needed. If there were a standard PMIC, wired up in a standard way, it might make sense to attempt to drive it directly, but there isn't. Instead it's a bucket of parts wired up in every way imaginable.

ACPI doesn't dictate what is on the other side: it could be an SMM trap, or a management processor, or a BMC, or the function could be written entirely in AML. That is the point; it's just an OS API surface. It could just as well be a standardized pile of HW registers, but that would remove the ability to run in cases where the vendor wants to save a penny and run everything on the main core, as you suggest doing with DT.

A large part of why you provide these interfaces is that having the main core wake up to fiddle with some 100 kHz SPI bus to flash an LED is dumb, wastes power, and eats perf doing work that would be better handled by a small management core. Your basic argument seems to be that you don't want the OS wasting cycles in the firmware, but you're perfectly OK with the OS wasting even more cycles all the time doing these functions suboptimally, in a kernel that doesn't understand the intricacies of power management, on the most power-hungry core in the system.

You seem to think ACPI is always just an SMM trap, and that isn't the case. The reason I conflate the two is that when the remote end is a BMC or similar, you're effectively just poking a mailbox to get some other piece of hardware to do the work, and those mailbox interfaces are no more standard than PMICs; so now you'd need thousands of them to talk to battery management controllers, power/standby buttons, you name it. Instead of standardizing all that garbage, a software API was created. It's no different from OpenGL or any other standard, except that it uses a bytecoded function-call interface (ala OpenFirmware's Forth). Do you hate OpenGL too because your game isn't fiddling fake HW registers?

And now that I've pointed out what happens when a vendor just tosses those register maps you want over the wall, you change the topic to how they are just creating beta-level products, while refusing to acknowledge that someone has to do the work. From a vendor's perspective, it costs a tiny fraction to ship some closed-source firmware that works across dozens of OSes and lets them redesign their hardware, versus hiring experts in a half-dozen OSes to write piles of custom power-management code accessing dozens of drivers talking over dozens of SPI/I2C/mailbox interfaces for a single machine.

It's actually a little bit crazy that people in the ARM space are trying this while simultaneously complaining that HW vendors create piles of patches they can't get upstream, so they build their own custom Linux forks. That's the natural result of making the same claims you're making: a vendor can either spend years fighting with kernel maintainers, or fork Linux and ship it to their customers with a pile of patches that only work in Linux. The middle ground is where a number of them have been going; it's the RPi's proprietary mailbox interface that talks to the VideoCore to set the processor frequency. Multiply that mailbox by a few dozen SoC providers and you have the future of DT on ARM and RISC-V. In a decade or two they will end up standardizing the mailboxes and reinventing ACPI and OpenFirmware in order to support cross-platform mailbox interfaces.


Related: https://news.ycombinator.com/item?id=29302045

Intel Software Defined Silicon: additional CPU features after license activation


Binary blobs are a security threat, a right to repairs threat, an ownership threat.


The last one is the scariest. No doubt this is all part of the "you will own nothing" movement, which some people are still dismissing as a mere conspiracy theory when there's tons of evidence for what is really happening.


Intel seems to be on a mission to slide into historical irrelevance. Rome didn't fall in a day and neither will Intel. I think this gives new meaning to planned obsolescence.


Intel and AMD don't build features no one asks for.

TPM and SecureBoot came directly from large customers requesting exactly that.

HN (and more broadly, PC users) aren't all of the market for CPUs.


The "TPM" you know today started life as Microsoft Palladium, an attempt by Microsoft to create a hardware DRM implementation for "content producers" that could tell the OS what was allowed and not allowed on your computer. The assumed goal at the time was that once the frog was sufficiently boiled it would be used to enforce Windows licensing as well. After much backlash they diverted their efforts into a cabal with Compaq, HP, IBM, and Intel to make it more general purpose and actually allow the end user to store keys.

Secure boot was mostly just a toy until Google implemented it in Chromebooks and the large customers you speak of pointed and said "I want that!"


You seem to imply customers demand bridled CPUs. Thinking about it, there is one scenario in which their "sw-defined silicon" (hello newspeak) would both make sense and not be too much of a fuckery: instead of their regular process, which now consists of shipping unvalidated HW and waiting for bug reports before flipping chicken bits in the next microcode, they could ship processors with only the properly validated parts enabled at first, and make people pay for extra perf or features when the remaining validation is actually completed, and if it passes.

I do not really believe this is their intent though: people continue to act as if a mere 5% better perf on arbitrary benchmarks has any significance (even regardless of power consumption). So Intel will likely continue to ship broken but allegedly fast processors, and will continue to kill the perf to remove the bugs once they have sold enough.

Plus, in a competitive market, that would be a hard equilibrium to reach if competitors are not also charging for their own deferred enablement.


There definitely are major customers (for example, cloud providers) which explicitly demanded "bridled CPUs" like TPM and SecureBoot. It's not something imposed by Intel: those features are required by an important part of the market, ignored by another important part of the market (unsophisticated consumers), and opposed by a small part of the market which is highly represented among tech and privacy enthusiasts here on HN, but is not really significant in size.


Yep. Being able to buy processors that don't trigger Oracle's core-count licensing (for example) but can have cores varied on later and repurposed in the general VM farm is something I was very keen on in a past life.

It's a relatively niche interest, but it exists.


> cloud providers) which explicitly demanded "bridled CPUs" like TPM and SecureBoot

You seem to be terribly confused. End customers demand features, they do not demand limitations.

AWS and GCP don't want to spend money acquiring or powering chips with dead areas of silicon.


"Dark Silicon" is actually ubiquitous in modern chips. They can't possibly run 100% of the chip while staying within power and thermal constraints - and this gets worse, not better as process technology improves. So, even if some of that dark silicon is "software defined" it just doesn't matter all that much.


End customers demand features for a price.

If there's unused silicon in a chip, but it costs the same or cheaper and has the features they want, I think you'd be hard pressed to find a customer who's going to complain.


If it's possible, customers will buy as many of the cheap ones as they can, at the lower price, and aftermarket enable the unused silicon to get a better product, for less money, which may make the business people unhappy. This happened with some GPUs where they could be "upgraded" to be a better product with a soldering iron.


I'm familiar, I've gotten a few free AMD cores. https://docs.google.com/spreadsheets/d/19Ms49ip5PBB7nYnf5urx...

At the end of the day, I don't understand the umbrage at companies segmenting by software instead of hardware. As long as it's obvious what you're getting when you buy, and what you buy lives up to what you were promised, why should I care what it actually is?

If the company chooses to look the other way and allow relatively easy unlocks without promising stability, well, that's nice. But nothing I'm owed just because it was physically on the chip.

Now on the other hand, if a consumer company switches from a product model to a lease/HaaS model... that's an entirely different can of worms, and they can go to hell.


Great. Where can I get a CPU where that stupid Intel ME or AMD crap is just permanently fused off, and even pay less for it?


You are paying less for it, in retail CPU prices that are contingent on the same model with ME features being sold to integrators and enterprise organizations who use it.


Your computer won't boot if they are fused off. They are a necessary component of the CPU.


> Intel and AMD don't build features no one asks for.

Hmmm. Intel has built whole ISAs that no one asked for and didn’t want.


If you're referring to IA-64, then what information available in 1994 would have made you believe that VLIW / EPIC wasn't a viable path to provide >1 IPC at competitive clock speeds? Something very many customers were asking for and wanted.

Some good ideas work. Some good ideas don't. That doesn't mean they were bad ideas, only that factors unknown at the time, or later developments, turned them into bad ones.


> what information available in 1994 would have made you believe that VLIW / EPIC wasn't a viable path

The i860. High peak performance on paper, realizable in approximately zero real-world code.


I wasn’t thinking of IA-64 actually.

Not sure how the rest of your comment relates to my point though. Customers may want better performance. Doesn’t mean they want your new architecture. They certainly didn’t want IA-64. That’s just a statement of fact.


Which ISA were you talking about?

And as for the link between architecture and performance, I'll turn the common quip: "You can have performance or architecture changes, pick two."

The debate between runtime parallelism vs compiler parallelism, and which would result in greater performance on real world workloads, was an open question at the time.

As it turned out, the market preferred answer was "Screw it, we'll push superscalar and add more cores." But that's non-obvious in foresight. See: the famous P68/NetBurst/Pentium 4 vs P6+/Pentium M struggles.


I broadly agree that it was non obvious and don’t really blame Intel for trying IA-64.

I was really (semi humorously) trying to push back against the parent comment saying that everything these firms have done has arisen directly from customer demand. Obviously firms must think that there is demand for new products or features but sometimes they just misunderstand the market.

To give another example was there really any demand from customers for x86 cores in smartphones or was it Intel just trying to establish a market presence?

To answer your question: iAPX432.


iAPX432 is a good example. :-)

But in the early-80s, Symbolics was making a lot of noise with Lisp machines, the late-80s AI winter hadn't yet set in, and there were certainly worse bets than CISCy "we need to integrate up the stack" ideas.

x86 on mobile is a hard one. I see why Intel did it: it's what they had. And I'm sure Microsoft was whispering some demand in their ear behind the scenes.

But it really only made strategic sense pre-App Store volume (so say pre-2010?). Once the mass of code existed on ARM and built for iOS, that genie wasn't going back in the bottle.

As far as I've heard it told, that was more of a financial business decision though. Intel wasn't willing to cut its margins, because it couldn't see trading volume for margin as mobile exploded.

If they'd been able to offer the market cheap, sufficiently performant, power efficient chips for mobile, I think present would look a lot different.


Well, it's not for lack of ambition. Their roadmap has them back on top with 18A in 2024. They also have plans to make a zettascale (yes, zetta) supercomputer by 2027 ± 1 year. As much as Intel is known for stifling progress, so were all (most? many?) incumbent monopolies. I hope they stay close enough to AMD, and vice versa, for healthy competition.


the 12th gen CPUs are quite close to AMD on performance; perf per watt obviously depends, but it used to be way worse. don't assume intel is dead; AMD was assumed dead back in bulldozer years.


can someone kindly ELI5 what exactly is software defined silicon ? is it another term for FPGA ? or is it something more capable than that


Back in the 1990s you could buy a big IBM mainframe with 8 CPUs, and they'd ship you one with 16 CPUs and if you decided to upgrade later, they just sent a guy to flip a secret switch.

This was never popular with the Free Software crowd, who believe in the principle that the owner of the hardware has the right to do whatever they like with it. They would say you have the right to flip the secret switch yourself.

This is the same idea, but updated for the 21st century. My laptop came with hardware support for virtual machines. In a few years' time, perhaps features like hardware-accelerated virtual machine support will be separately licensed premium features instead of stock ones.

Some companies already do this - for example, some Teslas have autopilot and long range batteries, but they're disabled in software. And if you buy a Tesla with those features second-hand, Tesla can disable them until you pay them.


> And if you buy a Tesla with those features second-hand, Tesla can disable them until you pay them.

Everything I've seen says this is not the case. As I understand it, the story in the news about "something like this" was misleading. The car went through a dealer and was sold certified pre-owned, and it was sold as not having those features. It turns out they forgot to disable those features. If you buy one from a third party that already paid for the feature, it stays with the car.

At least, that's the understanding I was left with.


> The car went through a dealer and was sold certified pre-owned; and it was sold as not having those features.

Was the car sold to its original owner with those features enabled? If so, it's effectively the same as the original story reported, meaning that Tesla expected to be paid twice for the feature.


As I understand it,

- Car was sold to original owner, who enabled the features

- Car was sold back to dealer/Tesla (presumably a trade-in?)

- Car was sold to second owner, with information that it did not include the features

- Turns out the car had the features enabled (they mistakenly forgot to turn them off)

- Features remotely disabled

There's nothing inherently wrong with Tesla charging both the original and second buyer for the features, as those features make the car worth more. Conceptually, it's the same as paying more for a car with a better entertainment system; the person who buys it next (from the dealer after you trade it in) then pays more for it than they would for one without that system.


In the story that made the big splash, after

> - Car was sold back to dealer/Tesla (presumably a trade-in?)

a used-car dealer bought the car from Tesla and was given (legally required) documentation about the car, which indicated the car had the features. Then Tesla removed the features remotely while the car was owned by the dealer, without informing them (because they decided during an audit that it shouldn't have had them). The dealer then showed the documentation to the customer (as required) and sold the car, both sides in the understanding it had the features as documented (because why wouldn't it, if Tesla documented it as such). The final customer then found the features disabled (or being disabled on an update/sync).


> if you decided to upgrade later, they just sent a guy to flip a secret switch

So what prevented you from finding and flipping the secret switch yourself? Or is it that merely being a business you couldn't afford to take this risk and lose IBM support?


Lawyers, an army of lawyers.

You don't want to get letters from Oracle, IBM, Intel and Co. because you infringed some of their EULAs. The EULAs may be invalid, but it will be a very costly, decade-long court case. And the other party will pump thousands of times more money into it than you can afford, because when they lose, they will lose a big income stream in the future.


So, free upgrades in a couple years when this inevitably gets broken in to?


I don't think it's inevitable that this'll be broken into. Public/private crypto in high end process nodes has a pretty good track record.


Modern Nvidia cards detect you're trying to mine for cryptos, and artificially limit the performance of the silicon.

It's the opposite of being capable: it's silicon that refuses to run at its full capacity unless you buy an additional license.


That's just nuts. Sell me at what price you will but after I buy it, you should not control a single bit on my hardware.

It's like we are going backward in progress.


Are they doing it to get a bigger cut of miners' funds? Or to try to find a way to keep the pipeline for designing, building, and coding for gaming GPUs flowing?


I think on the whole, Nvidia isn't thrilled about crypto mining.

It made them a lot of money, but they (rightly so) look at it as margin that could evaporate tomorrow. And in the mean time pisses their core users off (through volume availability issues) and makes an already difficult production -> retail supply chain even more bursty.

So how do you deal with that? Well, a not-terrible way is artificially segmenting your market. Crypto folks, over here in this pool with huge waves. Everyone else, back to business as usual.


Nvidia isn't thrilled about crypto mining because it creates too many second-hand cards, that's it. They just want to kill the second-hand market boom that happened as crypto mining has high hardware turnover, and gaming doesn't. So gamers just pick up second-hand previous generation high end cards instead of buying current gen mid-range cards.

It's not a coincidence the software lockout landed right as Nvidia released an unlocked mining card without a display output ( https://www.nvidia.com/en-us/cmp/ ). And since it doesn't have a display output, all those used mining cards become ewaste instead of something that cuts into nvidia's revenue.


Exactly. If Nvidia removed the crypto crippling, almost all of their GPUs would be bought up for crypto mining leaving their gaming market empty. A new entrant would come in and take up their gaming market. Meanwhile, Nvidia's production (and capital expenditure with it) could ramp up dramatically to meet crypto demand, until it too collapses. Then they are left with nothing but debt.


> A new entrant would come in and take up their gaming market.

Miners would probably find a way to use that new entrant's GPUs for mining too.

I think the real threat is that mining could chill the gaming industry as a whole. Either by pricing tons of people out of gaming so they find new hobbies and maybe never come back, or by influencing the direction of game development away from advanced graphics. If most gamers can't afford top-end GPUs anymore, then the perceived value of [Latest Game] having the most photorealistic graphics yet could be flipped into the negative by a change in gamer culture.


Or they just buy a PS5 or Xbox (both have AMD chips) and never come back to PC gaming.


Joke's on you: consoles have also been out of stock since forever.


They sell another GPU specifically for mining, at twice the price.

https://www.pcgamesn.com/nvidia/CMP-170HX-price-double-RTX-3...


> Are they doing it to get a bigger cut of miners' funds?

No, they are trying to piss off the miners so that their actual customers (gamers) have a chance of getting GPUs again.

It's like the first Bitcoin craze many years ago when the first CUDA miners came out - you could not find any GPU anywhere because people bought them for mining. Gamers were pissed as the miners outbid them everywhere. Once ASICs came out, the complete GPU market crashed - suddenly there was a massive supply of second-hand GPUs from miners... the gamers snatched these up, and NVIDIA suddenly had the problem that their new GPU sales crashed.

The only difference is that this time it isn't Bitcoin anymore but a whole bunch of random shitcoins.


This has been the norm for a long time - the AMD Hawaii GPU was sold as the 290X to gamers with FP64 artificially crippled to run much more slowly than normal, and then sold at a much higher price to other richer customers who use them to make money as the Radeon Pro series (in the case of Hawaii - it was the FirePro S9150 and S9100). Traditionally it has been FP64 performance, as well as things like looking at the OpenGL calls and artificially limiting performance when something looks like a CAD workload (see: the big increase when NVIDIA removed the limitations on Titan X professional workloads after Vega FE came out).

Somehow nobody shed a tear for the poor enterprise customers until miners came along. But that's what it is, miners are enterprise customers, they use them to make a profit.

Similarly, almost all R5 consumer processors are made by artificially disabling fully functional cores on R7 chiplets. AMD already had nearly 80% of the chips coming off the line with 8 functional cores at Zen2 launch, and numbers have only gotten better since there. Yeah, maybe a few fail clock bins or other things, but those bins are chosen very loosely such that they're not throwing away any significant amount of chips anyway. The overwhelming majority of R5s are perfectly good chips being gimped and sold as lower chips, because the availability and demand are exactly opposite of each other - they have the highest availability of 8-core chiplets and the highest demand for 4- and 6-core chiplets. How do you square the two? Not by lowering prices of your high-margin products - you keep those high to extract the maximum price from price-insensitive consumers, and you artificially gimp a bunch of them for the rest of the market to protect your profits in the higher segment.

https://en.wikipedia.org/wiki/Price_discrimination

The thing to remember is that home users are almost always the beneficiary of market segmentation, because we are the most price sensitive. The alternative to Core-vs-Xeon-E segmentation isn't that everyone gets Xeons at i5 pricing, it's that the i5 now costs almost as much as a Xeon. The R&D has to be paid back one way or another, or else R&D will be significantly reduced (most likely meaning much longer product generations and slower moves to new nodes, etc). Price discrimination means enterprise customers pay more than their share and home customers pay less than their share, so if market segmentation goes away and those equalize then home customers will have to bear more of their fair share of the product cost.

Again, who wants to race to pay more for their gaming graphics card so that Boeing and cryptominers can pay less? It would be great if everything were free, or priced at its true cost, but in the real world R&D needs to be paid for somehow.

Oh, and the other thing is enterprise customers actually love this, it's called Capacity on Demand and I'm sure the reason Intel is doing this is because their customers keep asking for it. Sun, Oracle, Fujitsu, IBM - all the big enterprise vendors will sell you a machine with a bunch of cpu and memory that you're not allowed to use or touch! But if your workload scales up, instead of having to shut down your mainframe to upgrade it, you can just pay money and turn the cores on! And sure, it would be nice if all software were just microservices that you could re-instantiate onto another machine, but some stuff (like databases) can't be trivially scaled and partitioning (read: stale data reads) can't be tolerated. The only surefire answer to CAP theorem is One Big Machine such that you never have to deal with splits in the first place, and that's what some situations need, and they love this.


We had the same thing in software and in media content.


We are.


Who’s “we”? You are going backward. NVIDIA’s profits are going forward.


Justify your statement that the poster is «going backward». Clearly the poster means, "in terms of values".


Technically all 30-series video cards came with virtual GPU support, whether consumer or professional. It's just disabled by default and needs a secret activation code from the driver.

Someone cracked it and enabled them with a custom driver. (And of course Nvidia patched it out in the latest VBIOS to make the bypass impossible.)

There was a story about this on HN a few months ago.


What will this practice mean for airgapped machines? Probably that their hardware will only work at a fraction of its capabilities. But not necessarily only that: unlocking hardware features may mean, depending on implementation, that at some point in time a network connection is assumed.


You should install the license the same way you install other software.


> the same way you install other software

You mean, completely anonymously, and without any association of a machine to an owner?

I install software through `adb install software.apk`, or through `make`, or similar, of packages stored on drive...

I am not sure this can be the model for this new "unlocking hardware features" scheme. Many will just say, "No "thank" you", and I am not sure about all the practical consequences. I am even more concerned that this paves the way for taking for granted some practice like "register after purchase to be enabled to use the product", which I am informed some have already attempted.


People are acting like this is being implemented in consumer CPUs. I don't see any indication of that. This is likely for people who need powerful on-site compute, but can't afford (or don't yet need) the most powerful version of Intel CPUs. This is a way to shave down the price tag, by letting people buy the most powerful CPU, but restricting it down to what you actually need/can afford, so Intel can sell it to you at a subsidized discount.


(Also, upgrading CPUs costs labor, so having the CPU "pre-upgraded" is another cost savings)


Is AMD the same about adding secret blobs to their hardware?


Yeah, the PSP with its opaque firmware even handles DRAM initialisation on Zen processors, so that you cannot run without it at all.


A lot of DDR initialization code is heavily patented and often considered a "trade secret", which is why it's rarely released and publicized. It's not that complicated if you know hardware and work with the official internal data sheet. I've seen those as part of partnerships between CPU vendors and my workplace. You get to read them on a special Remote Desktop so you can't copy them. Same for ME. Once you know what's actually in the code, the conspiracy theories do seem funny :)


Marvell has their DDR3+4 initialization straight on GitHub: https://github.com/MarvellEmbeddedProcessors/mv-ddr-marvell (even this includes a Synopsys blob for the PHY, but only on some platforms — no blobs on A70x0/A80x0 thankfully)

It's really stupid that other vendors can't just do the same.


The conspiracy theories are funny, but the bullshit situation of being locked out of things you yourself OWN isn't. I think that's what most of us hate about things like ME or PSP.


ME/PSP isn't that offensive in terms of "locking you out", it's just a coprocessor that's there. Intel Boot Guard on the other hand can actually prevent custom firmware from running.


Being a TPM is one of the Intel ME's and PSP's functionalities. Making a TPM that'd be secure against custom firmware is possible, but isn't widely implemented across the industry.

It’d need measuring the hash of the loaded firmware itself to hardware managed PCRs that affect the hardware crypto engines. Something that’s done on the Secure Enclave for Apple A13 onwards AFAIK.

(also… managing DRM systems, they’ll have to be put on a coprocessor somewhere.)


It's not a wild conspiracy theory that there's an unlabelled "High Assurance Platform" bit on Intel platforms to disable the Management Engine. The same sources that identify the bit as the "HAP" control claim it was requested by three-letter agencies. I set it, and it does disable/debilitate the ME, so that much is true. That the option exists at least suggests that the ME is an attack surface worth worrying about for someone, i.e. state-level actors, even if it's not known to be exploitable.

I assume you meant to help by assuring people their computers aren't backdoored, but I don't think you have the certainty to claim the ME is sound and secure. It just reminds me of how well-meaning people insisted the USG wasn't interested in capturing everyone's communications, back before it was accepted fact, pre-Snowden.


Disabling ME is part of its functionality. You’re just not given the API to do it. To the best of my knowledge this functionality has been in there from the beginning.


Yeah, I would expect it was there from the beginning, it's just never exposed as user-configurable on standard firmware. Being in there from the beginning doesn't really invalidate the claim by others that it was requested by the DoD or some agency therein. They are obviously a big client with leverage to have a say in chip features. Even if that part is made up, the off toggle's existence still implies security concerns, since the coprocessor shouldn't have any effect on performance.


This guy now works for intel: DEF CON 26 - Christopher Domas - GOD MODE UNLOCKED Hardware Backdoors in redacted x86

https://www.youtube.com/watch?v=jmTwlEh8L7g This is one of many amazing videos this guy did before they hired him.


It's commendable, but if you look closely https://github.com/MarvellEmbeddedProcessors/mv-ddr-marvell/... it's not all "open". I think there is a fair amount of discussion about whether poking a bunch of random values into random registers is "open".


It's either patented or a secret, it can't be both. Patents must by definition be public.

...and hardware manufacturers make money selling hardware, so why keep the details of how to use it secret? In the days of SDRAM and DDR(1) a lot of companies, including Intel, were far more open about documentation than today.

I've analysed the init code from BIOSes and it is indeed not that complicated.
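To give a sense of what "not that complicated" means: the shape of such init code is typically just an ordered table of register writes followed by polling a status bit. The offsets, values, and the calibration-done bit below are all invented for illustration, not taken from any real controller's datasheet; the vendor-specific "secret" is mostly which values go in the table.

```python
# Hypothetical DRAM-controller bring-up: write a vendor-tuned table of
# (offset, value) pairs in order, then poll a calibration-done bit.
INIT_TABLE = [
    (0x000, 0x00000001),  # assert controller reset
    (0x010, 0x0A0A0A0A),  # packed timing parameters (tRCD/tRP/...)
    (0x020, 0x00000F0F),  # drive strength / on-die termination
    (0x000, 0x00000000),  # release reset, start training
]
STATUS_REG = 0x004
CAL_DONE_BIT = 1 << 0

def ddr_init(write32, read32):
    """write32/read32 stand in for MMIO accessors."""
    for off, val in INIT_TABLE:
        write32(off, val)
    for _ in range(1000):  # bounded poll; real code would also delay
        if read32(STATUS_REG) & CAL_DONE_BIT:
            return True
    return False  # calibration timed out
```

The hard part isn't the code structure; it's knowing the right values, which is exactly what the NDA'd datasheets cover.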


Depending on the code and vendor, it is either patented or a trade secret. Most of the time the code is kept protected because the licensing agreements require it. Remember that a lot of chip companies you know today started their life as memory companies.

There are also a lot of businesses that act as patent trolls, e.g. RAMBUS, and stifle memory innovation by suing everyone left and right and enforcing their patents. Part of the licensing agreements is to keep stuff secret and proprietary. Otherwise, if you give something of value away for free, you can't make money off of it anymore, can you?


The code involved isn’t even that long… but mistakes can result in very annoying behavior. Debugging issues with that code is a pain.


Yes, it’s a huge pain to the point where you don’t want to be stuck fielding support calls from enthusiasts tweaking things on their own. Huge pain in debugging this code


> Once you know what’s actually in the code the conspiracy theories do seem funny

Perhaps funny, but inevitable due to the lack of knowledge. Why don't you do us a solid and get us some documentation for our hardware using a cell camera?


> The code outlined a process for enabling new features by verifying cryptographically signed licences

This smells a lot like laying the groundwork for locking processor features behind a license key. Get ready for a monthly CPU license plan...
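Nobody outside Intel knows the actual scheme, but conceptually a signed-licence unlock could look like the sketch below. All names, fields, and the fused key are made up, and HMAC is used only to keep the example self-contained and runnable; a real design would use an asymmetric signature so the chip holds only a verification key and cannot mint licences itself.

```python
import hashlib
import hmac
import json

FUSED_KEY = b"per-soc-secret"  # hypothetical key burned into the die

def make_license(features, cpu_serial, key=FUSED_KEY):
    """What the vendor's licensing server would do (sketch)."""
    body = json.dumps({"serial": cpu_serial,
                       "features": sorted(features)}).encode()
    sig = hmac.new(key, body, hashlib.sha256).hexdigest()
    return body, sig

def unlock(body, sig, cpu_serial, key=FUSED_KEY):
    """What the CPU/driver side would check before enabling features."""
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return []                    # bad signature: nothing unlocks
    lic = json.loads(body)
    if lic["serial"] != cpu_serial:  # licence is tied to one part
        return []
    return lic["features"]
```

Note the serial binding: a licence bought for one chip would be useless on another, which is what makes subscription-style enforcement possible.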


The devil's in the details but if the price is right then this isn't fundamentally worse than any CPU with factory disabled cores. The difference being that you could now pay to unlock the feature instead of buying a whole new CPU.

On the other hand if I have no guarantee the license is valid forever and I own it, or that I can use it in any conditions not just with internet connectivity, or it's tied to the rest of the machine, or only until I want to sell my hardware, then it does become fundamentally worse.

This might be a new way to enable more flexibility for cloud providers so they can customize each offering without having different sets of hardware, just a license management component. I also see this as something OEMs may be interested in, tiering otherwise identical products could be just a FW switch away, greatly simplifying logistics and manufacturing. OEMs could customize the CPUs they receive while Intel would shrink the number of SKUs. Being Intel though means none of the savings will ever reach the consumer.


Cores that are permanently disabled from the factory help improve yields. Features that can be turned back on later don't, because they still need to pass QA.


They can also improve monetary yields. On mature processes, the yields are probably higher than demand - cores are being disabled because they have more buyers for a 32 core than a 48 core chip.

If they can sell another 16 cores at no incremental cost post sale, that's attractive. As long as it doesn't prevent that customer from buying a whole new 48 core chip from them, anyway.


When Intel starts disabling cores that are fully functional, then the practice shifts over into the bad column of artificial product segmentation. It may be good for Intel, but it's a problem for consumers. Likewise, features that ship to every customer but can only be turned on by eg. Amazon are worrying for obvious reasons.

The only reason why we tolerate such clear signs of market failure is because in the larger picture, Intel et al. need to keep their margins high and consistent to keep Moore's Law alive(ish). Maintaining such a fast pace of progress is probably better for customers in the long run, even if it means getting ripped off in the short term. But without such a benefit to make the tradeoff worthwhile, customers and economic policymakers would be right to be incensed when Intel uses wasteful tactics like disabling defect-free silicon in order to insulate their price sheet from the true supply and demand dynamics of their market.


> The only reason why we tolerate such clear signs of market failure is because in the larger picture, Intel et al

Let's not forget our history here. I'm not a big fan of Intel, but any techie will remember we successfully "tolerated" such practices decades ago when the Celeron 300A overclocked far past its rated frequency, the Radeon X800 series got 50-100% more pixel shaders, or the Phenom X3s got 33% more cores for free. If anything, customers were rejoicing.

The market was always artificially segmented. The only reason we don't know today whether disabled blocks are good or not is that manufacturers have gotten better at keeping them disabled.

You'd be hard pressed to tell me the actual real life difference for you between a $300 4-core CPU with 2 cores fused off, and a $300 4-core CPU with 2 cores licensed off.

> Intel uses wasteful tactics

Intel fashion dictates the user will see little benefit but differentiating SKUs in software is certainly less wasteful from the manufacturer's and OEM's perspective. The logistics alone might be worth it.

Software manufacturers ship full installers with features disabled in licensing. Streaming services store all content but serve you a lower quality one based on your particular "license". Car manufacturers sell essentially the same engine with different power outputs.
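The software version of this is easy to sketch. Below is a minimal, hypothetical illustration in Python (the vendor secret, feature names, and HMAC scheme are all invented for this example, not any real vendor's mechanism): every code path ships in the product, and a signed key merely toggles which ones activate.

```python
import hmac
import hashlib

# Hypothetical vendor signing secret -- invented for illustration.
SECRET = b"vendor-signing-key"

def sign(features):
    """Produce a license key authorizing a set of feature names."""
    msg = ",".join(sorted(features)).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def unlocked(features, license_key, requested):
    """Every feature ships in the binary; the key only gates activation."""
    if hmac.compare_digest(license_key, sign(features)):
        return requested in features
    return False

# A "pro" tier license covers two features; the shipped binary is
# identical for every tier -- only the key differs.
pro = {"raid", "ecc-reporting"}
key = sign(pro)
print(unlocked(pro, key, "raid"))      # True: valid key, feature in tier
print(unlocked(pro, "bogus", "raid"))  # False: invalid key
```

The same shape applies whether the gated thing is an installer feature, a streaming quality level, or a fused-but-present hardware block.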

As I said, there are no bad products, just bad prices.


You seem to be ignoring that the hypothetical alternative is a market where prices have downward flexibility, so that rather than gambling on unsupported hacks, you could get more cores for your money just by waiting for later in the product cycle. A 4-core chip with two cores fused off at $300 and a 4-core chip with two cores licensed off at $300 are definitely not equivalent options to be comparing, because in a market with no artificial product crippling Intel's price sheet would have to change.

The fact that artificially crippling chips is not new does not mean it's a good thing. Having accepted a tradeoff doesn't erase its downsides.


>When Intel starts disabling cores that are fully functional, then the practice shifts over into the bad column of artificial product segmentation.

Why?

If their yield improves, should lower core count chips become scarce and then approach the cost per core of higher end chips? That's not exactly in the interest of a consumer anyway.

Also, they absolutely already do that today: all chip manufacturers make perfectly viable cores inaccessible so they can maintain the part mix that matches how products are selling.

If your demand curve perfectly matches your supply/yield curve, great - but that's rare over time.


I think you've lost sight of what supply and demand curves are supposed to represent. They're not supposed to line up or overlap; they're supposed to intersect at a point representing the equilibrium price. And when the supply curve shifts, that equilibrium point also shifts. That's how healthy competitive markets work.

As yields improve, consumers should expect to see changes other than dwindling stocks of low core count parts. High core count parts should be getting cheaper, or low core count parts start showing up with higher speed bins or lower voltage bins.


The distinction is purely academic and it explains the "why" but doesn't change the "what". For all intents and purposes the components have some features inaccessible for [reasons]. Similar to how features licensed in software are still usually present in the shipped libraries. What makes the difference are the price and the licensing terms for unlocking that feature.

Chips have been known to have perfectly functioning blocks disabled, or frequencies limited, not for improved yields but simply because demand for the lower end SKUs was so high. Unlocking them was a matter of penciling over some bridges, flashing a firmware, or simply changing a setting in the BIOS.

If anything, this kind of licensing opens the door to "hardware piracy", where under the right conditions you may be able to hack older or unsupported hardware to gain the full functionality.


It makes sense for unlockable cores on server-grade hardware: it would allow bootstrapping companies to buy in low and then upgrade on demand. But it's pretty safe to say that this is going to spill over into the consumer space, where you buy a 16-core processor at your max budget and then find out there are another 16 cores that software prevents you from accessing and that would require a monthly fee to unlock, which is ridiculous.



Hadn’t seen this before... and this license key operates at the hardware level, not the OS level? One can’t expect mpeg hw decode to function with Arch or Nix installed on a Raspberry Pi instead?


Hardware level. It's a compromise imposed on the chip makers, because the MPEG decoder (or part thereof) that they're using is subject to patent licensing. So in order not to pay the patent license on every unit, it ships locked out. When you pay for a key, a chunk goes back to the licensor (presumably MPEG-LA).
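For what it's worth, on the Raspberry Pi the unlock lived in the firmware config rather than the OS, which is why the key worked the same under Raspbian, Arch, or anything else. A rough sketch of the process (the key values below are made up; real keys were sold bound to the board's serial number):

```shell
# Find the board serial number the purchased key is bound to
grep Serial /proc/cpuinfo

# Add the key(s) to the firmware config, which is read before any OS boots:
# /boot/config.txt
#   decode_MPG2=0x12345678   # hypothetical key value
#   decode_WVC1=0x87654321   # hypothetical key value

# After a reboot, ask the firmware whether the codec block is unlocked
vcgencmd codec_enabled MPG2   # reports MPG2=enabled or MPG2=disabled
```

Because the check happens in the GPU firmware before the kernel even loads, the choice of OS doesn't matter.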


>(presumably MPEG-LA).

MPEG 2 has been patent free since 2018 in the US and 2020 worldwide, along with AAC-LC. They are charging for the license key for the use of that specific decoder block only, i.e. the IP of the actual HW decoder, not the codec itself.

MPEG 2 and VC1 are so low in complexity that you can software decode them with minimal CPU resources on an RPi 4.


> MPEG 2 has been patent free since 2018 in US and 2020 worldwide.

The patents were still active, and the technology was still useful, when the Pi Foundation started selling the license keys in 2012. Nowadays... yeah, they're irrelevant.


Although the later models have fast enough CPUs to not need the hardware decoding


...and it's already been cracked.


It’s been going on for years. Almost every Intel CPU has a cryptographic key embedded in it. If you get a special license from Intel you get a full unlock which allows you to “get root” on the CPU and have full debug capabilities. This is how UEFI and other low level software get debugged during CPU bring-up for example.


Intel's been doing this for 5 years already: https://www.intel.com/content/www/us/en/support/articles/000...

You have to buy some stupid dongle to enable "Intel® Virtual RAID on CPU" feature, and it's like $100 or something silly.


Sounds like Intel plans on doing the same thing with their chips that Tesla does with its batteries.


I think it would be cool to just be able to instantly get improvements to my machine without having to go through the hassle of dealing with ordering and installing physical hardware.


DRM?


People get bent out of shape over Intel doing this yet rave about using an Apple M1 MacBook Pro, which is entirely 100% custom/proprietary silicon.

HN cracks me up.


"People" also think Apple laptops are stupid and overpriced. HN is not a homogeneous group with a single opinion on every topic, so such "gotcha" arguments make little sense.



