
> Write correct, readable, simple and maintainable software, and tune it when you're done, with benchmarks to identify the choke points

If speed is a primary concern, you can't tack it on at the end; it needs to be built in architecturally. Benchmarks applied after readability and maintainability goals have been met only measure the limits of that particular approach and focus.

They can't capture the results of trying and benchmarking several fundamentally different approaches at the outset in order to choose the best initial direction. In that case, "optimisation" is almost happening first.

Sometimes the fastest approach may not be particularly maintainable, and that may be just fine if the component is not expected to require maintenance, e.g., pure bare-metal C in a bespoke, one-off embedded environment.



Well, yes. Architect for performance, try not to do anything "dumb", but save micro-optimizations for after performance measurement.


The problem with all of these rules of thumb is that they're vague to the point of being vacuously true. Of course we all agree that "premature optimization is the root of all evil" as Knuth once said, but the saying itself is basically a tautology: if something is "premature", that already means it's wrong to do it.

I'll be more impressed when I see specific advice about what kinds of "optimizations" are premature. Or, to address your reply specifically, what counts as "doing something dumb" vs. what is a "micro-optimization". And, the truth is, you can't really answer those questions without a specific project and programming language in mind.

But what I do end up seeing across domains and programming languages is that people sacrifice efficiency (which is objective and measurable, even if "micro") for a vague idea of what they consider to be "readable" (today--ask them again in six months). What I'm specifically thinking of is people writing in programming languages with eager collection types that have `map`, `filter`, etc. methods, who chain four or five of them together because it's "more readable" than a for-loop.

The difference in readability is absolutely negligible to any programmer, but they choose to make four extra heap-allocated temporary arrays/lists and iterate over the N elements four or five times instead of once because it looks slightly more elegant (and I agree that it does). Is it a "micro-optimization" to just opt for the for-loop so that I don't have to benchmark how shitty the performance is in the future when we're iterating over more elements than we thought we'd ever need to? Or is it not doing something dumb? To me, it seems ridiculous to intentionally choose a sub-optimal solution when the optimal one is just as easy to write and 99% (or more) as easy to read/understand.
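
To make that concrete, here's a minimal TypeScript sketch of the pattern (the order shape and the 1.08 multiplier are invented for illustration):

  // chained version: each stage allocates a fresh intermediate array,
  // so the data is walked (and copied) once per stage
  const chained = (orders: { total: number; shipped: boolean }[]) =>
    orders
      .filter(o => o.shipped)  // intermediate array #1
      .map(o => o.total)       // intermediate array #2
      .filter(t => t > 100)    // intermediate array #3
      .map(t => t * 1.08);     // final array

  // single-pass version: one traversal, one output array
  function singlePass(orders: { total: number; shipped: boolean }[]): number[] {
    const result: number[] = [];
    for (const o of orders) {
      if (o.shipped && o.total > 100) result.push(o.total * 1.08);
    }
    return result;
  }

Both produce the same array; the second just does it in one pass with a single allocation.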


Ok, a bit more detail then. :)

Architecting for performance means picking your data structures, data flow, and algorithms with some thought towards efficiency for the application you have in mind. Details will vary a lot depending on context. But as many folks have said, this sort of thing can't be done after the fact.

As for "doing something dumb", I've often seem fellow engineers do things like repeatedly insert into sorted data structures in a loop instead of just inserting into an unsorted structure and then sorting after the inserts. If you think about it for just a minute, it should be obvious why that's not smart (for most cases.) Stuff like that.

What do I mean by "micro-optimizations"? Taking a clearly written function and spending a lot of time making it as efficient _as possible_ (possibly at the expense of clarity) without first doing some performance analysis to see if it matters.

Nobody's saying to pick suboptimal solutions at all.


> As for "doing something dumb", I've often seem fellow engineers do things like repeatedly insert into sorted data structures in a loop instead of just inserting into an unsorted structure and then sorting after the inserts. If you think about it for just a minute, it should be obvious why that's not smart (for most cases.) Stuff like that.

That's a great example that I've seen in the wild as well!

> Nobody's saying to pick suboptimal solutions at all.

No, I realize that. And most of my comment wasn't intended as some kind of direct disagreement with yours. It was mostly just some observations. One of which is that advice about writing efficient code is usually too vague to be useful, and the other is that people take the "don't optimize without measuring" advice to mean something ridiculous in the opposite extreme that reads more like "just write whatever garbage looks pretty to you, because any forethought about what makes sense to the computer is premature optimization". I wasn't trying to say that's what you were advocating for, though.


I don't know if this kind of embedded development is still alive. I'm writing firmware for an nRF BLE chip which is supposed to run from a battery, and their SDK uses an operating system. Absolutely monstrous chips with enormous RAM and Flash. It makes zero sense to optimize for anything, as long as the device sleeps well.


A little over 10 years ago I was doing some very resource-constrained embedded programming. We had been using a custom chip with an 8051-compatible instruction set (plus some special-purpose analogue circuitry) and a few hundred bytes of RAM. For a new project we used an ARM Cortex M0, plus some external circuitry for the analogue parts.

The difference was ridiculous - we were actually porting a prototype algorithm from a powerful TI device with hardware floating point. It turned out to be viable to simply compile the same algorithm with software emulation of floating point - the Cortex M0 could keep up.

Having said all that though: the 8051 solution was so much physically smaller that the ARM just wouldn't have been viable in some products (this was more significant because having the analogue circuitry on-chip limited how small the feature size for the digital part of the silicon could be).

Obviously that was quite a while ago! But even at the time, I was amazed how much difference the simpler chip actually made to the size of the solution. The ARM would have been a total deal-breaker for that first project; it would just have been too big. I could certainly believe people are still programming for applications like that, where a modern CPU doesn't get a look in.


Probably right in the broader sense, but there are still niches. E.g., space deployments, where sufficiently hardened parts may lag decades behind the state of the art and the environment can require a careful balance of energy/heat against run-time.


It's still alive, but pushed down the layers. The OS kernel you sit on top of still cares about things like interrupt entry latency, which means that stack usage analysis and inlining management still have a home, etc. The Bluetooth radio and network stacks you're using likely have performance paths that force people to look at disassembly to understand them.

But it's true that outside the top-level "don't make dumb design decisions" decision points, application code in the embedded world is reasonably insulated from this kind of nonsense. But that's because the folks you're standing on did the work for you.


i just learned the other day that you can get a computer for 1.58¢ in quantity 20000: https://jlcpcb.com/partdetail/NyquestTech-NY8A051H/C5143390

if we can believe the datasheet, it's basically a pic12f clone (with 55 'powerful' instructions, most single-cycle) with 512 instructions of memory, a 4-level hardware stack, and 32 bytes of ram, with an internal 20 megahertz clock, 20 milliamps per pin at 5 volts, burning half a microamp in halt mode and 700 microamps at full speed at 3 volts

and it costs less than most discrete transistors. in fact, although that page is the sop-8 version, you can get it in a sot23-6 package too

there are definitely a lot of things you can do with this chip if you're willing to optimize your code. but you aren't going to start with a 30-kilobyte firmware image and optimize it until it fits

yeah it's not an nrf52840 and you probably can't do ble on it. but the ny8a051h costs 1.58¢, and an nrf52840 costs 245¢, 155 times as much, and only runs three times as fast on the kinds of things you'd mostly use the ny8a051h for. it does have a lot more than 155 times as much ram tho

for 11.83¢ you can get a ch32v003 https://www.lcsc.com/product-detail/Microcontroller-Units-MC... which is a 48 megahertz risc-v processor with 2 kilobytes of ram, 16 kilobytes of flash, a 10-bit 1.7 megahertz adc, and an on-chip op-amp. so for 5% of the cost of the nrf52840 you get 50% of the cpu speed, 1.6% of the ram, and 0% of the bluetooth

for 70¢, less than a third the price of the nrf52840, you can get an ice40ul-640 https://www.lcsc.com/product-detail/Programmable-Logic-Devic... which i'm pretty sure can do bluetooth. though it might be saner to hook it up to one of the microcontrollers mentioned above (or maybe something with a few more pins), you can probably fit olof kindgren's serv implementation of risc-v https://github.com/olofk/serv into about a third of it and probably get over a mips out of it. but the total amount of block ram is 7 kilobytes. the compensating virtue is that you have another 400 or so luts and flip-flops to do certain kinds of data processing a lot faster and more predictably than a cpu can. 19 billion bit operations per second and pin-to-pin latency of 9 nanoseconds

so my summary is that there's a lot of that kind of embedded work going on, maybe more than ever, and you can do things today that were impossible only a few years ago


Just to be pedantic: if it's a clone of the 8-bit PICs, then one instruction takes 4 clock cycles, so a 20MHz clock should be considered 5MHz if you're trying to compare operations per second.


that's a good point! i wondered about that, but i don't have the chip yet, so i checked the datasheet. the datasheet lists a cycle count for each instruction, and as i said, most instructions are 1 cycle

on the other hand, something like a 32-bit multiplication or a floating-point subtraction is going to cost a lot of instructions, if you can afford it at all


That was my way of thinking back when I was a junior programmer.


And now...?


After being burned waaay too many times by one of: 1) write-only code (for the sake of “speed”), or 2) optimization of the wrong piece of code:

I do think it is much better to prioritize readability, then measure where the code has to be sped up, and then make changes. But try HARD to first find a better algorithm; if that does not work, and more processor power or equipment is not viable (or still does not help), go for less readable, micro-optimized code.


They're a manager and send out daily emails reminding the coders of arbitrary deadlines.


Who's "they"?!



