I think one of the big issues may be with high-performance multi-threaded code. x86 (I am including x64 in this designation) has a much stronger memory model than ARM. This has two implications. First, x86 is a lot more tolerant of data races and missing explicit memory fences. When you port server applications that have been running well on x86 to ARM, you may be in for some surprises as data races and missing fences now manifest as data corruption. The other implication is that on x86, the gap between a sequentially consistent memory order and a relaxed memory order is not that great, so many programmers use atomics with sequentially consistent ordering to reduce complexity. On x86, this will generally yield decent performance. On ARM, that gap is much bigger, and you are liable to have severe performance regressions.
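To make the first point concrete, here is a minimal sketch (mine, names invented, not from any particular codebase) of the kind of pattern that often appears to work on x86 but can break on ARM. x86 won't reorder the two plain stores, but ARM may, so the consumer can observe ready == 1 while payload is still 0. (Strictly speaking this is a data race in C, so even on x86 the compiler is allowed to break it, e.g. by hoisting the load of ready out of the loop.)

    #include <pthread.h>
    #include <stdio.h>

    static int payload;   /* plain, non-atomic */
    static int ready;     /* plain, non-atomic: racy flag */

    static void *producer(void *arg) {
        (void)arg;
        payload = 42;     /* store #1 */
        ready = 1;        /* store #2: ARM may make this visible first */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (!ready)    /* busy-wait; also racy per the C standard */
            ;
        printf("payload = %d\n", payload);   /* may print 0 on ARM */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }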
If I'm reading this correctly, you're basically implying that every serious QA department in a large company might benefit from buying some ARM hardware to run their test suites on, in order to reveal multithreading bugs and inadvertent data races in their code?
No, that's not quite right. There are things you can do safely in x86's memory model that are not portable to ARM. But they are completely well specified if your target is x86. I.e., the hypothetical QA team that buys ARM hardware may only expose portability issues rather than bugs.
What about Java applications? If the x86 memory model lets you get away with things that are not guaranteed by the Java memory model (as defined in the Java spec), there could be actual threading bugs that would be exposed by running on ARM.
Happily, I know nothing about Java's memory model.
I mostly had C in mind in my earlier comment. Yes, x86 formally guarantees patterns that are not guaranteed by the C standard. The key is that the behavior is implementation-defined, not undefined. Everyday C compilers targeting x86 have x86 memory model semantics.
So that's why I said (with C in mind): no, running on a different memory model wouldn't necessarily expose bugs in the x86-targeted code. Your program might not be portable to another implementation, but that isn't inherently a bug. (Especially given x86's near-omnipresence in the software space, from pretty low-end up to fairly high-end systems.)
In modern C (C11 and newer), you would prefer that developers use portable memory constructs such as atomic_store_explicit or atomic_load_explicit with explicit memory_order semantics. These are specified in §7.17.3 "Order and consistency." (The C17 publication of the same section might be clearer; I just wanted to illustrate that the C language has had portable constructs for this since the 2011 version.) Of course, it is possible that developers use more relaxed semantics than are actually (portably) valid, and it happens to work on x86, just like code that doesn't use the C11 memory model atomics at all.
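For example, here is roughly how the producer/consumer pattern from the earlier sketch looks when written against the C11 memory model, with a release store paired with an acquire load instead of relying on x86 ordering (again just an illustrative sketch):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int payload;
    static atomic_int ready;

    static void *producer(void *arg) {
        (void)arg;
        payload = 42;
        /* Release store: writes before it become visible to any thread
           that observes this store with an acquire load. */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        /* Acquire load: pairs with the release store above. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        printf("payload = %d\n", payload);   /* prints 42 on x86 and ARM */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

On x86 that release/acquire pair compiles to more or less plain loads and stores; on ARM the compiler emits the ordering instructions for you, which is the whole point: you only pay for ordering where you asked for it.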
(And the vast majority of developers should be using higher level constructs like mutexes or rwlocks or existing lock-free data structure libraries, such as ConcurrencyKit[1], instead of messing with complicated memory semantics. I suppose the same is true in Java land.)
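For completeness, the boring version of that advice: a mutex and no memory_order reasoning at all (sketch, obviously):

    #include <pthread.h>
    #include <stdio.h>

    static long counter;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&counter_lock);
            counter++;                       /* all ordering comes from the lock */
            pthread_mutex_unlock(&counter_lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* always 400000 */
        return 0;
    }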
In C land, there are starting to be nice tools, such as KTSAN and KCSAN, that detect these kinds of things explicitly rather than leaving you to observe memory corruption on ARM. I don't know if Java has anything similar.
Anyway, I don't know if any of that is useful to you. Sorry for the wall of text.
Do you happen to know of any studies or benchmarks that show how much of a difference in performance there is between strong and relaxed memory consistency models for real-world workloads? It's something that I've been curious about for a while.
One upside of code increasingly being written in predominantly single-threaded languages like Python/JS is that these issues do not matter as much.
In the case of Scylla, we run on the Seastar engine, which runs single-threaded per core because it is very "greedy." Hence the CPU being pegged at 100%. It wasn't thrashing. We just squeezed everything we could out of it.
We do parallelism across CPUs and nodes. We run single-threaded to get the most out of a CPU in a shared-nothing architecture. Many single-threaded apps aren't written to really take advantage of all a CPU has to offer. But there are also prices to pay for running multi-threaded: context switches, etc.
Not having (thread-level) parallelism encourages designs with small, well-defined interfaces at the points where data crosses between workers. In a typical NodeJS server setup with one load balancer and a number of node processes that don't know about each other, you will experience fewer concurrency bugs and care less about NUMA than in the equivalent ASP.NET app (requests scheduled onto a thread pool, code can share resources at will).
Of course sometimes "small interface" isn't really viable from a performance standpoint and you want a well-engineered multicore application with lots of shared data, and you just can't do that with JS or Python (or at least it's very hard). That's a good reason to choose a different language.