I think one of the big issues may be with high-performance multi-threaded code. x86 (I am including x64 in this designation) has a much stronger memory model than ARM. This has two implications. First, x86 is a lot more tolerant of data races and missing explicit memory fences. When you port server applications that have been running well on x86 to ARM, you may be in for some surprises as data races and missing fences now manifest as data corruption. The other implication is that on x86, the gap between a sequentially consistent memory order and a relaxed memory order is not that great, so many programmers use atomics with sequentially consistent ordering to reduce complexity. On x86, this will generally yield decent performance. On ARM, that gap is much bigger, and you are liable to have severe performance regressions.
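To make the first point concrete, here is a minimal sketch (mine, names invented, not from any particular codebase) of the kind of pattern that often appears to work on x86 but can break on ARM. x86 won't reorder the two plain stores, but ARM may, so the consumer can observe ready == 1 while payload is still 0. (Strictly speaking this is a data race in C, so even on x86 the compiler is allowed to break it, e.g. by hoisting the load of ready out of the loop.)

    #include <pthread.h>
    #include <stdio.h>

    static int payload;   /* plain, non-atomic */
    static int ready;     /* plain, non-atomic: racy flag */

    static void *producer(void *arg) {
        (void)arg;
        payload = 42;     /* store #1 */
        ready = 1;        /* store #2: ARM may make this visible first */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (!ready)    /* busy-wait; also racy per the C standard */
            ;
        printf("payload = %d\n", payload);   /* may print 0 on ARM */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }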
If I'm reading this correctly, you're basically implying that every serious QA department in a large company might benefit from buying some ARM hardware to run their test suites on, in order to reveal multithreading bugs and inadvertent data races in their code?
No, that's not quite right. There are things you can do safely in x86's memory model that are not portable to ARM. But they are completely well specified if your target is x86. I.e., the hypothetical QA team that buys ARM hardware may only expose portability issues rather than bugs.
What about Java applications? If the x86 memory model lets you get away with things that are not guaranteed by the Java memory model (as defined in the Java spec), there could be actual threading bugs that would be exposed by running on ARM.
Happily, I know nothing about Java's memory model.
I mostly had C in mind in my earlier comment. Yes, x86 formally guarantees patterns that are not guaranteed by the C standard. The key is that the behavior is implementation-defined, not undefined. Everyday C compilers targeting x86 have x86 memory model semantics.
So that's why I said (with C in mind): no, running on a different memory model wouldn't necessarily expose bugs in the x86-targeted code. Your program might not be portable to another implementation, but that isn't inherently a bug. (Especially given x86's near-omnipresence in the software space, from pretty low-end up to fairly high-end systems.)
In modern C (C11 and newer), you would prefer that developers use portable memory constructs such as atomic_store_explicit or atomic_load_explicit with explicit memory_order semantics. These are specified in §7.17.3 "Order and consistency." (The C17 publication of the same section might be clearer; I just wanted to illustrate that the C language has had portable constructs for this since the 2011 version.) Of course, it is possible that developers use more relaxed semantics than are actually (portably) valid, and it happens to work on x86, just like code that doesn't use the C11 memory model atomics at all.
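For example, here is roughly how the producer/consumer pattern from the earlier sketch looks when written against the C11 memory model, with a release store paired with an acquire load instead of relying on x86 ordering (again just an illustrative sketch):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int payload;
    static atomic_int ready;

    static void *producer(void *arg) {
        (void)arg;
        payload = 42;
        /* Release store: writes before it become visible to any thread
           that observes this store with an acquire load. */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        /* Acquire load: pairs with the release store above. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        printf("payload = %d\n", payload);   /* prints 42 on x86 and ARM */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

On x86 that release/acquire pair compiles to more or less plain loads and stores; on ARM the compiler emits the ordering instructions for you, which is the whole point: you only pay for ordering where you asked for it.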
(And the vast majority of developers should be using higher level constructs like mutexes or rwlocks or existing lock-free data structure libraries, such as ConcurrencyKit[1], instead of messing with complicated memory semantics. I suppose the same is true in Java land.)
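For completeness, the boring version of that advice: a mutex and no memory_order reasoning at all (sketch, obviously):

    #include <pthread.h>
    #include <stdio.h>

    static long counter;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&counter_lock);
            counter++;                       /* all ordering comes from the lock */
            pthread_mutex_unlock(&counter_lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* always 400000 */
        return 0;
    }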
In C land, there are starting to be nice tools, such as KTSAN and KCSAN, that detect these kinds of things explicitly rather than leaving you to observe memory corruption on ARM. I don't know if Java has anything similar.
Anyway, I don't know if any of that is useful to you. Sorry for the wall of text.
Do you happen to know of any studies or benchmarks that show how much of a difference in performance there is between strong and relaxed memory consistency models for real-world workloads? It's something that I've been curious about for a while.
One upside of code increasingly being written in predominantly single-threaded languages like Python/JS is that these issues do not matter as much.
In the case of Scylla, we run on the Seastar engine, which runs single-threaded per core because it is very "greedy." Hence the CPU being pegged at 100%. It wasn't thrashing. We just squeezed everything we could out of it.
We do parallelism across CPUs and nodes. We run single-threaded to get the most out of a CPU in a shared-nothing architecture. Many single-threaded apps aren't written to really take advantage of all a CPU has to offer. But there are also prices to pay for running multi-threaded: context switches, etc.
Not having (thread-level) parallelism encourages designs with small, well-defined interfaces at the points where data crosses between workers. In a typical NodeJS server setup with one load balancer and a number of node processes that don't know about each other, you will experience fewer concurrency bugs and care less about NUMA than in the equivalent ASP.NET app (requests scheduled onto a thread pool, code can share resources at will).
Of course sometimes "small interface" isn't really viable from a performance standpoint and you want a well-engineered multicore application with lots of shared data, and you just can't do that with JS or Python (or at least it's very hard). That's a good reason to choose a different language.