Doesn't that risk cascading failures? A cluster of a few machines experiences a ...

scottlamb · on Feb 23, 2017

> A cluster of a few machines experiences a bunch of requests that trigger pathological memory usage. One machine OOMs, drops out. Now the rest of the cluster has to take more load, needs more memory, and increases the likelihood that the other machines also run out of memory.

A performance cliff (as you'd inevitably see while swapping) also puts you at risk of cascading failure. It might actually be better to completely drop out if the restart time is reasonably low. This is similar to GC thrashing with Java servers: many people prefer to configure their servers to suicide when GC time is over some threshold rather than try to go on as long as possible. I'm one of those people.

Better ways to avoid cascading failure are overprovisioning (RAM is pretty cheap for servers) and load shedding / graceful degradation at the application layer, coupled with care in client-side retry logic. (Avoiding accidental capacity caches, using exponential backoff on any retry.)