> Finally, I've seen various reports that the practical size limit of a BEAM cluster is in the range of 50-100 nodes. The reason for this is that BEAM cluster establishes a fully connected mesh (each node maintains a TCP connection to all other nodes), so at some size this starts to cause problems. As far as I know, the OTP team is working to improve this, but as of OTP 22 it is still not done.
I ran clusters of 1-2k machines at my last job (they may have been bigger, but I can't remember for sure). Holding a TCP connection to every other node is not a problem; we certainly had a lot more connected clients than connected servers, although tuning memory for buffers can be an issue on low-RAM systems. Global can become a problem. I'm not sure of the state in open source OTP, but if you have multiple nodes contending on the pg2 global lock for a group, it can get really slow. There are ways to make that better, but you do need to be careful not to introduce new deadlocks. If it's still using the simple method (try to lock everyone; if unsuccessful, unlock, wait a bit, and try again), that doesn't work well under significant contention, or if one or more nodes are unhealthy and running slowly but staying online.
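To make the "try to lock everyone, back off, retry" scheme concrete, here is a minimal Python sketch. It is a toy model, not the OTP `global` implementation: each `threading.Lock` stands in for one node's lock, and the function names and the back-off parameters are made up for illustration.

```python
import random
import threading
import time


def try_lock_all(locks):
    """Try to take every lock without blocking; on any failure, release what we took."""
    acquired = []
    for lock in locks:
        if lock.acquire(blocking=False):
            acquired.append(lock)
        else:
            # Someone else holds this one: undo our partial progress.
            for held in reversed(acquired):
                held.release()
            return False
    return True


def lock_all_with_backoff(locks, max_attempts=5, base_delay=0.01):
    """Retry try_lock_all, sleeping a random, growing delay between attempts."""
    for attempt in range(max_attempts):
        if try_lock_all(locks):
            return attempt + 1  # number of attempts it took
        time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
    return None  # gave up without ever holding all locks


def release_all(locks):
    """Release every lock taken by a successful lock_all_with_backoff call."""
    for lock in locks:
        lock.release()
```

Even in this toy version the weak spot shows up: one lock that stays held (a slow but online node) makes every contender acquire, release, and sleep over and over, and with many contenders they mostly just get in each other's way.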
The quality of network needed really depends on your tick timeouts and the amount of data you're transmitting. Dist will work with slow and lossy networks as long as it can get a ping transmitted often enough. I think the default net_ticktime is 60 seconds, with a tick sent every quarter of that, and a node that stays silent for roughly that long gets disconnected, so you really just need a tick coming through about once a minute, and for your OS not to give up on the TCP connection.
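For the timing itself, the kernel docs (as I read them) put the default `net_ticktime` at 60 seconds, send a tick every quarter of that, and detect an unresponsive node somewhere between `net_ticktime - net_ticktime/4` and `net_ticktime + net_ticktime/4`. A small Python sketch of that arithmetic, with a made-up function name:

```python
def tick_timing(net_ticktime=60.0):
    """Return (tick_interval, earliest_detection, latest_detection) in seconds.

    Assumes the documented rule: ticks every net_ticktime / 4, and a silent
    node is detected between 0.75 and 1.25 times net_ticktime.
    """
    tick_interval = net_ticktime / 4
    earliest = net_ticktime - tick_interval
    latest = net_ticktime + tick_interval
    return tick_interval, earliest, latest


if __name__ == "__main__":
    for seconds in (60.0, 120.0):
        print(seconds, tick_timing(seconds))
```

Raising `net_ticktime` on a flaky link widens the window proportionally, at the cost of noticing a genuinely dead node later.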
It wouldn't work well for mobile, but between two reasonably connected datacenters, it should be fine. Anyway, dist should only be used between nodes at the same trust level: anything you can do on one node can be done from the other node, so consider it a bidirectional shell. I've debugged plenty of cases where an intermediate link was congested, resulting in very low throughput and message delays of tens of minutes on dist; it was still working OK, just anything synchronous would take forever.
> First, in my opinion distributed BEAM is mostly intended to run on a network which is fast and more reliable (such as local network). While in theory it can also work on a less reliable/slower network (e.g. geographically dispersed machines connected via Internet), in practice you might experience more frequent netsplits which could cause various problems, such as worse performance or less consistency.
This is exactly why I wasn't excited about LiveView[1]. It felt like a step backwards in terms of human-centric design. Another tool that makes us consider our network bars first and our life second.
In general, I'm kind of disappointed that Elixir isn't leading the way on decentralized and offline-first technology, but I guess it's a limitation of BEAM running on small/low-powered devices?
Have you actually tried it on a cellphone though? LiveView works decently enough for me over my cell phone connection, even when using a VPN. I wouldn't want, say, a todo app made using LiveView, but for dashboards and other sites which require server data anyway, it does fine. Oddly, I find LiveView can even feel subjectively faster than an SPA making a discrete request and then rendering the result.
https://www.reddit.com/r/elixir/comments/bronlx/discover_wha...