In the talk he glosses over what he sees as serious flaws in Erlang distribution...

toast0 · on Sept 11, 2019

From that thread:

> Finally, I've seen various reports that the practical size limit of a BEAM cluster is in the range of 50-100 nodes. The reason for this is that BEAM cluster establishes a fully connected mesh (each node maintains a TCP connection to all other nodes), so at some size this starts to cause problems. As far as I know, the OTP team is working to improve this, but as of OTP 22 it is still not done.
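To put rough numbers on the fully connected mesh described in the quote, here is a minimal sketch that counts connections for a few cluster sizes. The function name `mesh_connections` and the sizes are illustrative, not taken from OTP; the only assumption is one connection per pair of nodes.

```python
def mesh_connections(nodes: int) -> tuple[int, int]:
    """Return (connections held by each node, connections in the whole cluster)
    for a fully connected mesh with one connection per pair of nodes."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    per_node = nodes - 1              # every node talks to every other node
    total = nodes * (nodes - 1) // 2  # each pair is counted once
    return per_node, total


if __name__ == "__main__":
    # Sizes chosen to match the ranges mentioned in the thread (illustrative).
    for size in (50, 100, 1000, 2000):
        per_node, total = mesh_connections(size)
        print(f"{size:>5} nodes: {per_node:>5} per node, {total:>9} in total")
```

The per-node count grows linearly while the cluster-wide total grows quadratically, which is why the two views of "is this a problem" can differ.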

I've run clusters of 1-2k machines at my last job (maybe it was bigger, but I can't remember for sure). Holding a TCP connection to every other node is not a problem; we certainly had a lot more connected clients than connected servers, though tuning memory for buffers can be an issue on low-RAM systems. Global can get to be a problem. I'm not sure of the state in open source OTP, but if you have multiple nodes contending on the pg2 global lock for a group, it can get really slow; there are ways to make that better, but you do need to be careful not to introduce new deadlocks. If it's still using the simple method (try to lock everyone; if unsuccessful, unlock, wait a bit, and try again), that doesn't work well under significant contention, or when one or more nodes are unhealthy and running slowly but staying online.
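The "try to lock everyone, unlock on failure, wait a bit, retry" method can be sketched locally with threads. This is only an analogy using Python's `threading.Lock`, not how OTP's `global` is implemented; the function names, attempt limit, and wait time are all hypothetical.

```python
import random
import threading
import time


def try_lock_all(locks):
    """Try to take every lock without blocking.
    If any lock is busy, release the ones already taken and report failure."""
    taken = []
    for lock in locks:
        if lock.acquire(blocking=False):
            taken.append(lock)
        else:
            for held in reversed(taken):
                held.release()
            return False
    return True


def lock_all_with_retry(locks, max_attempts=10, base_wait_s=0.01):
    """Repeat try_lock_all, sleeping a short random time between attempts.
    Returns the attempt number that succeeded, or None if all attempts failed."""
    for attempt in range(1, max_attempts + 1):
        if try_lock_all(locks):
            return attempt
        time.sleep(random.uniform(0, base_wait_s))
    return None


def release_all(locks):
    """Release every lock taken by a successful lock_all_with_retry call."""
    for lock in locks:
        lock.release()
```

With this shape, a single slow or stuck holder makes every other caller burn attempts, and several callers retrying at once can keep knocking each other out, which matches the contention problem described above.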

The quality of network needed really depends on your tick timeouts and the amount of data you're transmitting. Dist will work with slow and lossy networks as long as it can get a ping transmitted often enough. I think the default tick time is 30 seconds, and four failed ticks cause a disconnect, so you really just need pings coming through once every two minutes, and for your OS not to give up on the TCP connection.
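The arithmetic behind "once every two minutes" can be written out as a small sketch. The 30-second tick and four-failed-ticks figures are the comment's recollection ("I think"), not verified defaults, so check the `net_ticktime` setting in the kernel documentation before relying on them. The function names are hypothetical.

```python
def max_ping_gap_seconds(tick_time_s: float, failed_ticks_before_disconnect: int) -> float:
    """Longest silence tolerated before the link is dropped, under the
    simple model from the comment: tick time multiplied by allowed failures."""
    if tick_time_s <= 0 or failed_ticks_before_disconnect < 1:
        raise ValueError("tick time must be positive and at least one tick allowed")
    return tick_time_s * failed_ticks_before_disconnect


def link_stays_up(ping_arrival_times_s, tick_time_s=30, failed_ticks_before_disconnect=4):
    """True if no gap between consecutive pings (counting from t=0)
    exceeds the tolerated silence. Defaults are the comment's figures."""
    limit = max_ping_gap_seconds(tick_time_s, failed_ticks_before_disconnect)
    previous = 0.0
    for arrival in sorted(ping_arrival_times_s):
        if arrival - previous > limit:
            return False
        previous = arrival
    return True
```

Under these assumed figures the tolerated gap is 30 × 4 = 120 seconds, so a lossy link that only delivers a ping every 100 to 120 seconds would still stay connected.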

It wouldn't work well for mobile, but between two reasonably connected datacenters, it should be fine. Anyway, dist should only be used between nodes at the same trust level: anything you can do on one node can be done from the other node, so consider it a bidirectional shell. I've debugged plenty of cases where an intermediate link was congested, resulting in very low throughput and message delays of tens of minutes on dist; it was still working ok, just anything synchronous would take forever.

olah_1 · on Sept 11, 2019

> First, in my opinion distributed BEAM is mostly intended to run on a network which is fast and more reliable (such as local network). While in theory it can also work on a less reliable/slower network (e.g. geographically dispersed machines connected via Internet), in practice you might experience more frequent netsplits which could cause various problems, such as worse performance or less consistency.

This is exactly why I wasn't excited about LiveView[1]. It felt like a step backwards in terms of human-centric design. Another tool that makes us consider our network bars first and our life second.

In general, I'm kind of disappointed that Elixir isn't leading the way on decentralized and offline-first technology, but I guess it's a limitation of BEAM running on small/low-powered devices?

[1]: https://github.com/phoenixframework/phoenix_live_view

brightball · on Sept 12, 2019

I guess the question is how low powered and small you want to go? Nerves seems to be flourishing.

olah_1 · on Sept 12, 2019

That's a good point. But does Nerves advance any kind of offline-first design? It still assumes a constant and high quality network connection, right?

brightball · on Sept 12, 2019

Nerves doesn’t make any assumptions about connectivity.

https://nerves-project.org/

rozap · on Sept 12, 2019

LiveView is not distributed erlang.

di4na · on Sept 12, 2019

This work tends to happen at a lower level, i.e. in Erlang. Check out Lasp (https://lasp-lang.readme.io/) and its Partisan distribution networking strategies.

delta1 · on Sept 12, 2019

How does LiveView have anything to do with network connectivity?

If you need a live server to provide you with updates, what difference does it make whether it's LiveView or some other JSON/WebSocket setup?

olah_1 · on Sept 12, 2019

LiveView makes basic functionality of websites only work if you have a constant, high quality connection to the server.

Even things like UI changes require the server connection.

elcritch · on Sept 13, 2019

Have you actually tried it on a cellphone though? LiveView works decently enough for me over my cell phone connection, even when using a VPN. I wouldn't want, say, a todo app made using LiveView, but for dashboards and other sites which require server data anyway it does fine. Oddly, I find LiveView can even feel subjectively faster than an SPA making a discrete request and then rendering the result.

olah_1 · on Sept 13, 2019

"but for dashboards and other sites which require server data anyways it does fine."

I understand that apps require data anyway, but I find offline-first design to be much more compelling for most apps.

Things shouldn't halt or break if your connection goes out or is spotty.

Maybe I'm wrong and the future will be everyone cooking in a giant microwave of internet waves :)

kungfooguru · on Sept 12, 2019

Yes, check out http://partisan.cloud/. Erlang distribution was not designed for the cloud.

davidw · on Sept 11, 2019

Note also the response from José Valim.