Ironically, here at Fly.io, we run containers (in single-use VMs) for our custom...

po_ta_toes · on March 22, 2022

Hi there,

Very interesting read!

I work for a large news org. The team I'm in primarily uses Elixir, which I know the people at Fly.io love too!

Why did you decide not to containerise your own infrastructure?

We use some 'chonky' EC2 instances but are thinking about using containers.

Given that the BEAM has quite a large footprint, do you think it's still a good candidate for containers, or would that introduce too much overhead?

tptacek · on March 23, 2022

We would containerize everything if we could! But our infrastructure components usually live outside the security boundaries our container interface sets up. The proxy has to be able to talk to every app, not just apps in its own organization, and it needs direct access to other infrastructure components. The orchestration code needs to be able to launch Firecracker VMs --- it can't itself be a Firecracker VM.

We can design around all these constraints! We just haven't gotten around to it yet. I think there's a general consensus that hiding things inside Firecrackers is a good design, and we'll do it where we can.

Elixir runs great inside Firecrackers. The idea behind Firecracker is that it introduces a minimal load, by dint of being ruthlessly simple. When you run an Elixir app on an EC2 instance, you're running hypervised as well (and probably by a hypervisor that's less efficient --- not a knock, Firecracker is Amazon code too).

The thing that makes this doable at Fly.io is that we run our own hardware, so we're not trying to nest VMs inside VMs. Of course, that's what EC2 is doing too. So whatever your intuitions are about things that run well on vanilla EC2, they should carry over to us.

freedomben · on March 23, 2022

What are you using for the host OS on your bare metal?

tptacek · on March 23, 2022

Linux. :)

freedomben · on March 23, 2022

heh, could you share if it's more of an rpm flavor or more of a deb flavor? :-) (or some third option. no false dichotomies intended. actually it would be badass if you're running Arch :-D)

tptacek · on March 24, 2022

At this moment I will go as far as to say that it is not a badass flavor of Linux.

:)

cpach · on March 22, 2022

Intriguing!

This makes me curious: How does one learn to design and build systems like this…?

Also: How do you folks at Fly decide what parts to use “as is” and what parts to build from scratch? Do you have any specific process for making those choices?

tptacek · on March 22, 2022

We had to build the orchestration stuff (it was originally a Nomad driver, but has outgrown that) because the tooling to run OCI containers as Firecracker VMs didn't exist in a deployable form when we started doing this stuff.

Most of the big CDNs seem to start with an existing traffic server like Nginx, Varnish, or ATS. One way to look at what we did with our "CDN" layer is that rather than building on top of something like Nginx, we built on top of Tokio and Hyper and its whole ecosystem. We have more control this way, and our routing needs are fussy.

By comparison, we use VictoriaMetrics and Elasticsearch (I don't know about "as-is" --- lots of tooling! --- but we don't muck with the cores of these packages), because our needs are straightforwardly addressed by what's already there.

Lots of companies doing stuff similar to what we're doing have elaborate SDN and "service mesh" layers that they built. We get away with the Linux kernel networking stack and a couple hundred lines of eBPF.

We definitely don't have a specific process for this stuff; it's much more an intuition, and is more about our constraints as a startup than about a coherent worldview.

illfit · on March 23, 2022

Google uses "Non-Abstract Large System Design (NALSD)" https://sre.google/workbook/non-abstract-design/ for this style of design.

The emphasis on a concrete design with concrete numbers can help identify the main scaling and reliability limitations, and put a cost on these. "Design X costs $A/year for Y scheduled fly.io tasks".

Building such a design relies on knowing fundamentals such as the performance characteristics of CPU/disk/network. "How many disks would it take to serve 50k QPS at 20ms, each time performing 1k of random disk I/O?"

Knowing this helps identify where in your stack you want flexibility, and why you'd want it.

"We log 100MB/s spread across 200 machines, which can be done with vanilla, mature Elasticsearch."

"We need to be able to perform container routing at 0.5ms overhead, and updates need to be atomic. eBPF can do this but existing solutions are immature. Since this is also our core competency, let's do this ourselves."
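To make the disk question above concrete, here is a rough back-of-envelope sketch. The per-disk IOPS figures are illustrative assumptions (not numbers from this thread or from any benchmark), and the function names are made up for the example. It ignores caching, replication, and headroom, so treat the result as a lower bound.

```python
import math

# All hardware figures below are illustrative assumptions, not measurements.
ASSUMED_HDD_RANDOM_IOPS = 150      # rough figure for one spinning disk
ASSUMED_SSD_RANDOM_IOPS = 50_000   # rough figure for one SSD


def disks_needed(qps, ios_per_query, iops_per_disk):
    """Disks required to sustain the random-I/O rate (no caching, no replicas)."""
    return math.ceil(qps * ios_per_query / iops_per_disk)


def requests_in_flight(qps, latency_seconds):
    """Little's law: average number of requests in progress at any moment."""
    return qps * latency_seconds


def read_bandwidth_mb_per_s(qps, kb_per_query):
    """Aggregate read bandwidth, using 1 MB = 1000 KB."""
    return qps * kb_per_query / 1000


qps = 50_000
print("HDDs needed:", disks_needed(qps, 1, ASSUMED_HDD_RANDOM_IOPS))
print("SSDs needed:", disks_needed(qps, 1, ASSUMED_SSD_RANDOM_IOPS))
print("Requests in flight at 20ms:", requests_in_flight(qps, 0.020))
print("Read bandwidth (MB/s):", read_bandwidth_mb_per_s(qps, 1))
```

Under these assumptions the workload is bound by random IOPS rather than bandwidth, which is the kind of conclusion this style of design is meant to surface early.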

freedomben · on March 23, 2022

I can't speak for fly.io, but as someone who architects a lot of systems, you learn by doing, failing, and doing again. If you're smart your failures will all be proofs-of-concept rather than real-world, but some amount of real-world failure is inevitable.

You must learn how to learn from mistakes, because the more typical human reaction to failure is to get emotional and try to rationalize the failure away (including blaming others). You need to become ruthlessly analytical. Good analytical intuition can only come from experience/failure so you can't rush it.

Start with simple, straightforward problems to solve. For example, design an architecture for a headless, stateless API server. Start with a single instance exposed directly to the internet, and then put a load balancer in front of it and scale it up to at least 3 instances (horizontal scaling).
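If it helps to see the core idea of the load balancer step, here is a minimal sketch of round-robin selection across three instances, skipping any that are marked unhealthy. The class name and backend addresses are hypothetical, and a real load balancer does far more (health checks, connection handling, TLS); this only shows the selection logic.

```python
class RoundRobinBalancer:
    """Hands out backends in rotation, skipping ones marked as down."""

    def __init__(self, backends):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("need at least one backend")
        self._healthy = set(self._backends)
        self._next = 0

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once, starting from the rotation pointer.
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends")


# Hypothetical instance addresses, for illustration only.
lb = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
for _ in range(4):
    print(lb.pick())
```

Because the API server is stateless, any instance can answer any request, which is what makes this simple rotation sufficient.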

Now add a relational database (like Postgres or MySQL). Now make the server stateful (only for practice; avoid stateful services like the plague in real life).

Now add a websocket to the server (which will need to go through the load balancer).

Now add a UI. Start with a few server-rendered pages, and then add a SPA. Try serving the SPA from the server app, and also try separating it into two distinct, independently deployable apps.

Now add some UDP to the app (adding WebRTC is an easy way to do this without having to write a lot of code).

Now add asynchronous jobs (sometimes called delayed jobs) just using the existing database.
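As a sketch of what "jobs in the existing database" can look like: a single table plus a claim step that only succeeds if the row is still queued. This uses Python's built-in `sqlite3` as a stand-in for Postgres or MySQL, and the table, column, and function names are all invented for the example. On Postgres, `SELECT ... FOR UPDATE SKIP LOCKED` is a common alternative for the claim step.

```python
import sqlite3


def init_schema(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'queued'
           )"""
    )


def enqueue(conn, payload):
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next(conn):
    """Claim the oldest queued job, or return None if there is nothing to do."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'queued' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard means a job another worker already took is not
        # claimed twice: the update then touches zero rows.
        cur = conn.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
            (row[0],),
        )
        if cur.rowcount != 1:
            return None
    return row


def finish(conn, job_id):
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))


conn = sqlite3.connect(":memory:")
init_schema(conn)
enqueue(conn, "send-welcome-email")
job = claim_next(conn)
if job is not None:
    finish(conn, job[0])
```

A worker process would call `claim_next` in a loop with a short sleep when it returns `None`. Polling like this is the main cost of the approach.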

Now add a queue system (like Redis) for the jobs to communicate through and scale up the number of job instances.
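The shape of that step (producers push jobs onto a queue, several workers pull from it) can be tried in a single process before bringing in a real broker. The sketch below uses Python's standard `queue.Queue` and threads purely as a stand-in for Redis and separate worker instances; `run_workers` and its parameters are made up for the example.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def run_workers(jobs, handler, num_workers=3):
    """Process jobs with a pool of workers; returns (results, failures)."""
    q = queue.Queue()
    results, failures = [], []
    lock = threading.Lock()

    def worker():
        while True:
            job = q.get()
            try:
                if job is _STOP:
                    return
                try:
                    out = handler(job)
                except Exception as exc:  # keep the worker alive on bad jobs
                    with lock:
                        failures.append((job, exc))
                else:
                    with lock:
                        results.append(out)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        q.put(job)
    for _ in threads:
        q.put(_STOP)  # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return results, failures


results, failures = run_workers(range(10), lambda n: n * n, num_workers=3)
print(sorted(results), failures)
```

Results arrive in whatever order the workers finish, which is also true of a real queue with multiple consumers, so anything that depends on ordering has to handle that explicitly.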

Now add logging and metrics collection. I recommend the EFK stack for logging and Prometheus/Grafana for metrics. At this point, if you haven't used containers or Kubernetes yet, it's probably a good time to repeat the whole process in k8s.

Now start adding microservices! Start with simple ones and get increasingly complex. Add a service that is never accessed directly by users, and put mTLS in front of it. Make sure that your services are aggregating logs to your logging solution and metrics are being collected.

Now add some services that use gRPC, protobufs, etc. You already have some non-HTTP in the form of UDP, but try adding some non-HTTP TCP based services. An SMTP server is a good challenge. You'll have to start getting creative!

To be well rounded, try using various products/approaches and see how they solve problems you have. It's good to do things "the hard way" (such as setting things up manually on VMs) to learn how they work, but IRL you won't want to do everything that way. Kubernetes is exploding for a reason. Make sure you learn some industry tools/solutions as well. I prefer open source to vendors, as the vendors often obscure all the learning (which can be good for a real company, but isn't good for your goal of learning).

As you get better and more experienced, many of these scenarios can be done entirely as thought experiments, although if you are doing it for real, you should try to build a PoC (proof of concept) for anything that hasn't been proven before.

Lastly, bounce ideas off of people with experience! Be prepared to have your ideas challenged, and be prepared to think critically about others' ideas and challenge them! In the real world you will rarely design it all yourself.

Pretty soon, you will be able to design and build complex systems. Also the Google SRE book(s) can be really helpful.

Holy cow this went a lot longer than I intended. I think I'll turn this into a blog post.