Ironically, here at Fly.io, we run containers (in single-use VMs) for our customers, but none of our own infrastructure is containerized --- though some of our customer-facing stuff, like the API server, is.
We have a big fleet of machines, mostly in two roles (smaller traffic-routing "edge" hosts that don't run customer VMs, and chonky "worker" hosts that do). All these hosts run `fly-proxy`, a Rust CDN-style proxy server we wrote, and `attache`, a Consul-to-sqlite mirroring server we built in Go. The workers also run our orchestration code, all in Go, and Firecracker (which is Rust). Workers and WireGuard gateways run a Go DNS server we wrote that syncs with Consul. All these machines are linked together in a WireGuard mesh managed in part by Consul.
The servers all link to our logging and metrics stack with Vector and Telegraf; our core metrics stack is another role of chonky machines running VictoriaMetrics.
We build our code with a Buildkite-based CI system and deploy with a mixture of per-project `ctl` scripts and `fcm`, our in-house Ansible-like. Built software generally gets staged on S3 and pulled by those tools.
Happy to answer any questions you have. I think we fit the bill of what you're asking about, even though if you read the label on our offering you'd get the opposite impression.
We would containerize everything if we could! But our infrastructure components usually live outside the security boundaries our container interface sets up. The proxy has to be able to talk to every app, not just apps in its own organization, and it needs direct access to other infrastructure components. The orchestration code needs to be able to launch Firecracker VMs --- it can't itself be a Firecracker VM.
We can design around all these constraints! We just haven't gotten around to it yet. I think there's a general consensus that hiding things inside Firecrackers is a good design, and we'll do it where we can.
Elixir runs great inside Firecrackers. The idea behind Firecracker is that it introduces a minimal load, by dint of being ruthlessly simple. When you run an Elixir app on an EC2 instance, you're running hypervised as well (and probably by a hypervisor that's less efficient --- not a knock, Firecracker is Amazon code too).
The thing that makes this doable at Fly.io is that we run our own hardware, so we're not trying to nest VMs inside VMs. Of course, that's what EC2 is doing too. So whatever your intuitions are about things that run well on vanilla EC2, they should carry over to us.
heh, could you share if it's more of an rpm flavor or more of a deb flavor? :-) (or some third option. no false dichotomies intended. actually it would be badass if you're running Arch :-D)
This makes me curious: How does one learn to design and build systems like this…?
Also: How do you folks at Fly decide what parts to use “as is” and what parts to build from scratch? Do you have any specific process for making those choices?
We had to build the orchestration stuff (it was originally a Nomad driver, but has outgrown that) because the tooling to run OCI containers as Firecracker VMs didn't exist in a deployable form when we started doing this stuff.
Most of the big CDNs seem to start with an existing traffic server like Nginx, Varnish, or ATS. One way to look at what we did with our "CDN" layer is that rather than building on top of something like Nginx, we built on top of Tokio and Hyper and its whole ecosystem. We have more control this way, and our routing needs are fussy.
By comparison, we use VictoriaMetrics and Elasticsearch (I don't know about "as-is" --- lots of tooling! --- but we don't muck with the cores of these packages), because our needs are straightforwardly addressed by what's already there.
Lots of companies doing stuff similar to what we're doing have elaborate SDN and "service mesh" layers that they built. We get away with the Linux kernel networking stack and a couple hundred lines of eBPF.
We definitely don't have a specific process for this stuff; it's much more an intuition, and is more about our constraints as a startup than about a coherent worldview.
The emphasis on a concrete design with concrete numbers can help identify the main scaling and reliability limitations, and put a cost on these. "Design X costs $A/year for Y scheduled fly.io tasks".
Building such a design relies on knowing fundamentals such as the performance characteristics of CPU/disk/network. "How many disks would it take to serve 50k QPS at 20ms, each time performing 1k of random disk I/O?"
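For what it's worth, here's a rough back-of-envelope sketch of that disk question in Python. The per-device IOPS figures are my own illustrative assumptions, not measurements, and `disks_needed` and `bandwidth_mb_s` are hypothetical helper names. It also ignores the 20ms latency target, replication, and headroom, all of which would push the real number up.

```python
import math


def disks_needed(qps: float, iops_per_disk: float) -> int:
    """Disks required if every query costs exactly one random I/O."""
    return math.ceil(qps / iops_per_disk)


def bandwidth_mb_s(qps: float, io_bytes: int) -> float:
    """Aggregate read bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return qps * io_bytes / 1e6


# ASSUMPTION: ballpark random-read IOPS per device, for illustration only.
ASSUMED_IOPS = {
    "spinning_disk": 150,
    "sata_ssd": 50_000,
}

if __name__ == "__main__":
    qps, io_bytes = 50_000, 1024
    print(f"bandwidth: {bandwidth_mb_s(qps, io_bytes):.1f} MB/s")
    for name, iops in ASSUMED_IOPS.items():
        print(f"{name}: {disks_needed(qps, iops)} disks")
```

Under those assumptions the total bandwidth is tiny (about 51 MB/s), so the answer is driven almost entirely by random IOPS: hundreds of spinning disks versus a single SSD before you add any safety margin.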
Knowing this helps identify where in your stack you want flexibility, and why you'd want it.
"We log 100MB/s spread across 200 machines, which can be done with vanilla, mature Elasticsearch."
"We need to be able to perform container routing at 0.5ms overhead, and updates need to be atomic. eBPF can do this but existing solutions are immature. Since this is also our core competency, let's do this ourselves."
I can't speak for fly.io, but as someone who architects a lot of systems, you learn by doing, failing, and doing again. If you're smart your failures will all be proofs-of-concept rather than real-world, but some amount of real-world failure is inevitable.
You must learn how to learn from mistakes, because the more typical human reaction to failure is to get emotional and try to rationalize the failure away (including blaming others). You need to become ruthlessly analytical. Good analytical intuition can only come from experience/failure so you can't rush it.
Start with simple, straightforward problems to solve. For example, design an architecture for a headless, stateless API server. Start with a single instance exposed directly to the internet, and then put a load balancer in front of it and scale it up to at least 3 instances (horizontal scaling).
Now add a relational database (like Postgres or MySQL). Now make the server stateful (only for practice; avoid stateful services like the plague in real life).
Now add a websocket to the server (which will need to go through the load balancer).
Now add a UI. Start with a few server-rendered pages, and then add a SPA. Try serving the SPA from the server app, and also try separating it into two distinct, independently deployable apps.
Now add some UDP to the app (adding WebRTC is an easy way to do this without having to write a lot of code).
Now add asynchronous jobs (sometimes called delayed jobs) just using the existing database.
Now add a queue system (like redis) for the jobs to communicate through and scale up the number of job instances.
Now add logging and metrics collection. I recommend EFK stack for logging and Prometheus/Grafana for metrics. At this point if you haven't used containers or kubernetes yet, it's probably a good time to repeat the whole process in k8s.
Now start adding microservices! Start with simple ones and get increasingly complex. Add a service that is never accessed directly by users, and put mTLS in front of it. Make sure that your services are aggregating logs to your logging solution and metrics are being collected.
Now add some services that use gRPC, protobufs, etc. You already have some non-HTTP in the form of UDP, but try adding some non-HTTP TCP based services. An SMTP server is a good challenge. You'll have to start getting creative!
To be well rounded, try using various products/approaches and see how they solve problems you have. It's good to do things "the hard way" (such as setting things up manually on VMs) to learn how they work, but IRL you won't want to do everything that way. Kubernetes is exploding for a reason. Make sure you learn some industry tools/solutions as well. I prefer open source to vendors, as the vendors often obscure all the learning (which can be good for a real company, but isn't good for your goal of learning).
As you get better and more experienced, many of these scenarios can be done entirely as thought experiments, although if you are doing it for real, you should try to build a PoC (proof of concept) for anything that hasn't been proven before.
Lastly, bounce ideas off of people with experience! Be prepared to have your ideas challenged, and be prepared to think critically about others' ideas and challenge them! In the real world you will rarely design it all yourself.
Pretty soon, you will be able to design and build complex systems. Also the Google SRE book(s) can be really helpful.
Holy cow this went a lot longer than I intended. I think I'll turn this into a blog post.