Thank you <3. I apologize that the email was hazy on details. I can't un-send it, but I'll work with the product teams to make sure they are crystal clear in the future. I'd like to learn more about what you mean by outdated docs; the documentation I'm seeing appears to have been updated. Can you drop me a screenshot, maybe on Twitter? (Same username; DMs are open.)
These changes won't take effect until June - customers won't start getting billed immediately. I'm sorry that you feel trapped, that's not our intention.
> You should keep existing clusters in the pricing model they’ve been built in, and apply this change for clusters created after today.
This is great feedback, but clusters should be treated like cattle, not pets. I'd love to learn more about why your clusters must be so static.
> This is great feedback, but clusters should be treated like cattle, not pets. I'd love to learn more about why your clusters must be so static.
What’s inside our clusters are indeed cattle, but the clusters themselves do carry a lot of config that is set via GCP UI for trivial things like firewall rules. Of course we could script it and automate, but your CLI tool also changes fast enough that it becomes an ongoing maintenance burden shifted from DevOps to engineers to track. In other words, it will likely incur downtime due to unforeseen small issues.
It’s also in your interest that we don’t do this and that clusters stay as static as possible right now, because if we’re going to risk downtime and move clusters anyway, we’re definitely moving that cluster back to AWS.
Hmm - have you considered a tool like Terraform or Deployment manager for creating the clusters? In general, it's best practice to capture that configuration as code.
Managing clusters in terraform is not enough to "treat clusters like cattle". Changing a cluster from a zonal cluster to a regional cluster in the terraform configuration will, upon a terraform apply, first destroy the cluster then re-create the cluster. All workloads will be lost.
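To make that concrete, here's a hypothetical Terraform fragment (resource and cluster names invented); `location` on `google_container_cluster` is one of the fields where a change shows up in `terraform plan` as "must be replaced", i.e., destroy then create:

```hcl
# Sketch only -- names are illustrative.
resource "google_container_cluster" "main" {
  name     = "prod"
  location = "us-central1-a"   # zonal today...
  # location = "us-central1"   # ...switching to regional forces replacement:
  #                            # terraform destroys the cluster, then
  #                            # re-creates it, losing all workloads
  initial_node_count = 3
}
```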
I'm sure there are tools out there to help with cluster migrations, but it is far from trivial.
Cloud providers assume everyone is just like them, or like Netflix: Load balancers in clusters balancing groups of clusters. Clusters of clusters. Many regions. Many availability zones. Anything can be lost, because an individual data centre is just 5% of the total, right?
Meanwhile most of my large government customers have a couple of tiny VMs for every website. That's it. That's already massive overkill because they see 10% max load, so they're wasting money on the rest of the resources that are there only for redundancy. Taking things to the next level would be almost absurd, but turning things off unnecessarily is still an outage.
This is why I don't believe any of the Cloud providers are ready for enterprise customers. None of you get it.
I think you're wrong -- containers aren't ready for legacy enterprise, VMs are a better choice of abstraction for an initial move to cloud.
Get your data centers all running VMWare, then VMDK import to AWS AMIs, then wrap them all in autoscaling groups, figure out where the SPOFs have moved to, and only then start moving to containers.
In the meantime, all new development happens on serverless.
Don't let anything new connect to a legacy database directly, only via API at worst, or preferably via events.
I had a similar conversation with a government customer, saying that they should pool their web applications into a single shared Azure Web App Service Plan, because then instead of a bunch of small "basic" plans they could get a "premium" plan and save money.
They rejected it because it's "too complex to do internal chargebacks" in a shared cluster model.
This is what I mean: The cloud is for orgs with one main application, like Netflix. It's not ready for enterprises where the biggest concern is internal bureaucracy.
Why would one want lots of little GKE clusters, anyway? Google itself doesn't partition its clusters this way, AFAIK. I don't want a cluster of underutilized instances per application tier per project; I want a private Borg to run my instances on—a way to achieve the economies-of-scale of pod packing, with merely OS-policy-level isolation between my workloads, because they're all my workloads anyway.
(Or, really, I'd rather just run my workloads directly on some scale-free multitenant k8s cluster that resembled Borg itself—giving me something resembling a PaaS, but with my own custom resource controllers running in it. Y'know, the k8s equivalent to BigTable.)
We run lots of small clusters in our projects and identical infrastructure/projects for each of our environments.
Multiple clusters lets us easily firewall off communication to compute instances running in our account based on the allocated IP ranges for our various clusters (all our traffic is default-deny and has to be whitelisted). Multiple clusters lets us have a separate cluster for untrusted workloads that have no secrets/privileges/service accounts with access to gcloud.
Starting in June our monthly bill is going to go up by thousands. All regional clusters.
Namespaces handle most of these issues. A NetworkPolicy can prevent pods within a namespace from initiating or receiving connections from other namespaces, forcing all traffic through an egress gateway (which can have a well-known IP address, but you probably want mTLS, which the ingress gateway on the other side can validate; Istio automates this and I believe comes set up for free in GKE). Namespaces also isolate pods from the control plane; just run the pod with a service account that is missing the permissions that worry you, or prevent communication with the API server.
GKE has the ability to run pods with gVisor, which prevents the pod from communicating with the host kernel, even maliciously. (I think they call these sandboxes.)
The only reason to use multiple clusters is if you want CPU isolation without the drawbacks of cgroups limits (i.e., awful 99%-ile latency when an app is being throttled), or you suspect bugs in the Linux kernel, gVisor, or the CNI. (Remember that you're in the cloud, and someone can easily have a hypervisor 0-day, and then you have no isolation from untrusted workloads.)
Cluster-scoped (non-namespaced) resources are also a problem, though not too prevalent.
Overall, the biggest problem I see with using multiple clusters is that you end up wasting a lot of resources because you can't pack pods as efficiently.
For those that didn't click through, I believe the parent is demonstrating that it is a best practice to have many clusters for a variety of reasons such as: "Create one cluster per project to reduce the risk of project-level configurations"
For robust configuration, yes. However, one can certainly collapse/shrink if having multiple clusters becomes a burden cost-wise and operations-wise. This best practice was modeled on the most robust architecture.
It's probably not worth $75/month to prevent developer A's pod from interfering with developer B's pod due to an exploit in gVisor, the linux kernel, the hypervisor, or the CPU microcode. Those exploits do exist (remember Spectre and Meltdown), but probably aren't relevant to 99% of workloads.
Ultimately, all isolation has its limits. Traditional VMs suffer from hypervisor exploits. Dedicated machines suffer from network-level exploits (network card firmware bugs, ARP floods, malicious BGP "misconfigurations"), etc. You can spend an infinite amount of money while still not bringing the risk to zero, so you have to deploy your resources wisely.
Engineering is about balancing cost and benefit. It's not worth paying a team of CPU engineers to develop a new CPU for you because you're worried about Apache interfering with MySQL; the benefit is near-zero and the cost is astronomical. Similarly, it doesn't make sense to run the two applications in two separate Kubernetes clusters. It's going to cost you thousands of dollars a month in wasted CPUs sitting around, control plane costs, and management, while only protecting you against the very rare case of someone compromising Apache because they found a bug in MySQL that lets them escape the sandbox.
Meanwhile, people are sitting around writing IP whitelists for separate virtual machines because they haven't bothered to read the documentation for Istio or Linkerd, which they get for free and which actually add security, observability, and protection against misconfiguration.
Everyone on Hacker News is that 1% with an uncommon workload and an unlimited budget, but 99% of people are going to have a more enjoyable experience by just sharing a pool of machines and enforcing policy at the Kubernetes level.
It doesn't have to be malicious.
File descriptors aren't part of the isolation offered by cgroups; a misconfigured pod can exhaust FDs on the entire underlying node and severely impact all other pods running on that node.
Network isn't isolated either. You can saturate a node's network by downloading large amounts of data from, say, GCS/S3 and impact all pods on the node.
I agree with most of what you've said about gVisor providing sufficient security, but it's not just about security: noisy neighbors are a big issue in large clusters.
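The file-descriptor point above can be poked at from inside any process: `RLIMIT_NOFILE` is per-process, but the node also has a global fd table that cgroups don't partition between pods. A rough illustration (the leak here is tiny and cleaned up; a real leak just keeps walking toward the limits):

```python
import resource

# Per-process descriptor limits, typically inherited from the node/runtime.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"per-process fd limit: soft={soft}, hard={hard}")

# Each open file or socket consumes one descriptor from the same budget.
leaked = [open("/dev/null") for _ in range(100)]
print(f"descriptors currently held by the leak: {len(leaked)}")

for f in leaked:
    f.close()
```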
This is an interesting exchange if only for the thread developing instead of a single reply from a rep; it’s nice to see that level of engagement.
More importantly, this dialogue speaks volumes to Google’s stubbornness. Seth’s/Google’s position is: do it the Google way, sorry-not-sorry to all those that don’t fit into our model.
Like we haven’t heard of infrastructure as code? That can’t paper over basics like being unable to change your K8s cluster control plane. This is precisely the attitude that lands GCP as a distant #3 behind AWS and Azure.
Even still, it’s not like it’s trivial to just bring up and drop clusters. Just setting up peering with Cloud SQL or HTTPS certs with GKE ingress can be fraught with timing issues that can torpedo the whole provisioning process.
How is this helpful? Are they supposed to implement everything in Terraform or similar? Is that your suggestion? Why don't you completely remove the editable UI then, if whoever uses it is doing it wrong? What a typically arrogant, out-of-touch-with-customers Google response.
This could be totally unrelated, but having an "equivalent Terraform" option alongside the REST and CLI options could dramatically speed up the configuration process.
Right now, k8s is definitely "ongoing maintenance". We allocate around 0.5-2 points per week just for that. If we didn't, most of our stack would already be outdated.
I already know too many people who are stuck on a certain k8s version. Don't let that happen!
> clusters should be treated like cattle, not pets.
Off-topic, but is this really how people do k8s these days? Years ago when I was at Google, each physical datacenter had at most several "clusters", which would have fifty thousand cores and run every job from every team. A single k8s cluster is already a task management system (with a lot of complexity), so what do people gain by having many clusters, other than more complexity?
The most common thing I've heard is "blast radius reduction", i.e., the general public is not yet smart enough to run large shared infrastructures. That seems like something that should be obviously true.
People had exactly the same experiences with Mesos and OpenStack, but k8s has decent tooling for turning up many clusters, so there is an easy workaround.
I still feel like that would only work in very niche cases.
I mean, if people aren't smart enough to run a large shared infrastructure, how can I trust them to run a large number of shared clusters, even if each cluster is small. The final scale is still the same.
And no SRE would allow you to run your application in a single cluster. Borg Cells were federated but not codependent - Google's biggest outages were due to the few components that did not sufficiently isolate clusters from one another.
Clusters are probably still pets to most orgs, but the lessons about how to manage complexity still apply. Each of my terraform state files is a pet and I treat it like such... but I also use change-control to assure that even though I don't regularly recreate it from scratch, I understand all that was there.
There are potentially quite a few benefits of being able to spin up clusters on demand [1]:
* Fully reproducible cluster builds and deployments.
* The type of cluster (can be) an implementation detail, making it easy to move between e.g. Minikube, Kops, EKS, etc. After all, K8s is just a runtime.
* Developers can create temporary dev environments or replicas of other clusters
* Promote code through multiple environments from local Minikube clusters to cloud environments
* Version your applications and dependent infrastructure code together
* Simplify upgrades by launching a brand new cluster, migrating traffic and tearing the old one down (blue/green)
* Test in-place upgrades by launching a replica of an existing cluster to test the upgrade before repeating it in production
* Increase agility by making it easier to rearchitect your systems - if you have a pet, modifying the overall architecture can be painful
* Frequently test your disaster recovery processes as a by-product for no extra effort (sans data)
I think for one, you cannot easily have Masters span regions without risk of them falling out of communication. Similarly the workers should be located nearby. If there's a counterexample to this I'd love to see it.
> clusters should be treated like cattle, not pets
Heh... how many teams actually treat their clusters like cattle, though? Every time I advocate automation around cluster management, people start complaining that "you don't have to do that anymore, we have Kubernetes!"
Some people get it, yes, but even of that group, few have the political will/strength to make sure that automation is set up on the cluster level—especially to a point where you could migrate running production workloads between clusters without a potentially large outage / maintenance window.
For any real production system you have to use terraform and their ilk to manage clusters, as you need to be spinning up and down dev/qa/prod clusters.
I don't know GCP though. In the past I've seen kube cluster archs which are very very fragile as they spin up. If that's the case with GCP I can see why you wouldn't do the above and rather hand hold their creation.
I would love if this happened in the real world, but for every well-architected automated cluster management setup I’ve seen using Terraform, Ansible, or even shell scripts and bubble gum, there are five that were hand-configured in the console and poorly (or not at all) documented, and might not be able to re-create without a substantial multi-day effort.
> but clusters should be treated like cattle, not pets
Ha. They should, but they are absolutely not. Customers typically ask "why should we spend time on automating cluster deployment when we are going to do it just once?" and when I explain that it's for when the cluster goes away, if it goes away, they say it's an acceptable risk.
The truth of the matter is, even in some huge international companies, they don't have the resources to keep up with development of tools to have completely phoenix servers. They just want to write automation and have it work for the next 10 years, and that's definitely not the case.
> This is great feedback, but clusters should be treated like cattle, not pets. I'd love to learn more about why your clusters must be so static.
Clusters often are not "cattle". If your operation is big enough, then yes, they might be. Usually they aren't; they are named systems and represent a mostly static entity, even if the components of that entity change every hour.
Personally, I'm running in production a cluster that by now has witnessed upgrades from 1.3 to 1.15 in GKE, with some deployments running nearly unchanged since then.
Treating it as cattle makes no sense, especially since on API level, the clusters aren't volatile elements.
I think our architects’ heads would explode if they were told we should treat them like cattle.
For us, clusters are a promise to our developers. We can’t just spin up a new cluster because we feel like it. I must be missing something, or maybe our culture is just different.
As an architect, I am currently working on our org's first cloud deployment initiative. Due to federal compliance / regulations, we have no write access in higher / production boundaries, and everything is automated via deployment jobs, IaC, etc. Given the experience of the teams involved, I took the opportunity (burden) of writing nearly all the automation. If your architects can't handle shooting sick cattle in prod, I'd say get new architects.
> If your architects can't handle shooting sick cattle in prod, I'd say get new architects.
For every 1 competent person who can develop a solution to fully automate everything, there are 99 others who can automate most of that, maybe minus a cluster or a DB or two, and another 500 who cannot do either but can run a CentOS box at a reasonable service level.
You experience using great tools and your vast knowledge of k8s each day to do all that, and you have the support of your org, but those other folks may not have the tools, support, knowledge, or sometimes even the capability to attain the knowledge to do that. That doesn't mean they're useless to anyone, to be cast off at will!
The type of thinking that leads to, "get new engineers/developers/designers/architects if yours aren't perfect" needs to die, and needs to be replaced with, "let's do what we can to train and support our current employees to do a great job" because, frankly, there aren't enough "superstar" people who have your skills to do that at every org.
We need to work on accepting people for who they are-- helping them to strive to be a bit better each day of course-- and utilizing those skills in the right place, rather than trying to make everyone the same person with the same skills doing the same things.
Some applications don't need clusters which can be rebuilt and destroyed at will, so let's not make that the bar for every project.
> I'm sorry that you feel trapped, that's not our intention.
Please don't do this.
You can apologize for your actions and work to improve in the future, but you cannot apologize for how someone feels as a result of your actions.
Also, intent doesn't matter unless you plan to change your behavior to undo or mitigate the unintended result.
> clusters should be treated like cattle, not pets
So my understanding is that the official k8s way to upgrade your cluster is also to throw it away and start a new one (with some cloud-provider proprietary alternatives).
Let's say there is something actually important, stateful, single-source-of-truth in my k8s cluster, like a relational DB that must not lose data. I don't want downtime for readers or writers, and I want at least one synchronous slave at all times (since the data is important). I also don't want to eat non-trivial latency overheads from setting up layers of indirection.
In this case, one needs fault tolerance. One way to achieve it is through replication, where an extra copy (or perhaps a reconstruction recipe) of your DB instance runs somewhere else. Usually DBs achieve this through transactions. Additionally, you can have distributed DBs, which then use distributed transactions to achieve it.
I am no expert, but k8s handles task replication and will either spawn a new task instance or route a request to another instance somewhere else. However, the application logic itself must handle fault tolerance (by managing its state through transactions or something else) should an instance fail. K8s doesn't do that for you.
Distributed transactions are a non-starter – I already said I want to run master-slave with synchronous replication, which is basically what you want to do in >99% of cases where you have a DB with important stuff in it.
It is not really about what you want, but about how to migrate the DB* to another location while also minimizing its downtime/slowness.
You need to instantiate a secondary DB replica somewhere else and start the DB migration. Since there will be "two instances" of the same DB running, you will also need to set up a (temporary) proxy for routing and handling the DB requests, with something like this:
1) If the data being requested has already been migrated, the request is handled by the (new) secondary replica.
2) Otherwise, the primary instance handles the request.
2.1) The requested data should then be migrated to the secondary replica (asynchronously, but note that a repeated request may invalidate a migration).
3) Turn the proxy router off once the whole state of your primary DB instance is fully migrated, making the secondary replica the primary one.
That's really just a napkin recipe for completing a live migration, though.
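The napkin recipe above can be sketched in a few lines of Python. Everything here is hypothetical (the "databases" are plain dicts standing in for real instances), and it sidesteps all the hard concurrency issues the footnote mentions, but it shows the routing rules:

```python
class MigrationProxy:
    """Toy router for a live DB migration: primary -> secondary replica."""

    def __init__(self, primary, replica):
        self.primary = primary    # old DB (stand-in: a dict)
        self.replica = replica    # new DB being populated
        self.migrated = set()     # keys already copied over

    def _migrate_key(self, key):
        if key in self.primary:
            self.replica[key] = self.primary[key]
        self.migrated.add(key)

    def read(self, key):
        # 1) Already migrated -> serve from the new replica.
        if key in self.migrated:
            return self.replica.get(key)
        # 2) Otherwise the primary handles it; 2.1) migrate as a side effect.
        value = self.primary.get(key)
        self._migrate_key(key)
        return value

    def write(self, key, value):
        # A write to an already-migrated key must land on the replica,
        # or a stale primary copy would invalidate the earlier migration.
        if key in self.migrated:
            self.replica[key] = value
        else:
            self.primary[key] = value
            self._migrate_key(key)

    def done(self):
        # 3) Safe to cut over once every primary key has been migrated.
        return set(self.primary) <= self.migrated
```

In a real migration each branch would be a network call and the `migrated` set would itself need durable, race-free bookkeeping, which is exactly where the distributed-transaction pain comes in.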
* We are now getting into distributed-transaction territory, because you can never be 100% sure that writes to 2 databases will succeed or fail at the same time. There is a talk from someone who deals with a similar problem: http://www.aviransplace.com/2015/12/15/safe-database-migrati...
Thanks for replying to feedback Seth. Stuff like this - following the Google Maps API massive pricing increase, the G Suite pricing increase - is what makes me wary about building stuff on GCP: I'm afraid that Google will increase prices for stuff I rely on. AWS has made users expect pricing for cloud services to only go down.
"In the old way of doing things, we treat our servers like pets, for example Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line."
This is a pretty good in-depth explanation, but at a high level: if your server dies and you are extremely upset about it (similar to if your pet died), you are putting too many eggs in that single basket, with no secondary plan. Conversely, if you build your infra such that a server dying is no worse to you than how a farmer sees one of his cattle dying (they are raised to be killed), you are much better prepared for the inevitable downtime and can recover easily.