Full Disclosure: I work for Red Hat in the Container and PaaS Practice in Consulting.
At Red Hat, we define an HA OpenShift/Kubernetes cluster as 3x3xN (3 masters, 3 infra nodes, 3 or more app nodes) [0], which means the API, etcd, the hosted local Container Registry, the Routers, and the App Nodes all provide (N-1)/2 fault tolerance, i.e. a tier of N members keeps working after losing up to (N-1)/2 of them.
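To make the (N-1)/2 figure concrete, here is a minimal sketch of the arithmetic. It assumes the formula is read with integer division and applied to a majority-quorum component such as etcd; the function names are made up for illustration, and stateless tiers like routers or app nodes may be limited by capacity rather than by quorum.

```python
def quorum(members: int) -> int:
    """Smallest strict majority of a group with `members` voters."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can be lost while a strict majority remains: (N-1)/2, rounded down."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return (members - 1) // 2


if __name__ == "__main__":
    # Even sizes buy no extra tolerance over the odd size below them.
    for n in (1, 2, 3, 4, 5, 7):
        print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Under this reading, 3 masters survive the loss of 1, and going to 4 adds nothing until you reach 5, which is one reason odd sizes are the usual choice.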
Not to brag, since we're well practiced at this, but I can get a 3x3x3 cluster up in a few hours. I've led customers through a basic 3x3x3 install (no hands on keyboard) in less than 2 days, and our consultants are able to install a cluster in 3-5 working days about 90% of the time, even with impediments like corporate proxies, wonky DNS or AD/LDAP, not-so-Enterprise Load Balancers, and disconnected installs. Making a cluster ready for production is about right-sizing and doing good testing.
Worth mentioning that my "got a cluster working in a month" time frame includes starting with zero Kubernetes experience, and no etcd ops experience. Using kops, pretty much anybody can get a full HA cluster running in about 15 minutes. On top of that, it's maybe 5 more minutes to deploy all the addons you'd expect for running production apps on a cloud-backed cluster.
The great thing about automation is that once you have these basic tools (Prom/Graf monitoring/alerting, ELK, node pool autoscaling, CI/CD) implemented as declarative manifests, they're deployable anywhere in minutes.
It would be good if the "Enterprise Load Balancer" were just another set of servers (with HAProxy + keepalived or something else; I love the "single IP" failover).
Edit: especially for load balancing the master servers. (That's actually the hard part of k8s, not setting it up, with or without OpenShift/Ansible/whatever.)
Load balancing services on k8s itself is basically just either running a Calico network with one or two HAProxy deployments of size 1 with an IP annotation, or just using https://github.com/kubernetes/contrib/tree/master/keepalived...
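For anyone who hasn't seen the "single IP" failover idea, here is a toy sketch of the election rule behind it: the healthy load balancer with the highest priority holds the shared virtual IP, and the next one takes over when it fails. This is only a model of the concept, not keepalived's actual VRRP implementation; the `LoadBalancer` class, the `vip_holder` function, and the node names and priorities are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LoadBalancer:
    name: str
    priority: int          # higher priority wins the virtual IP
    healthy: bool = True   # result of a (hypothetical) health check


def vip_holder(nodes: Iterable[LoadBalancer]) -> Optional[LoadBalancer]:
    """Return the healthy node with the highest priority, or None if all are down."""
    candidates = [n for n in nodes if n.healthy]
    return max(candidates, key=lambda n: n.priority, default=None)


# Two HAProxy boxes in front of the masters (illustrative values).
lbs = [LoadBalancer("lb1", priority=150), LoadBalancer("lb2", priority=100)]
before = vip_holder(lbs)

lbs[0].healthy = False     # lb1 fails its health check
after = vip_holder(lbs)

if __name__ == "__main__":
    print("VIP before failure:", before.name)
    print("VIP after failure: ", after.name)
```

Clients only ever talk to the one virtual IP, so the failover is invisible to them apart from any connections dropped during the switch.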
[0] http://v1.uncontained.io/playbooks/installation/#cluster-des...