It's a huge pet peeve of mine, but why do people present network maps of these huge distributed things featuring ONE SINGLE load balancer in front of it all!?
DNS-level solutions are oftentimes misunderstood, probably because there are a couple of things that can be done at this level:
1. "Geo-DNS" is about using an anycast network to direct users to their nearest datacenter(s). This does _not_ aid High Availability at all.
2. DNS Round Robin is about distributing the load between multiple IPs. As a load balancing solution it is relatively poor, because you have no control over the actual balancing and can end up receiving most users through a single IP.
3. DNS Failover solutions replace the IP when a server goes down. This is also a poor solution, because of TTL and non-TTL browser caches.
4. DNS Round Robin but for High Availability, not for the balancing. This is actually an interesting approach, because most modern browsers automatically switch to (one of) the other record(s) when the IP they were using goes down (sorry, I have no reference that clearly states which browsers do and which don't exhibit this behavior). In fact, there are some sources around [1] that seem to identify this approach as the only one that achieves instant failover in the face of datacenter-wide outages.
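A rough sketch of the client-side behavior approach #4 relies on: try each A record in turn, falling through to the next on a timeout or refusal. The function name and timeout are my own illustration, not anything a browser actually exposes.

```python
import socket

def connect_with_fallback(addresses, port, timeout=5.0):
    """Try each A record in order; return a socket to the first host that answers."""
    last_error = None
    for ip in addresses:
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError as exc:      # timed out, refused, unreachable...
            last_error = exc        # remember it and fall through to the next record
    raise ConnectionError("all %d records failed" % len(addresses)) from last_error
```

Note that a host silently dropping packets costs the full timeout before the loop moves on, whereas a refused connection fails over almost instantly.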
I have experience using approach #4 for global high availability, and the main downside is that most network outages result in packet loss rather than returning an immediate ICMP error, so browsers will hang for about 30 seconds before timing out and trying the next A record.
My personal preference is to combine approaches #3 and #4: have all hosts in the round robin with a lowish TTL, but automatically remove any host that goes down.
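A minimal sketch of that combination: publish, with a low TTL, every host that passes a health check. The `check` callable and the DNS-update step it would feed are placeholders, not a real provider API.

```python
def healthy_record_set(hosts, check):
    """Return the A records to publish: all hosts passing the health check."""
    up = [ip for ip in hosts if check(ip)]
    # Never publish an empty set: if every check fails, the checker itself
    # is more likely broken than every datacenter at once.
    return up if up else hosts
```

The non-empty guard is a deliberate design choice: removing all records would take the service down harder than leaving a possibly dead one in place.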
I'm using two HAProxy machines and keepalived to ensure that if one goes down, the other takes over.
Because HTTP in general, and with it HAProxy, is largely stateless, this is really easy and safe to do. I've had the opportunity to fail over multiple times already (mainly for software updates, once because of a hardware issue) and I never had any problems with the setup.
This sounds like a split-brain disaster waiting to happen, when a network partition between the two load balancers makes both think that the other one is down.
I'm no expert, but I think the main idea is to have an odd number of nodes plus a consensus protocol like Paxos or Raft to figure out who's actually alive.
If a node finds itself in a split minority, it won't become a master.
This doesn't work when you have only two nodes: after the split there's no majority, both sides are equal.
Building this from scratch is extremely difficult and error-prone, so people usually use systems like ZooKeeper.
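The quorum rule above can be sketched in a few lines: a node counts itself plus the peers it can reach, and only claims mastership with a strict majority of the full cluster. This is an illustration of the rule, not a replacement for Paxos/Raft or ZooKeeper.

```python
def may_be_master(reachable_peers, cluster_size):
    """True if this node plus its reachable peers form a strict majority."""
    # Strict majority: more than half of all nodes, counting this one.
    return (reachable_peers + 1) * 2 > cluster_size
```

With `cluster_size=2`, a 1/1 split yields `may_be_master(0, 2) == False` on both sides, which is exactly the two-node problem mentioned above; with three nodes, one side of any split still reaches a majority.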
Your first option: a VIP with a heartbeat failover setup. DNS is useful when you have multiple endpoints and need to push more bandwidth than a single LB/box can handle. But for anything under ~5 Gbit/sec, the active/hot-standby setup works great.
The hot-standby approach, using VRRP to manage a floating address between the two boxes, does work and fails over quickly (unnoticeably to most clients), but getting it set up correctly is a project.
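A toy model of the VRRP-style election keepalived performs under the hood: among nodes whose heartbeats are still fresh, the highest priority claims the floating address. The node names, priorities, and staleness threshold are made-up illustration values.

```python
def vip_holder(nodes, now, dead_after=3.0):
    """nodes maps name -> (priority, last_heartbeat_time); return the VIP owner."""
    # Drop any node whose last heartbeat is older than the dead interval.
    alive = {name: prio for name, (prio, seen) in nodes.items()
             if now - seen < dead_after}
    # Highest-priority survivor owns the address; None means total outage.
    return max(alive, key=alive.get) if alive else None
```

In real keepalived the same idea is expressed declaratively via `priority` in the VRRP instance config, and failover happens when advertisements stop arriving.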
Because everyone loves a Single Point of Failure? /s
It is probably because they don't provide the failover and assume you have a standby LB and an active LB. So only one needs to be shown, and you handle the failover on the application side rather than on their side. At least, that's the context in which I generally see that sort of thing.
Because while the load balancer shown might (and often, in practice, would) actually be a multi-component system of its own, that's a concern outside the focus of the diagram that distracts from what it is attempting to illustrate.
I understand that a database cluster is a hard problem, and even harder to make easy to use, but this seems really complicated to set up. I'm not a fan of needing "transaction managers", "coordinators" and data nodes; I would like these to all run on the same server.
Of course, Postgres-XL may not exactly be designed for my use case, where I just want to be able to have one node fail or take a server down to do upgrades. Ideally it would be as simple as the built-in PostgreSQL replication has become.
It may be that coordinators and datanodes, which are already the same binary, will eventually become one and the same process. At the moment that is not the case.
The Global Transaction Manager here ensures that transaction ids and snapshots (the context in which statements are run: the other running transactions) are exactly the same across the entire cluster. Without it, there are theoretical possibilities of inconsistencies based on the timing of concurrently running transactions and on when connections to other data nodes are taken.
The GTM ensures cluster-wide Multi-version Concurrency Control whereby readers do not block writers and writers do not block readers and ACID properties are respected globally.
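Very roughly, the visibility test that the GTM makes globally consistent looks like this: a row version is visible to a snapshot only if the transaction that created it started before the snapshot's horizon and was not still running when the snapshot was taken. Field names here are illustrative, not Postgres internals verbatim.

```python
def visible(row_xmin, snapshot_xmax, running_xids):
    """True if the row's creating transaction is visible to this snapshot."""
    if row_xmin >= snapshot_xmax:   # transaction started after the snapshot
        return False
    if row_xmin in running_xids:    # still in flight when the snapshot was taken
        return False
    return True
```

Because the GTM hands every node the same `snapshot_xmax` and `running_xids`, every node answers this question identically, which is what makes reads consistent cluster-wide without blocking writers.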
As for taking a server down for upgrades, well, that is beyond the scope of the project. There have been people looking at XL for simple load balancing and HA, we should perhaps consider making it easier for this use case, but I think some use pgpool for that.
Postgres-R! I knew about Postgres-R, but I thought it was a research project which had never really gone anywhere. I am highly intrigued to learn that it is still in play. From what I remember, there were some terrifically clever ideas about multi-master databases at the heart of it. I don't think there's anything else quite like it in the open source world.
Postgres-R bits have not made their way into Postgres-XL, but some parts of Postgres-XC have. Some XC code has been merged in, even parts that are not used, to make future merges into Postgres-XL easier.
Some parts of Postgres-R are in the TransLattice Elastic Database (TED), but that is a proprietary closed source product, sorry. (I work for TransLattice, who also open sourced Postgres-XL.)
Ah, I see. It's a shame that no descendant of Postgres-R is open source, but I'm still pleased to hear that there is one, even if it's proprietary. Thanks for the information.
Oh, I had thought the link was about Postgres-XC. So it's a different project.
"The project includes architects and developers who previously worked on both Postgres-XC and Stado, and Postgres-XL contains code from Postgres-XC. The Postgres-XL project has its own philosophy and approach."
So should I see a brighter future for Postgres-XL rather than Postgres-XC? It seems that the Postgres-XC repository is still getting some commits, though the rate is quite low.
From what I can see, Postgres-XC is the original project but Postgres-XL is sort of a 'stable fork'.
It's very murky and, to be honest, very frustrating. I just want a Postgres that scales and doesn't cost me $$$ like FoundationDB does. The Postgres-XC team is far too insular and doesn't seem to care about outside users. I hope Postgres-XL can inspire them to push ahead, and then hopefully we can see a Postgres 9.4-based clustering solution for write scalability. (And no, Pgpool-2 doesn't count.)
I'm completely new to both Postgres-XC and Postgres-XL, and just started looking at the documentation. It seems that the main structures of the two are similar, except some features implemented in Postgres-XL like allowing node-to-node data transfer.
It is natural, because they've just forked the project. But I think if there are some gotchas in the original project, Postgres-XL would also expose similar problems. What have been your main concerns using Postgres-XC? Any design issues?
The planner and executor in Postgres-XL are quite different, and direct node-to-node data transfer for MPP parallelism results in big performance gains.
Any bugs in XC in the planner and executor would not show up in XL (that is not to say that we in Postgres-XL do not have any of our own bugs :-)).
Bugs related to GTM could show up in both, but in Postgres-XL additional precautions have been taken to reduce the likelihood of problems.
Right, but generally, why not just integrate that functionality with the actual data nodes?
In the diagram they have two, but where you have two, you can add more for more reliability, right? Following that thinking down the rabbit hole, you've got distributed transaction monitoring across the data nodes (Paxos/Raft)... Just wondering why that wasn't the goal.
For anyone who wants to help: I've been looking at a bunch of different ways to achieve high-availability Postgres. What approach would you all suggest for a solo startup on a limited budget?
I had a hell of a time getting PgPool II w/ PG Streaming Replication set up right. The trivial cases seemed to be fine, but when I started triggering failovers back and forth, I ran into a lot of cases where PgPool II would go stale due to a data file left around in /tmp.
I eventually got it working, but there were way too many informally created scripts that PgPool and PG had to know to trigger failovers, initiate resyncs from WALs, etc. I didn't like it at all, and around about then AWS started offering PG on RDS, so I just moved to that.
So my advice would be: unless you've got someone on the team for whom that isn't much work, you get a lot of benefit from going with hosted. RDS Postgres has been pretty great - not exceptional, but for my use cases, okay. Hoping they add cross-region read replicas for PG sometime soon, as that would make a lot of expansion opportunities really easy.
I'm worried that this is going to sound flippant (not my intention), but use Heroku. You're under significant time constraints with being a solo founder and it would most likely be a win to just not have to worry about this one technical aspect.
Nothing wrong with saying that at all. It is easier to get started with. We should try to create some AMI images to make it a little easier to get up and running on AWS with Postgres-XL.
I am also running a solo startup. Don't run your own database. It is an unwanted distraction and data loss due to poor backup procedures (i.e. do you actually test backup/restore) can often kill your startup before it has a chance to take off.
Plenty of great hosted database solutions e.g. AWS RDS (PostgreSQL, MySQL, Oracle), MongoLab, MongoHQ, Cloudant, Firebase, Azure etc.
http://www.packtpub.com/postgresql-replication/book is a great book about how postgresql replication works. Thing that I love about the book is that it doesn't just describe how to setup replication, it nicely describes how postgresql works (disk layout, the wal/transaction logs, fsync consistency, checkpoints, etc).
I'd recommend doing a master-slave setup. Doing a master-master is a lot more complicated. If the master goes down, you can have monitoring software promote a slave to be the new master.
For a solo startup on a limited budget I would recommend a single-server Postgres setup. Don't solve scaling problems until they become a real problem.
Though for HA, default Postgres has everything you need to get started.
The one thing that's missing is automated failover, but personally I wouldn't do that with a database: from experience, of all the issues that would trigger an automated failover, only a very small part would actually be solved by it, whereas most of them would probably be made a lot worse.
Especially considering that undoing a failover is very costly (in terms of time, not necessarily money) until we get 9.4 where failing back to the original master won't need a full resync.
As much as I dislike lots about Oracle, the Oracle DB RAC cluster sets the bar pretty high, imho. Multi-master with automated failover that just works.
I've been working on an automated failover framework based on Zookeeper and ZFS for some time now. Whilst we use it in production, it's been hampered by bugs we keep hitting in the PG replication layer. Coupled with the problem that the PG WAL includes checkpoint data in addition to transaction logs, it makes it quite difficult to get right and make the system completely hands off.
Sure, didn't mean to imply that I thought this was a big advancement there (haven't looked at the options deep enough to tell, this included), just that the two concerns are different and YAGNI may apply to scale but not HA (though it can certainly apply to HA as well).
The site is a little thin on details and internals. I would be very interested to learn more about how they handle failovers whilst still guaranteeing "strong consistency".
Failovers are currently handled outside of Postgres-XL.
Consistency is achieved because all of the nodes use the same transaction ids and snapshots (list of running transactions), via the Global Transaction Manager. There is no need to worry about statements being executed in a different order across the nodes.
Well, "synchronous replication" is used in a couple of ways here.
Tables can be designated as replicated or distributed (sharded). Replicated tables are typically fairly static. These are handled synchronously in the cluster on every datanode where the table resides. Actually, a write is first applied on a designated "primary" node, and upon completion, it is executed on the other nodes. The reason for this is to reduce the chance of deadlocks: if it succeeds on the primary, it has obtained all of the needed locks, and we can be sure we can then get the locks on the other nodes.
In addition, the term synchronous replication is also used in the same sense as in PostgreSQL's replication: in Postgres-XL a datanode can have a standby replica that is kept up to date synchronously. It is a warm, non-queryable standby.
With regards to a WAN environment, Postgres-XL is not designed for that because of the latency, and for handling long network outages. If there is enough interest, that could be an area to enhance in the future, but consistency may be compromised.
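The primary-first ordering for replicated tables described above can be sketched as follows; `apply` stands in for per-node statement execution, and the whole thing is a simplification of what the cluster actually does:

```python
def replicate_write(primary, others, apply):
    """Apply a statement to the primary, then to every other replica node."""
    apply(primary)          # may raise on a lock conflict / deadlock
    for node in others:     # safe now: the primary serialized the lock order
        apply(node)
```

The point is ordering, not fan-out speed: if two concurrent writers both go to the same primary first, one of them blocks there, so the replicas never see the two writes in conflicting orders.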
I didn't get replication out of the description. It sounded more like distributed storage and load. Notice they say "warm-standby" for data nodes. I can only think of one reason anybody would resort to STONITH for something like this.