Hacker News
Postgres-XL: Scalable Open Source PostgreSQL-based Database Cluster (postgres-xl.org)
152 points by icholy on June 11, 2014 | hide | past | favorite | 66 comments


It's a huge pet peeve of mine, but why do people present network maps of these huge distributed things featuring ONE SINGLE load balancer in front of it all!?


Actually, I'm curious. What are best practices for setting up redundancy at the load balancer level? Options I've seen are:

- Hot standby that detects when the master is down and takes over as master when needed

- DNS-level solutions that distribute across multiple load balancers

But DNS has a TTL that may not be honored by all ISPs, so how do you build a setup with truly no single point of failure at the load balancer?


DNS-level solutions are oftentimes misunderstood, probably because there are a couple of things that can be done at this level:

1. "Geo-DNS" is about using an anycast network to direct users to their nearest datacenter(s). This does _not_ aid High Availability at all.

2. DNS Round Robin is about distributing the load between multiple IPs. As a load balancing solution it is relatively poor, because you have no control over the actual balancing and can end up receiving most users through a single IP.

3. DNS Failover solutions that replace the IP when the server goes down, which is also a poor solution because of TTL and non-TTL browser caches.

4. DNS Round Robin but for the High Availability, not for the balancing. This is actually an interesting approach because most modern browsers automatically switch to using (one of) the other record(s) when the IP they were using goes down (sorry, I have no reference that clearly states which browsers do and which ones don't exhibit this behavior). In fact, there are some sources around [1] that seem to identify this approach as the only one to achieve instant failover in the face of datacenter-wide outages.

[1] http://serverfault.com/questions/69870/multiple-data-centers...
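Approach #4 relies on the client walking the A records itself. A minimal sketch of that behavior (the function name and the short per-attempt timeout are my own choices, not anything browsers standardize on):

```python
import socket

def connect_first_alive(host, port, timeout=3.0):
    """Try each address returned for `host` in turn: if one IP is
    unreachable, move on to the next A record instead of failing.
    A short per-attempt timeout keeps a blackholed address from
    hanging the client for the OS default of ~30s."""
    last_err = None
    for *_, addr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as err:
            last_err = err  # this address is down; try the next one
    raise last_err or OSError("no addresses returned for %s" % host)
```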


I have experience using approach #4 for global high availability, and the main downside is that most network outages result in packet loss rather than returning an immediate ICMP error, so browsers will hang for about 30 seconds before timing out and trying the next A record.

My personal preference is to combine approaches #3 and #4: have all hosts in the round robin with a lowish TTL, but automatically remove any host that goes down.


1. If you can pull your routes, it sure does.


I'm using two HAProxy machines and keepalived to ensure that if one goes down, the other takes over.

Because HTTP in general, and therefore HAProxy, is largely stateless, this is really easy and safe to do. I've had the opportunity to fail over multiple times already (mainly for software updates, once because of a hardware issue) and I never had any problems with the setup.
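For reference, the keepalived side of such a setup is a small VRRP config. This is only a sketch: the interface name, router id, priority, and VIP below are made-up values, and the standby runs the same file with `state BACKUP` and a lower priority:

```
vrrp_script chk_haproxy {
    script "killall -0 haproxy"   # exit 0 while an haproxy process exists
    interval 2
}

vrrp_instance LB_VIP {
    state MASTER                  # the standby uses "state BACKUP"
    interface eth0                # assumption: the public-facing NIC
    virtual_router_id 51
    priority 101                  # standby gets a lower value, e.g. 100
    advert_int 1
    virtual_ipaddress {
        203.0.113.10/24           # the floating VIP clients connect to
    }
    track_script {
        chk_haproxy               # demote this box if HAProxy dies
    }
}
```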


This sounds like a split-brain disaster waiting to happen, when a network partition between the two load balancers makes both think that the other one is down.


This is why you only ever automatically fail over in one direction, and (ideally) can STONITH.


I was running a similar system. What solution would you propose?


I'm no expert, but I think the main idea is to have an odd number of nodes plus a consensus protocol like Paxos or Raft to figure out who's actually alive. If a node finds itself in a split minority, it won't become a master. This doesn't work when you have only two nodes, as there's no minority after the split, both sides are equal. Building this from scratch is extremely difficult and error-prone, so people usually use systems like ZooKeeper.
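The two-node dead end falls straight out of the quorum arithmetic. A toy illustration:

```python
def has_quorum(votes, cluster_size):
    # Strict majority: strictly more than half of all configured nodes.
    return votes > cluster_size // 2

# Three nodes: the 2-node side of a partition keeps quorum and can
# elect a master; the isolated node cannot.
assert has_quorum(2, 3)
assert not has_quorum(1, 3)

# Two nodes: after a partition each side holds one vote, so neither
# side can claim a majority -- exactly the problem described above.
assert not has_quorum(1, 2)
```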


Your first option. A VIP with a heartbeat failover setup. DNS is useful when you have multiple endpoints and need to push more bandwidth than a single LB/box can handle. But for anything under ~5 Gbit/sec, the active-hot-standby setup works great.


Out of interest, how would you do this on a VPS like Digital Ocean, where IPs cannot be reassigned?


You'd need some place that allows for VIPs I think... I've never used Digital Ocean. Sorry.


The hot-standby approach, using VRRP to manage a floating address between the two boxes, does work and fails over quickly (unnoticeably to most clients), but getting it set up correctly is a project.


For a database it would be not having a load balancer at all and performing connection pooling / failover at the client.


Because load balancers in a diagram are generally a high availability pool of load balancers.


Because everyone loves a Single Point of Failure? /s

It is probably because they don't provide the failover themselves and assume you have a standby LB alongside the active one. So only one needs to be shown, and you handle the failover on your side rather than theirs. At least, that is the context in which I generally see that sort of thing.


Because they are not good at diagrams? Just emphasizing the key components of the architecture.


This issue is discussed (not by me) a bit more in depth here: https://plus.google.com/+KristianK%C3%B6hntopp/posts/HtQB6hJ...


Because while the load balancer shown might (and often, in practice, would) actually be a multi-component system of its own, that's a concern outside the focus of the diagram and would distract from what it is attempting to illustrate.


I understand that building a database cluster is a hard problem, and that making it easy to use is even harder, but this seems really complicated to set up. I'm not a fan of needing "transaction managers", "coordinators" and data nodes; I would like these to all run on the same server.

Of course, Postgres-XL may not exactly be designed for my use case, where I just want to be able to survive one node failing, or to take a server down to do upgrades. Ideally it would be as simple as the built-in PostgreSQL replication has become.


RethinkDB is an interesting project which is exceptionally simple to cluster - it's actually their "Hello World!" example.

http://rethinkdb.com/


Not even single-node transactions, while Postgres-XL has inter-node transactions.

And sharding is currently primary-key only.


It may be that coordinators and datanodes, which already share the same binary, will eventually become one and the same process. At the moment that is not the case.

The Global Transaction Manager here ensures that transaction ids and snapshots (the context in which statements are run, the other running transactions), are exactly the same across the entire cluster. Without it, there are theoretical possibilities of inconsistencies based on the timing of concurrently running transactions and when connections to other data nodes are taken.

The GTM ensures cluster-wide Multi-version Concurrency Control whereby readers do not block writers and writers do not block readers and ACID properties are respected globally.

As for taking a server down for upgrades, well, that is beyond the scope of the project. There have been people looking at XL for simple load balancing and HA, we should perhaps consider making it easier for this use case, but I think some use pgpool for that.


The page is light on details, what is this exactly? A fork of Postgres, or something you run on top to orchestrate clustering?



I know it's expected of TheReg, but the opening lines made me laugh: "TransLattice gobbles StormDB upstart for Postgres threesome.

Lashes PostgreSQL, -R, and Postgres-XC together for unholy database tryst"


Postgres-R! I knew about Postgres-R, but I thought it was a research project which had never really gone anywhere. I am highly intrigued to learn that it is still in play. From what I remember, there were some terrifically clever ideas about multi-master databases at the heart of it. I don't think there's anything else quite like it in the open source world.


Postgres-R bits have not made their way into Postgres-XL, but some parts of Postgres-XC have. Some XC code is merged in to make future merging into Postgres-XL easier, even parts that are not used.

Some parts of Postgres-R are in the TransLattice Elastic Database (TED), but that is a proprietary closed source product, sorry. (I work for TransLattice, who also open sourced Postgres-XL.)


Ah, I see. It's a shame that no descendant of Postgres-R is open source, but I'm still pleased to hear that there is one, even if it's proprietary. Thanks for the information.


I would think of it more as a branch. A big branch.

The intent is to always merge in upstream from PostgreSQL.

Anyway, you do not run on top of it, changes have been made deep in the bowels of PostgreSQL.


Has anyone used this in production?


Oh, I had thought the link was about Postgres-XC. So it's a different project.

"The project includes architects and developers who previously worked on both Postgres­-XC and Stado, and Postgres-XL contains code from Postgres-­XC. The Postgres-XL project has its own philosophy and approach."

So should I expect a brighter future for Postgres-XL rather than Postgres-XC? It seems that the Postgres-XC repository is still getting some commits, though the rate is quite low.


From what I can see, Postgres-XC is the original project but Postgres-XL is sort of a 'stable fork'.

It's very murky and, to be honest, very frustrating. I just want a Postgres that scales and doesn't cost me $$$ like FoundationDB does. The Postgres-XC team is far too insular and doesn't seem to care about outside users. I hope Postgres-XL can inspire them to push ahead, and then hopefully we can see a Postgres 9.4-based clustering solution for write scalability. (And no, Pgpool-2 doesn't count.)


I'm completely new to both Postgres-XC and Postgres-XL, and just started looking at the documentation. It seems that the main structures of the two are similar, except some features implemented in Postgres-XL like allowing node-to-node data transfer.

It is natural, because they've just forked the project. But I think if there are some gotchas in the original project, Postgres-XL would also expose similar problems. What have been your main concerns using Postgres-XC? Any design issues?


The planner and executor in Postgres-XL are quite different, and direct node-to-node data transfer for MPP parallelism results in big performance gains.

Any bugs in XC in the planner and executor would not show up in XL (that is not to say that we in Postgres-XL do not have any of our own bugs :-)).

Bugs related to GTM could show up in both, but in Postgres-XL additional precautions have been taken to reduce the likelihood of problems.


The overview doesn't provide much actual info, but does "Global Transaction Monitor" sound like "single point of failure" to anybody?

I wonder why someone can't just add etcd to postgres and create self-sufficient distributed postgres nodes (I am oversimplifying, surely)


> The overview doesn't provide much actual info, but does "Global Transaction Monitor" sound like "single point of failure" to anybody?

Since you can have more than one of them, no.


Right, but generally, why not just integrate that functionality with the actual data nodes?

In the diagram they have two, but where you have two, you can add more for more reliability, right? Following that thinking down the rabbit hole, you've got distributed transaction monitoring across the data nodes (Paxos/Raft)... Just wondering why that wasn't the goal.


It is possible to configure a GTM Standby on another node to take over for GTM. It currently requires manual failover, however.


For anyone who wants to help: I've been looking at a bunch of different ways to achieve high-availability Postgres. What approach would you all suggest for a solo startup on a limited budget?


I had a hell of a time getting PgPool II w/ PG Streaming Replication set up right. The trivial cases seemed to be fine, but when I started triggering failovers back and forth, I ran into a lot of cases where PgPool II would go stale due to a data file left around in /tmp.

I eventually got it working, but there were way too many informally created scripts that PgPool and PG had to know to trigger failovers, initiate resyncs from WALs, etc. I didn't like it at all, and around about then AWS started offering PG on RDS, so I just moved to that.

So, my advice would be: unless you've got someone on the team for whom that isn't too much work, you get a lot of benefit from going with hosted. RDS Postgres has been pretty great - not exceptional, but for my use cases, okay. Hoping they add cross-region read replicas for PG sometime soon, as that would make a lot of expansion opportunities really easy.


I'm worried that this is going to sound flippant (not my intention), but use Heroku. You're under significant time constraints with being a solo founder and it would most likely be a win to just not have to worry about this one technical aspect.


Why recommend Heroku for that as opposed to AWS RDS? The pricing seems to be significant (Heroku "Ika" is $750/mo, db.m3.large multi-az is $280/mo)


Well, the main reason was that I didn't realize that AWS RDS was offering Postgres now (looks like it is still in beta, but awesome news nonetheless).


Nothing wrong with doing that at all. It is easier to get started with. We should try to create some AMI images to make it a little easier to get up and running at AWS with Postgres-XL.


I am also running a solo startup. Don't run your own database. It is an unwanted distraction and data loss due to poor backup procedures (i.e. do you actually test backup/restore) can often kill your startup before it has a chance to take off.

Plenty of great hosted database solutions e.g. AWS RDS (PostgreSQL, MySQL, Oracle), MongoLab, MongoHQ, Cloudant, Firebase, Azure etc.


+1 for the reality that an untested backup is potentially no backup at all. It's easy to have a false sense of security about these things.


http://www.packtpub.com/postgresql-replication/book is a great book about how postgresql replication works. Thing that I love about the book is that it doesn't just describe how to setup replication, it nicely describes how postgresql works (disk layout, the wal/transaction logs, fsync consistency, checkpoints, etc).

I'd recommend doing a master-slave setup. Doing a master-master is a lot more complicated. If the master goes down, you can have monitoring software promote a slave to be the new master.
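A monitoring-and-promotion loop of the kind described can be sketched as below. The hostnames, data directory, and thresholds are placeholders, and the liveness probe is deliberately crude; dedicated tooling (e.g. repmgr) does all of this far more carefully:

```python
import socket
import subprocess
import time

PRIMARY = ("db-master.internal", 5432)             # placeholder address
STANDBY_DATA_DIR = "/var/lib/postgresql/9.3/main"  # placeholder path

def primary_alive(addr, timeout=2.0):
    # Crude liveness probe: can we open a TCP connection to the primary?
    try:
        socket.create_connection(addr, timeout=timeout).close()
        return True
    except OSError:
        return False

def monitor(checks_before_promote=3, interval=5):
    failures = 0
    while True:
        failures = 0 if primary_alive(PRIMARY) else failures + 1
        if failures >= checks_before_promote:
            # `pg_ctl promote` tells a streaming-replication standby to
            # leave recovery and start accepting writes.
            subprocess.check_call(
                ["pg_ctl", "promote", "-D", STANDBY_DATA_DIR])
            return
        time.sleep(interval)
```

Requiring several consecutive failed checks keeps a brief network blip from triggering a promotion, which you only ever want to happen in one direction.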


For a solo startup on a limited budget I would recommend a single-server Postgres setup. Don't solve scaling problems until they become a real problem.


Scaling is one thing; high availability is another. Depending on the product, the latter might well be important early.


Though for HA, default Postgres has everything you need to get started.

The one thing that's missing is automated failover, but personally, I wouldn't automate that with a database: in my experience, of all the issues that would trigger an automated failover, only a small fraction would actually be solved by failing over, while most would probably be made a lot worse.

Especially considering that undoing a failover is very costly (in terms of time, not necessarily money) until we get 9.4 where failing back to the original master won't need a full resync.


As much as I dislike lots about Oracle, the Oracle DB RAC cluster sets the bar pretty high, imho. Multi-master with automated failover that just works.


I've been working on an automated failover framework based on Zookeeper and ZFS for some time now. Whilst we use it in production, it's been hampered by bugs we keep hitting in the PG replication layer. Coupled with the problem that the PG WAL includes checkpoint data in addition to transaction logs, it makes it quite difficult to get right and make the system completely hands off.


Sure, didn't mean to imply that I thought this was a big advancement there (haven't looked at the options deep enough to tell, this included), just that the two concerns are different and YAGNI may apply to scale but not HA (though it can certainly apply to HA as well).


Yes, but ensure that you have backups, and possibly a warm standby.

Postgres-XL is hopefully at least reassuring that once those scaling problems do crop up, there is something available that can help.



Here's a quick walkthrough of how to set up master/slave streaming replication over SSL:

http://www.rassoc.com/gregr/weblog/2013/02/16/zero-to-postgr...

This will work over the internet between data centers if you're so inclined...depending on your traffic, a simple way to get started.
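For a sense of scale, the moving parts behind such a walkthrough amount to only a handful of settings. A hedged 9.x-era sketch (hostnames, addresses, and the replication user are placeholders; standbys of that era use a `recovery.conf` file):

```
# primary, postgresql.conf
wal_level = hot_standby
max_wal_senders = 3
ssl = on

# primary, pg_hba.conf -- require SSL for the replication connection
hostssl  replication  repluser  203.0.113.20/32  md5

# standby, recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=repluser sslmode=require'
```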


The site is a little thin on details and internals. I would be very interested to learn more about how they handle failovers whilst still guaranteeing "strong consistency".


Failovers are currently handled outside of Postgres-XL.

Consistency is achieved because all of the nodes use the same transaction ids and snapshots (list of running transactions), via the Global Transaction Manager. There is no need to worry about statements being executed in a different order across the nodes.
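The shared-snapshot idea can be shown with a toy visibility check. This is a heavy simplification of PostgreSQL's actual MVCC rules, but it captures why identical snapshots on every node yield identical answers:

```python
def visible(xid, snapshot_xmax, running_xids, committed_xids):
    """A row version written by transaction `xid` is visible to a
    snapshot iff that transaction committed and was neither still
    running when the snapshot was taken nor started after it."""
    if xid >= snapshot_xmax:      # began after the snapshot was taken
        return False
    if xid in running_xids:       # in flight when the snapshot was taken
        return False
    return xid in committed_xids  # aborted writes are never visible

# A snapshot taken while xid 8 was still running (xmax = 10):
assert visible(5, 10, {8}, {5, 8})        # committed earlier: visible
assert not visible(8, 10, {8}, {5, 8})    # concurrent: invisible
assert not visible(12, 10, {8}, {12})     # started later: invisible
```

Because the GTM hands every node the same xmax and running-transaction list, each node evaluates the same predicate and no node can disagree about which rows a statement sees.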


Looks good. I wish Aphyr would unleash his Knossos on it. That would be very informative. In other words wonder how it handles network partitions...


Hopefully we'll find out more on June 17th: http://www.postgres-xl.org/2014/06/learn-about-postgres-xl-i...

I can't wait!


Does it support a kind of Oracle RAC TAF Select?

http://www.dba-oracle.com/art_oramag_rac_taf.htm


From the description it sounds as if this provides synchronous replication. How does this work in a WAN environment?


Well, "synchronous replication" is used in a couple of ways here.

Tables can be designated as replicated or distributed (sharded). Replicated tables are typically fairly static. These are handled synchronously in the cluster on every datanode where the table resides. Actually, it first applies to a designated "primary" node, and upon completion, it will execute on the other nodes. The reason for this is to reduce the chance for deadlocks; if it succeeds on the primary, it has obtained all of the needed locks and we can be sure we can then get the locks on the other nodes.

In addition, the term synchronous replication is also used in the PostgreSQL sense: in Postgres-XL a datanode can have a standby replica that it streams to synchronously. It is a warm, non-queryable standby.

With regards to a WAN environment, Postgres-XL is not designed for that because of the latency, and for handling long network outages. If there is enough interest, that could be an area to enhance in the future, but consistency may be compromised.


I didn't get replication out of the description. It sounded more like distributed storage and load. Notice they say "warm-standby" for data nodes. I can only think of one reason anybody would resort to STONITH for something like this.



