Hacker News
Postgres-XL: Scalable Open Source PostgreSQL-based Database Cluster (postgres-xl.org)
152 points by icholy on June 11, 2014 | hide | past | favorite | 66 comments


It's a huge pet peeve of mine, but why do people present network maps of these huge distributed things featuring ONE SINGLE load balancer in front of it all!?


Actually, I'm curious. What are best practices for setting up redundancy at the load balancer level? Options I've seen are:

- Hot standby that detects when the master is down and takes over as master when needed

- DNS-level solutions that distribute across multiple load balancers

But DNS has a TTL that may not be honored by all ISPs, so how do you build a setup with truly no single point of failure at the load balancer?


DNS-level solutions are oftentimes misunderstood, probably because there are a couple of things that can be done at this level:

1. "Geo-DNS" is about using an anycast network to direct users to their nearest datacenter(s). This does _not_ aid High Availability at all.

2. DNS Round Robin is about distributing the load between multiple IPs. As a load balancing solution it is relatively poor, because you have no control over the actual balancing and can end up receiving most users through a single IP.

3. DNS Failover solutions that replace the IP when the server goes down, which is also a poor solution because of TTL and non-TTL browser caches.

4. DNS Round Robin but for the High Availability, not for the balancing. This is actually an interesting approach because most modern browsers automatically switch to using (one of) the other record(s) when the IP they were using goes down (sorry, I have no reference that clearly states which browsers do and which ones don't exhibit this behavior). In fact, there are some sources around [1] that seem to identify this approach as the only one to achieve instant failover in the face of datacenter-wide outages.

[1] http://serverfault.com/questions/69870/multiple-data-centers...
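Approach #4 relies on the client walking the A records itself. A minimal sketch of that behavior (the function name and the short per-attempt timeout are my own choices, not anything browsers standardize on):

```python
import socket

def connect_first_alive(host, port, timeout=3.0):
    """Try each address returned for `host` in turn: if one IP is
    unreachable, move on to the next A record instead of failing.
    A short per-attempt timeout keeps a blackholed address from
    hanging the client for the OS default of ~30s."""
    last_err = None
    for *_, addr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as err:
            last_err = err  # this address is down; try the next one
    raise last_err or OSError("no addresses returned for %s" % host)
```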


I have experience using approach #4 for global high availability, and the main downside is that most network outages result in packet loss rather than returning an immediate ICMP error, so browsers will hang for about 30 seconds before timing out and trying the next A record.

My personal preference is to combine approaches #3 and #4: have all hosts in the round robin with a lowish TTL, but automatically remove any host that goes down.


1. If you can pull your routes, it sure does.


I'm using two HAProxy machines and keepalived to ensure that if one goes down, the other takes over.

Because HTTP in general, and therefore HAProxy, is largely stateless, this is really easy and safe to do. I've had the opportunity to fail over multiple times already (mainly for software updates, once because of a hardware issue) and I never had any problems with the setup.
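For reference, the keepalived side of such a setup is a small VRRP config. This is only a sketch: the interface name, router id, priority, and VIP below are made-up values, and the standby runs the same file with `state BACKUP` and a lower priority:

```
vrrp_script chk_haproxy {
    script "killall -0 haproxy"   # exit 0 while an haproxy process exists
    interval 2
}

vrrp_instance LB_VIP {
    state MASTER                  # the standby uses "state BACKUP"
    interface eth0                # assumption: the public-facing NIC
    virtual_router_id 51
    priority 101                  # standby gets a lower value, e.g. 100
    advert_int 1
    virtual_ipaddress {
        203.0.113.10/24           # the floating VIP clients connect to
    }
    track_script {
        chk_haproxy               # demote this box if HAProxy dies
    }
}
```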


This sounds like a split-brain disaster waiting to happen, when a network partition between the two load balancers makes both think that the other one is down.


This is why you only ever automatically fail over in one direction, and (ideally) can STONITH.


I was running a similar system. What solution would you propose?


I'm no expert, but I think the main idea is to have an odd number of nodes plus a consensus protocol like Paxos or Raft to figure out who's actually alive. If a node finds itself in a split minority, it won't become a master. This doesn't work when you have only two nodes, as there's no minority after the split, both sides are equal. Building this from scratch is extremely difficult and error-prone, so people usually use systems like ZooKeeper.
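The two-node dead end falls straight out of the quorum arithmetic. A toy illustration:

```python
def has_quorum(votes, cluster_size):
    # Strict majority: strictly more than half of all configured nodes.
    return votes > cluster_size // 2

# Three nodes: the 2-node side of a partition keeps quorum and can
# elect a master; the isolated node cannot.
assert has_quorum(2, 3)
assert not has_quorum(1, 3)

# Two nodes: after a partition each side holds one vote, so neither
# side can claim a majority -- exactly the problem described above.
assert not has_quorum(1, 2)
```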


Your first option. A VIP with a heartbeat failover setup. DNS is useful when you have multiple endpoints and need to push more bandwidth than a single LB/box can handle. But for anything under ~5 Gbit/sec, the active-hot-standby setup works great.


Out of interest, how would you do this on a VPS like Digital Ocean, where IPs cannot be reassigned?


You'd need some place that allows for VIPs I think... I've never used Digital Ocean. Sorry.


The hot-standby approach, using VRRP to manage a floating address between the two boxes, does work and fails over quickly (unnoticeably to most clients), but getting it set up correctly is a project.


For a database it would be not having a load balancer at all and performing connection pooling / failover at the client.


Because load balancers in a diagram are generally a high availability pool of load balancers.


Because everyone loves a Single Point of Failure? /s

It is probably because they don't provide the failover themselves and assume you have a standby LB alongside the active one. So only one needs to be shown, and you handle the failover on your side rather than theirs. At least, that is the context in which I generally see that sort of thing.


Because they are not good at diagrams? Just emphasizing the key components of the architecture.


This issue is discussed (not by me) a bit more in depth here: https://plus.google.com/+KristianK%C3%B6hntopp/posts/HtQB6hJ...


Because while the load balancer shown might (and often, in practice, would) actually be a multi-component system of its own, that's a concern outside the focus of the diagram and would distract from what it is attempting to illustrate.


I understand that building a database cluster is a hard problem, and that making it easy to use is even harder, but this seems really complicated to set up. I'm not a fan of needing "transaction managers", "coordinators" and data nodes; I would like these to all run on the same server.

Of course, Postgres-XL may not exactly be designed for my use case, where I just want to be able to survive one node failing, or to take a server down to do upgrades. Ideally it would be as simple as the built-in PostgreSQL replication has become.


RethinkDB is an interesting project which is exceptionally simple to cluster - it's actually their "Hello World!" example.

http://rethinkdb.com/


Not even single-node transactions, while Postgres-XL has inter-node transactions.

And sharding is currently primary-key only.


It may be that coordinators and datanodes, which already share the same binary, will eventually become one and the same process. At the moment that is not the case.

The Global Transaction Manager here ensures that transaction ids and snapshots (the context in which statements are run, the other running transactions), are exactly the same across the entire cluster. Without it, there are theoretical possibilities of inconsistencies based on the timing of concurrently running transactions and when connections to other data nodes are taken.

The GTM ensures cluster-wide Multi-version Concurrency Control whereby readers do not block writers and writers do not block readers and ACID properties are respected globally.

As for taking a server down for upgrades, well, that is beyond the scope of the project. There have been people looking at XL for simple load balancing and HA, we should perhaps consider making it easier for this use case, but I think some use pgpool for that.


The page is light on details, what is this exactly? A fork of Postgres, or something you run on top to orchestrate clustering?



I know it's expected of TheReg, but the opening lines made me laugh: "TransLattice gobbles StormDB upstart for Postgres threesome.

Lashes PostgreSQL, -R, and Postgres-XC together for unholy database tryst"


Postgres-R! I knew about Postgres-R, but I thought it was a research project which had never really gone anywhere. I am highly intrigued to learn that it is still in play. From what I remember, there were some terrifically clever ideas about multi-master databases at the heart of it. I don't think there's anything else quite like it in the open source world.


Postgres-R bits have not made their way into Postgres-XL, but some parts of Postgres-XC have. Some XC code is merged in to make future merging into Postgres-XL easier, even parts that are not used.

Some parts of Postgres-R are in the TransLattice Elastic Database (TED), but that is a proprietary closed source product, sorry. (I work for TransLattice, who also open sourced Postgres-XL.)


Ah, I see. It's a shame that no descendant of Postgres-R is open source, but I'm still pleased to hear that there is one, even if it's proprietary. Thanks for the information.


I would think of it more as a branch. A big branch.

The intent is to always merge in upstream from PostgreSQL.

Anyway, you do not run on top of it, changes have been made deep in the bowels of PostgreSQL.


Has anyone used this in production?


Oh, I had thought the link was about Postgres-XC. So it's a different project.

"The project includes architects and developers who previously worked on both Postgres­-XC and Stado, and Postgres-XL contains code from Postgres-­XC. The Postgres-XL project has its own philosophy and approach."

So should I expect a brighter future for Postgres-XL rather than Postgres-XC? It seems that the Postgres-XC repository is still getting some commits, though the rate is quite low.


From what I can see, Postgres-XC is the original project but Postgres-XL is sort of a 'stable fork'.

It's very murky and, to be honest, very frustrating. I just want a Postgres that scales and doesn't cost me $$$ like FoundationDB does. The Postgres-XC team is far too insular and doesn't seem to care about outside users. I hope Postgres-XL can inspire them to push ahead, and then hopefully we can see a Postgres 9.4-based clustering solution for write scalability. (And no, Pgpool-2 doesn't count.)


I'm completely new to both Postgres-XC and Postgres-XL, and just started looking at the documentation. It seems that the main structures of the two are similar, except some features implemented in Postgres-XL like allowing node-to-node data transfer.

It is natural, because they've just forked the project. But I think if there are some gotchas in the original project, Postgres-XL would also expose similar problems. What have been your main concerns using Postgres-XC? Any design issues?


The planner and executor in Postgres-XL are quite different, and direct node-to-node data transfer for MPP parallelism results in big performance gains.

Any bugs in XC in the planner and executor would not show up in XL (that is not to say that we in Postgres-XL do not have any of our own bugs :-)).

Bugs related to GTM could show up in both, but in Postgres-XL additional precautions have been taken to reduce the likelihood of problems.


The overview doesn't provide much actual info, but does "Global Transaction Monitor" sound like "single point of failure" to anybody?

I wonder why someone can't just add etcd to postgres and create self-sufficient distributed postgres nodes (I am oversimplifying, surely)


> The overview doesn't provide much actual info, but does "Global Transaction Monitor" sound like "single point of failure" to anybody?

Since you can have more than one of them, no.


Right, but generally, why not just integrate that functionality with the actual data nodes?

In the diagram they have two, but where you have two, you can add more for more reliability, right? Following that thinking down the rabbit hole, you've got distributed transaction monitoring across the data nodes (Paxos/Raft)... Just wondering why that wasn't the goal.


It is possible to configure a GTM Standby on another node to take over for GTM. It currently requires manual failover, however.


For anyone who wants to help: I've been looking at a bunch of different ways to achieve high-availability Postgres. What approach would you all suggest for a solo startup on a limited budget?


I had a hell of a time getting PgPool II w/ PG Streaming Replication set up right. The trivial cases seemed to be fine, but when I started triggering failovers back and forth, I ran into a lot of cases where PgPool II would go stale due to a data file left around in /tmp.

I eventually got it working, but there were way too many informally created scripts that PgPool and PG had to know to trigger failovers, initiate resyncs from WALs, etc. I didn't like it at all, and around about then AWS started offering PG on RDS, so I just moved to that.

So, my advice would be: unless you've got someone on the team for whom that isn't too much work, you get a lot of benefit from going with hosted. RDS Postgres has been pretty great - not exceptional, but for my use cases, okay. Hoping they add cross-region read replicas for PG sometime soon, as that would make a lot of expansion opportunities really easy.


I'm worried that this is going to sound flippant (not my intention), but use Heroku. You're under significant time constraints with being a solo founder and it would most likely be a win to just not have to worry about this one technical aspect.


Why recommend Heroku for that as opposed to AWS RDS? The pricing seems to be significant (Heroku "Ika" is $750/mo, db.m3.large multi-az is $280/mo)


Well, the main reason was that I didn't realize that AWS RDS was offering Postgres now (looks like it is still in beta, but awesome news nonetheless).


Nothing wrong with doing that at all. It is easier to get started with. We should try to create some AMI images to make it a little easier to get up and running at AWS with Postgres-XL.


I am also running a solo startup. Don't run your own database. It is an unwanted distraction and data loss due to poor backup procedures (i.e. do you actually test backup/restore) can often kill your startup before it has a chance to take off.

Plenty of great hosted database solutions e.g. AWS RDS (PostgreSQL, MySQL, Oracle), MongoLab, MongoHQ, Cloudant, Firebase, Azure etc.


+1 for the reality that an untested backup is potentially no backup at all. It's easy to have a false sense of security about these things.


http://www.packtpub.com/postgresql-replication/book is a great book about how postgresql replication works. Thing that I love about the book is that it doesn't just describe how to setup replication, it nicely describes how postgresql works (disk layout, the wal/transaction logs, fsync consistency, checkpoints, etc).

I'd recommend doing a master-slave setup. Doing a master-master is a lot more complicated. If the master goes down, you can have monitoring software promote a slave to be the new master.
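A monitoring-and-promotion loop of the kind described can be sketched as below. The hostnames, data directory, and thresholds are placeholders, and the liveness probe is deliberately crude; dedicated tooling (e.g. repmgr) does all of this far more carefully:

```python
import socket
import subprocess
import time

PRIMARY = ("db-master.internal", 5432)             # placeholder address
STANDBY_DATA_DIR = "/var/lib/postgresql/9.3/main"  # placeholder path

def primary_alive(addr, timeout=2.0):
    # Crude liveness probe: can we open a TCP connection to the primary?
    try:
        socket.create_connection(addr, timeout=timeout).close()
        return True
    except OSError:
        return False

def monitor(checks_before_promote=3, interval=5):
    failures = 0
    while True:
        failures = 0 if primary_alive(PRIMARY) else failures + 1
        if failures >= checks_before_promote:
            # `pg_ctl promote` tells a streaming-replication standby to
            # leave recovery and start accepting writes.
            subprocess.check_call(
                ["pg_ctl", "promote", "-D", STANDBY_DATA_DIR])
            return
        time.sleep(interval)
```

Requiring several consecutive failed checks keeps a brief network blip from triggering a promotion, which you only ever want to happen in one direction.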


For a solo startup on a limited budget I would recommend a single-server Postgres setup. Don't solve scaling problems until they become a real problem.


Scaling is one thing; high availability is another. Depending on the product, the latter might well be important early.


Though for HA, default Postgres has everything you need to get started.

The one thing that's missing is automated failover, but personally, I wouldn't automate that with a database: in my experience, of all the issues that would trigger an automated failover, only a small fraction would actually be solved by failing over, while most would probably be made a lot worse.

Especially considering that undoing a failover is very costly (in terms of time, not necessarily money) until we get 9.4 where failing back to the original master won't need a full resync.


As much as I dislike lots about Oracle, the Oracle DB RAC cluster sets the bar pretty high, imho. Multi-master with automated failover that just works.


I've been working on an automated failover framework based on Zookeeper and ZFS for some time now. Whilst we use it in production, it's been hampered by bugs we keep hitting in the PG replication layer. Coupled with the problem that the PG WAL includes checkpoint data in addition to transaction logs, it makes it quite difficult to get right and make the system completely hands off.


Sure, didn't mean to imply that I thought this was a big advancement there (haven't looked at the options deep enough to tell, this included), just that the two concerns are different and YAGNI may apply to scale but not HA (though it can certainly apply to HA as well).


Yes, but ensure that you have backups, and possibly a warm standby.

Postgres-XL is hopefully at least reassuring that once those scaling problems do crop up, there is something available that can help.



Here's a quick walkthrough of how to set up master/slave streaming replication over SSL:

http://www.rassoc.com/gregr/weblog/2013/02/16/zero-to-postgr...

This will work over the internet between data centers if you're so inclined...depending on your traffic, a simple way to get started.
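For a sense of scale, the moving parts behind such a walkthrough amount to only a handful of settings. A hedged 9.x-era sketch (hostnames, addresses, and the replication user are placeholders; standbys of that era use a `recovery.conf` file):

```
# primary, postgresql.conf
wal_level = hot_standby
max_wal_senders = 3
ssl = on

# primary, pg_hba.conf -- require SSL for the replication connection
hostssl  replication  repluser  203.0.113.20/32  md5

# standby, recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=repluser sslmode=require'
```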


The site is a little thin on details and internals. I would be very interested to learn more about how they handle failovers whilst still guaranteeing "strong consistency".


Failovers are currently handled outside of Postgres-XL.

Consistency is achieved because all of the nodes use the same transaction ids and snapshots (list of running transactions), via the Global Transaction Manager. There is no need to worry about statements being executed in a different order across the nodes.
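The shared-snapshot idea can be shown with a toy visibility check. This is a heavy simplification of PostgreSQL's actual MVCC rules, but it captures why identical snapshots on every node yield identical answers:

```python
def visible(xid, snapshot_xmax, running_xids, committed_xids):
    """A row version written by transaction `xid` is visible to a
    snapshot iff that transaction committed and was neither still
    running when the snapshot was taken nor started after it."""
    if xid >= snapshot_xmax:      # began after the snapshot was taken
        return False
    if xid in running_xids:       # in flight when the snapshot was taken
        return False
    return xid in committed_xids  # aborted writes are never visible

# A snapshot taken while xid 8 was still running (xmax = 10):
assert visible(5, 10, {8}, {5, 8})        # committed earlier: visible
assert not visible(8, 10, {8}, {5, 8})    # concurrent: invisible
assert not visible(12, 10, {8}, {12})     # started later: invisible
```

Because the GTM hands every node the same xmax and running-transaction list, each node evaluates the same predicate and no node can disagree about which rows a statement sees.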


Looks good. I wish Aphyr would unleash his Knossos on it. That would be very informative. In other words wonder how it handles network partitions...


Hopefully we'll find out more on June 17th: http://www.postgres-xl.org/2014/06/learn-about-postgres-xl-i...

I can't wait!


Does it support a kind of Oracle RAC TAF Select?

http://www.dba-oracle.com/art_oramag_rac_taf.htm


From the description it sounds as if this provides synchronous replication. How does this work in a WAN environment?


Well, "synchronous replication" is used in a couple of ways here.

Tables can be designated as replicated or distributed (sharded). Replicated tables are typically fairly static. These are handled synchronously in the cluster on every datanode where the table resides. Actually, it first applies to a designated "primary" node, and upon completion, it will execute on the other nodes. The reason for this is to reduce the chance for deadlocks; if it succeeds on the primary, it has obtained all of the needed locks and we can be sure we can then get the locks on the other nodes.

In addition, the term synchronous replication is also used in the PostgreSQL sense: in Postgres-XL a datanode can have a standby replica that it streams to synchronously. It is a warm, non-queryable standby.

With regards to a WAN environment, Postgres-XL is not designed for that because of the latency, and for handling long network outages. If there is enough interest, that could be an area to enhance in the future, but consistency may be compromised.


I didn't get replication out of the description. It sounded more like distributed storage and load. Notice they say "warm-standby" for data nodes. I can only think of one reason anybody would resort to STONITH for something like this.



