Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The article convinced me that talking about databases as CP or AP is not very useful.

The article did not convince me that CAP-centered thinking is itself harmful or counterproductive. It's true that C-Consistency and A-Availability are relatively blunt definitions in a landscape of diverse and subtle possible semantics. But their usefulness is that they represent our intuitive notions of what a database should do.

If it were possible and had good performance, every single database system would provide C, A, and P, because it would make distributed databases much easier to reason about and program against. Intuitively, it really seems like a database ought to read the data that was most recently written. And it sure would be nice if a distributed database could somehow magically remain available and consistent even in the presence of a partition.

C, A, and P are useful because they are a yardstick against which we can measure a distributed database's actual capabilities against what a distributed database could do in an ideal world. The real world will fall short of that, and the CAP theorem gives a vocabulary for which part(s) of the ideal world real databases cannot satisfy.

For example, even in this article CAP terminology is used to describe interesting edge cases in ZooKeeper which, despite having generally strong consistency guarantees, does not always guarantee full capital-C Consistency.



> The article did not convince me that CAP-centered thinking is itself harmful or counterproductive. It's true that C-Consistency and A-Availability are relatively blunt definitions in a landscape of diverse and subtle possible semantics. But their usefulness is that they represent our intuitive notions of what a database should do.

I duno, I think the two major points of the article were: 1. CAP-centered thinking can be harmful or counterproductive; and 2. by virtue of the fact that most traditional database systems don't fall into CP/AP categories, maybe those categories don't really have much to do with what we expect of databases. One could go a step further and make a case that 1 and 2 imply that CAP-centered thinking for many use-cases is harmful or counterproductive.

But I'm not gonna do that here. I do not want that hot potato.


It really depends on the amount of writes your database needs to handle, and the overall quantity of data. There are very strong CAP systems (Zookeeper comes to mind), but I wouldn't want that for very large quantities of data, as it wouldn't be able to keep up.

For example, if you are doing logging for a few million simultaneous users (say 5-8 per resource request across service layers), a single system wouldn't be able to keep up, and definitely a single system that has to coordinate each write as an atomic/consistent state.

The fact is, depending on your needs, you will need to sacrifice something to reach a scale needed by some systems and to minimize down time. It's a trade-off. And any distributed database will have real-world weaknesses.


> There are very strong CAP systems

There are not strong CAP systems -- that is the whole point of the CAP theorem. It's impossible. Am I misunderstanding you?


No, zookeeper doesn't do magic. It does, however, create a large enough value for P to be lost that it can be thought of as CAP; The number of failures required is high enough that, in normal use and sane configurations, you won't partition.


P (partition tolerance) refers to network partitions. There's nothing a distributed system can do to prevent network partitions; they're a property of the network. The question is, in the face of partitions, what does the software do? It basically has two options:

* Respond to requests, even though it may not have the most up to date information. I.e., it sacrifices consistency. These are AP systems.

* Not respond to requests, in which case it sacrifices availability. These are CP systems, of which ZK is one.

In particular with ZK, if you lose quorum (i.e., the cluster has fewer than (n + 1) / 2 active nodes where n is the cluster size) the cluster (or partition thereof) will become unavailable in order to avoid sacrificing consistency.


So it should be called the CA theorem, right? You can't choose partition-intolerance.


Non-distributed databases (like postgres) are sometimes considered to be CA. But I think it's clear that CA doesn't make sense in a distributed system.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: