The cuckoo filter is used in the upcoming Cache Digest HTTP standard:
> HTTP/2 frame type to allow clients to inform the server of their cache’s contents. Servers can then use this to inform their choices of what to push to clients.
Not exactly fingerprinting, but it may help to validate (or invalidate) an existing fingerprinting attempt.
E.g.: you identify a user as being X, so you push unique content to them and tell the browser to cache it. Later, when you think there's a 62% chance some browser is user X, you just check whether it's using the cached version or not. That will at least confirm whether it's user X.
Strange coincidence: just this morning I was reading a paper [0] from 2008 about Bloom filters. Specifically, it argues that their false positive rate is higher than what is usually advertised - even on Wikipedia.
Edit: also, as noted in the document, cuckoo filters and Bloom filters are not suited to unbounded streams. Better options may include, for instance, the Stable Bloom Filter [1] and the block-decaying Bloom Filter [2].
I've only glanced at [0] but I'm skeptical on the face of it. Everything about Bloom filters falls right out of the binomial distribution, provided you follow the assumptions for optimal number of bits and hash functions. Given those,
n = 1, k = 2, m = 2
from the top of page 2 isn't even a valid configuration. Among other things, filters in which more than half the bits are set (within some tolerance) should be rejected. In this example a third of the filter's possible states are completely saturated and have to be tossed out. Otherwise you have a 100% false positive rate for that filter.
I.e. if a saturated filter were acceptable, there are 3 states: 01, 10, 11. These have, respectively, a 1 in 3, 1 in 3, and 3 in 3 false positive chance, for a total of 5 in 9 (I assume the 5 in 8 from the paper is a typo). But actually there are only 2 valid states: 01, 10. The valid inputs are still 01, 10, 11, though, so the real false positive chance is 1 in 3.
"Valid" configurations do not exist, there are only optimal/suboptimal and saturated/unsaturated ones. The classic FPR makes no assumptions about this (indeed optimality is derived from FPR minimum), and as such should apply as well to suboptimal saturated configuration, were it correct.
However, looking at theorem 3 in [0], I think the difference between the "classic" and "real" FPR is small, especially for larger n and m. At the end of [0] a study is mentioned in which the authors could not reach the classic FPR in their simulations, but attributed this result to bad pseudorandomness of the elements. The author of [0] claims that the difference may instead come from the real FPR value (which is actually very hard to compute, so it's hard to be sure).
The 5/8 probability mentioned is indeed correct. First element has four equiprobable possible outputs for its two hashes: (0,0), (0,1), (1,0) and (1,1). Same for the second element, which yields 16 different cases. Out of these 16 cases 10 will lead to a false positive.
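If anyone wants to double-check the counting, here's a quick brute-force enumeration (my own throwaway Python, not code from [0]): it inserts one element with two independent hashes into a 2-bit filter, then queries an absent element over all equally likely hash outcomes.

    from itertools import product

    total = false_positives = 0
    for insert_hashes in product(range(2), repeat=2):    # hashes of the stored element
        bits = set(insert_hashes)                        # bits set by the insertion
        for query_hashes in product(range(2), repeat=2): # hashes of the absent element
            total += 1
            if all(b in bits for b in query_hashes):     # both probed bits are set
                false_positives += 1

    print(false_positives, "/", total)                   # prints 10 / 16, i.e. 5/8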
> The 5/8 probability mentioned is indeed correct. First element has four equiprobable possible outputs for its two hashes: (0,0), (0,1), (1,0) and (1,1)
Thanks, that makes sense. I was thinking of states the filter could be in and ignored the different paths to get there. Now that I've had a chance to look at [0], it's interesting, but doesn't seem to matter for Bloom filters used in the normal way.
> "Valid" configurations do not exist
Sure, you can use any numbers you want, even negative ones. I take a more practical view, since Bloom filters make assumptions about choice (or derivation) of parameters. Given those assumptions, the formula criticized in [0] seems to work. Specifically:
> The author of [0] claims that the difference may instead come from the real FPR value (which is actually very hard to compute, so it's hard to be sure).
We know the FPR of any Bloom filter that actually exists. For a load factor l and number of hash functions k, it's l^k. The problem is that the FPR for the expected load factor of a set of parameters is lower than the expected FPR for those parameters, iff the expected load factor is greater than half. But the whole point of a Bloom filter is to have half the bits set; that's the magic that minimizes size and FPR.
In practice you won't often have an expected load factor of exactly half, due to rounding. Depending on how you handle that, and whether you use asymptotics or an exact calculation, some drift in FPR makes sense. It'll be negligible as long as you're targeting half, though. n = 1, k = 2, m = 2 is noticeably off because at 3/4 it has a much higher expected load factor.
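As a back-of-envelope illustration (my own numbers, not taken from [0]), here is the classic estimate next to the exact rate for the degenerate n = 1, k = 2, m = 2 case, where the exact rate is the average of l^k over the equally likely insertion outcomes:

    from itertools import product

    n, k, m = 1, 2, 2
    classic = (1 - (1 - 1 / m) ** (k * n)) ** k            # (3/4)^2 = 0.5625
    loads = [len(set(h)) / m for h in product(range(m), repeat=k * n)]
    exact = sum(l ** k for l in loads) / len(loads)        # 10/16   = 0.625
    print(classic, exact)

At an expected load of 3/4 the gap is visible; near a load of one half it is negligible, which lines up with the point above.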
I'm familiar with both cuckoo hashing and Bloom filters but had not until now seen a demonstration of cuckoo filters; they look very useful. But there is one stated fact that I'm finding hard to believe:
> Cuckoo filters improve on Bloom filters by supporting deletion
The page implies that this is achieved by removing the fingerprint from the hash table, but presumably one cannot guarantee that another key doesn't share the same fingerprint. This would result in a false negative for that key and violate an essential characteristic of the data structure.
Perhaps there's a nuance of the implementation I've missed.
A few types of filters support deletion (e.g. counting bloom filters and cuckoo filters). To my knowledge all of them require prior knowledge that an entry was successfully inserted. That is, if you guess at deleting an entry and are successful, then you'll ruin the filter by creating a false-negative case.
As for collisions, only a limited number of colliding entries can be inserted into a cuckoo - after which a subsequent insertion will knowingly fail. So say Alice and Bob hash out to the same values, and you've inserted 2 Alices and 1 Bob. When you delete one Alice, it doesn't matter which entry is removed - two entries still remain that hash to Alice and Bob.
Perhaps the nuance you mention is that the alternate bucket index for an entry is calculated only from the fingerprint, not from the entry data. This has the effect that even when multiple entries share fingerprints, any one can be removed. (Again, _only_ if the system requesting the delete positively knows that the entry was previously successfully inserted.) The other colliding entry will still be found on query and will generate a "maybe" response.
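To make that concrete, here's a toy sketch (my own code, not the paper's reference implementation): the bucket count is a power of two so the XOR stays in range, and kick-out relocation on a full bucket is left out.

    import hashlib

    NUM_BUCKETS = 16                  # must be a power of two for the XOR trick
    BUCKET_SIZE = 4
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def _h(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _slots(item: bytes):
        f = _h(b"fp:" + item) % 256                      # small fingerprint
        i1 = _h(item) % NUM_BUCKETS
        i2 = i1 ^ (_h(bytes([f])) % NUM_BUCKETS)         # alternate index derived only from f
        return f, i1, i2

    def insert(item: bytes) -> bool:
        f, i1, i2 = _slots(item)
        for i in (i1, i2):
            if len(buckets[i]) < BUCKET_SIZE:
                buckets[i].append(f)
                return True
        return False                                     # relocation omitted in this sketch

    def contains(item: bytes) -> bool:
        f, i1, i2 = _slots(item)
        return f in buckets[i1] or f in buckets[i2]

    def delete(item: bytes) -> bool:
        # Only safe if the caller knows the item really was inserted earlier.
        f, i1, i2 = _slots(item)
        for i in (i1, i2):
            if f in buckets[i]:
                buckets[i].remove(f)                     # removes one copy, any copy
                return True
        return False

Since the alternate index is i XOR h(f), applying it twice gets you back to the original bucket, so an entry found in either bucket can be moved or removed using the fingerprint alone.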
Having quickly read the paper, the deletion guarantees seem slightly weak.
An item's position in the table is derived from two things: a fingerprint (a constant-sized hash) and a second hash (ranging over the table). Nothing prevents two or more items from colliding on both hashes and therefore being indistinguishable from each other.
If the number of items in such a collision exceeds twice the fixed bucket size then deletion may result in false negatives.
In most practical applications there will be no useful way to bound the number of collisions. The paper shows results with bucket sizes of 4 and 8, but I don't know what the real-world probabilities of breaching these limits would be.
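For a rough sense of those probabilities (my own assumptions, not numbers from the paper): with uniform hashing, an f-bit fingerprint and m buckets, a second item is fully indistinguishable from a given item with probability about 2 / (m * 2^f), so the number of such clashes is roughly binomial. A quick tail estimate:

    from math import exp, factorial

    def p_exceed(n, m, f_bits, bucket_size):
        # Chance that more than 2 * bucket_size items collide on both the
        # fingerprint and the bucket pair of one given item (Poisson
        # approximation to the binomial, summed term by term).
        lam = (n - 1) * 2 / (m * 2 ** f_bits)
        limit = 2 * bucket_size
        return sum(exp(-lam) * lam ** i / factorial(i)
                   for i in range(limit + 1, limit + 60))

    # e.g. a million items, 2^17 buckets of 4 slots, 8-bit fingerprints
    print(p_exceed(n=1_000_000, m=2 ** 17, f_bits=8, bucket_size=4))   # vanishingly small, ~2e-17

Under these assumptions the limit is essentially never breached by distinct items; inserting many copies of the same item is the easier way to hit it.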
Hashing items that present themselves as equivalent is a very old problem with plenty of literature. It can be a pain in the butt to fix retroactively but it does tend to get fixed when there are big enough performance or correctness concerns in play.
The idea is that two elements x and y, despite having the same fingerprint, will be stored as two separate entries (typically in different buckets). So you can have the same fingerprint several times in a cuckoo filter, which you can't in a Bloom filter.
When you want to remove x, you remove one of the two fingerprints; it does not matter which, since they're the same. Then, when you query y, you are still sure to get a positive answer.
Note, however, that the deletion won't necessarily make the filter forget about x. It will just free up some space in the structure.
I was trying out the JavaScript example, but managed to get a case where there is a negative conclusion, which goes against the introduction where it says negative conclusions are always definite.
My multiset was:
a, b, c, d, e, f, g, h, j, abc, def, asd, asds, 3g46stb6vy6vsyosyvosfsdfsdsah, oooooooooooooooo
Then, trying oooooooooooooooo a second time, the Bloom filter correctly says it might be in the set. The cuckoo filter says it is not.
I assume this is just a JavaScript bug in the implementation, because the filter previously said it might be in the set; actually adding it to the set then gave a false negative when trying it again.
Yeah - pardon the crappy JavaScript and UX. Notice the colors on the cuckoo filter when you insert this sequence. When you try to insert `3g46stb6vy6vsyosyvosfsdfsdsah`, the fingerprint entries in the cuckoo turn red, indicating an unsuccessful insertion. All possible fingerprint slots in the cuckoo for that entry were already occupied, and the recursive rearranging of entries was unable to free up a slot.
This is a downside to cuckoos (and counting blooms) - even when the filter has reasonable capacity remaining, an entry can be denied due to collision with previous entries and saturation of their slots.
The UX on this demo could be better - e.g. not inserting into the bloom unless the cuckoo insert succeeds (to keep them from diverging). Also, a message indicating an unsuccessful insertion would be nice.
I wish this was something that people brought up more prominently when discussing cuckoo filters. They fail catastrophically. A Bloom filter fails gradually: as the number of inserted items increases, the false positive rate tends to one. But a cuckoo filter that fills up the buckets for a fingerprint has to reject the insert, meaning that either you start to have false negatives, or, to guarantee that you don't, you have to immediately return true for every query - which is a false positive rate of exactly one, with a sharp discontinuity in behavior from before the failed insert.
I think the salesmanship of cuckoo filters has been a little overdone, and in most applications a Bloom filter is a better choice.
Yes, cuckoo filters and counting Bloom filters can start rejecting insertions before the filter is fully loaded - cuckoos because of the inability to find an open slot, and counting blooms because of saturation of a counter. Note that this is only the case if deletion or counting support is required.
In some cases, like when there's a high likelihood of duplicate entries, this can represent a catastrophic failure mode. This is a good point, and when I get around to updating the tutorial, I'll make note of this.
Regarding the failure mode, an insertion can be knowingly rejected without mutating the filter, so the filter remains useful for subsequent insertions and all reads, retaining all its guarantees. The only loss is the rejection of the item - in our case we used this as a signal to rebuild the filter at a larger size.
"These filters can claim that a given entry is definitely not represented in a set of entries, or might be represented in the set."
unless both of the cuckoo filter's buckets for an entry are already filled, right? Which is a pretty critical failure mode not mentioned here. (What do you do then, accept the small new false-negative likelihood? Start over with another filter?)
If both buckets are filled, one of the existing occupants is evicted and reinserted at its alternate location. If there is an item already at that location, it is likewise evicted and reinserted. This process continues until an open space is found or a loop is detected. If there is a loop, the hash table needs to be rebuilt with more capacity or new hash functions.
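In code, the relocation loop looks roughly like this (my own toy sketch, not the paper's reference implementation; implementations commonly bound the loop with a maximum kick count, the paper uses 500, as a stand-in for explicit loop detection):

    import random

    NUM_BUCKETS = 16                      # power of two so the XOR stays in range
    BUCKET_SIZE = 4
    MAX_KICKS = 500
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def alt_index(i, fp):
        # The partner bucket depends only on the fingerprint, so an entry can
        # be bounced between its two buckets without knowing the original key.
        return i ^ ((fp * 0x5bd1e995) % NUM_BUCKETS)

    def insert(fp, i1):
        i2 = alt_index(i1, fp)
        for i in (i1, i2):
            if len(buckets[i]) < BUCKET_SIZE:
                buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(MAX_KICKS):
            # Swap the incoming fingerprint with a random occupant, then try
            # to re-home the evicted one in its alternate bucket.
            j = random.randrange(len(buckets[i]))
            fp, buckets[i][j] = buckets[i][j], fp
            i = alt_index(i, fp)
            if len(buckets[i]) < BUCKET_SIZE:
                buckets[i].append(fp)
                return True
        # Gave up: the table is effectively full.  One fingerprint is now left
        # without a home, so a real implementation keeps it in a spare "victim"
        # slot or treats this as the signal to rebuild at a larger size.
        return False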
There is a comment elsewhere, from the creator I think, that explains this. The 5th "a" never gets added; you'll see that the cells turn red. After you remove 4 "a"s, you're back at 0 and it is correctly reporting that there aren't any.
Well, the demo is indeed broken then; it still shows an "a" inside the multiset. Having said that, I would never use a cuckoo filter for a multiset, since it becomes a bounded multiset. Also, the bit about K-hashing for a Bloom filter is not exact: you just need a K-bit hashing function. If the goal is to support deletion, then I would go for a Counting Bloom filter.
Yeah, you're correct in saying that the demo is broken - sorry for the confusion; I stopped working on this when it was good enough for an internal talk at my company.
Re multisets, yes, cuckoo filters and counting blooms are bounded - cuckoos by entries per bucket x 2, and blooms by bits per counter. Cuckoo's counting capability is more a side effect of the design. It's really only interesting in that you get limited counting in roughly the same bit space as a non-counting bloom. I suppose you could add counting bits to a cuckoo to support higher bounds, in a similar manner to counting blooms.
I'm curious to hear more on your point on K-hashing in bloom filters - would you expand on your statement about only needing a K-bit hashing function? I'm happy to update the text on the tutorial to be more clear/exact.
> HTTP/2 frame type to allow clients to inform the server of their cache’s contents. Servers can then use this to inform their choices of what to push to clients.
http://httpwg.org/http-extensions/draft-ietf-httpbis-cache-d...
https://github.com/httpwg/http-extensions/pull/413