> I'm not sure how much space it would take up to store a list of urls that have been submitted to HN
I was recently wondering how much space this would take up myself. After a lot of searching, I found this reddit post, which links to an archive of Hacker News. It contains data from late 2006 until mid-2018 and totals just over 2 gigabytes. The dumps contain all comments, job postings, polls, poll options, and stories.
I did some super quick analysis of the 2018-05 archive (the latest this source provides). It contains 237,646 total items, and only 32,473 of those are stories. That's only ~14%. Assuming stories have made up the same share of the data for the entire dataset, that's only about 280 megabytes for the whole 2006 to 2018 set.
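The back-of-envelope arithmetic above can be checked in a few lines (the item counts are from the 2018-05 archive; the 2 GiB figure is the rough total quoted for the full dump):

```python
# Estimate the stories-only share of the ~2 GiB HN archive.
total_items = 237_646
stories = 32_473
story_fraction = stories / total_items  # ≈ 0.137

archive_mib = 2 * 1024  # "just over 2 gigabytes", taken as 2 GiB in MiB
stories_only_mib = archive_mib * story_fraction
print(f"{story_fraction:.1%}")        # ~13.7%
print(f"{stories_only_mib:.0f} MiB")  # ~280 MiB
```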
That data can be shrunk further by removing extraneous information from each story. Mirroring the HN API, each item has the following fields: author username, id, date retrieved, score, time posted, title, type, url, whether it's dead, how many descendants it has, and which items are its kids. I didn't attempt to reduce the data to only contain links, but I imagine doing so would significantly reduce the size.
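A minimal sketch of that reduction, assuming each archived item is a dict shaped like the HN API's /v0/item responses (the sample items here are made up):

```python
# Keep only the url field of items that are stories and actually link out.
def extract_urls(items):
    return [item["url"]
            for item in items
            if item.get("type") == "story" and item.get("url")]

# Hypothetical sample items mirroring the HN API's field names.
items = [
    {"id": 1, "type": "story", "by": "alice", "url": "http://example.com/a"},
    {"id": 2, "type": "comment", "by": "bob"},
    {"id": 3, "type": "story", "by": "carol", "title": "Ask HN: ..."},  # text post, no url
]
print(extract_urls(items))  # ['http://example.com/a']
```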
Once you've reduced the data to a list of urls, I imagine it can be shrunk even more by removing duplicate links.
Depending on the average size of the urls, it's not unreasonable to think that hashing each url would result in a smaller set of data.
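For example, truncating a SHA-256 digest to 8 bytes gives a fixed-size stand-in for each url; this is a win whenever urls average more than 8 bytes (the example urls are hypothetical):

```python
import hashlib

# Replace each url with the first 8 bytes of its SHA-256 digest.
def url_hash(url: str) -> bytes:
    return hashlib.sha256(url.encode("utf-8")).digest()[:8]

urls = [
    "https://news.ycombinator.com/",
    "https://example.com/some/fairly/long/article-path",
]
hashed = {url_hash(u) for u in urls}  # a set also drops duplicates
print(sum(len(u) for u in urls), "bytes of urls ->", 8 * len(hashed), "bytes of hashes")
```

Storing the hashes in a set deduplicates for free, since identical urls hash to identical digests.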
On top of that, there's wonderful text compression, but I don't have numbers on how much that would reduce the size of the data.
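A toy demonstration of that last point using Python's stdlib xz bindings; these synthetic urls share long prefixes, so a real url list would compress somewhat less well:

```python
import lzma

# xz-compress a newline-separated list of (synthetic, repetitive) urls.
urls = "\n".join(f"https://example.com/post/{i}" for i in range(10_000))
raw = urls.encode("utf-8")
packed = lzma.compress(raw)
print(len(raw), "->", len(packed), "bytes")
```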
I was curious so I downloaded a list of id-url pairs from here [0]. It's CSV-formatted and contains 1_960_207 entries (last updated 22 Feb 2019). It is 134MiB uncompressed and 35MiB compressed using xz, so definitely storable in a web extension.
Since IDs are integers smaller than 10_000_000, they can be stored in 3 bytes, and a 64-bit hash function is enough (using this approximation [1] with k=2_000_000 and N=2^64 gives p=1.08e-7), which comes to 22MB for 2 million entries. Stats on duplicates would be needed to know the impact of bundling identical hashes together. Definitely doable!
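Both numbers check out; here's the birthday-bound approximation and the 11-bytes-per-entry packing spelled out (`pack_entry` is a hypothetical helper, not from the dataset):

```python
import math

# Birthday bound: probability of any collision among k random 64-bit
# hashes, p ≈ 1 - exp(-k(k-1) / (2N)).
k, N = 2_000_000, 2**64
p = 1 - math.exp(-k * (k - 1) / (2 * N))
print(f"p ≈ {p:.2e}")  # ≈ 1.08e-07

# One record = 3-byte ID + 8-byte hash = 11 bytes.
def pack_entry(item_id: int, h: bytes) -> bytes:
    assert item_id < 2**24 and len(h) == 8
    return item_id.to_bytes(3, "big") + h

print(11 * k, "bytes for 2 million entries")  # 22,000,000 ≈ 22 MB
```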
Keeping it up to date would be harder; having a server query the API to collect and distribute day-by-day deltas to every extension user is probably the best option.
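A sketch of what the server's update loop could look like. The Firebase endpoints are the real official HN API; the sequential-ID diffing relies on HN assigning item IDs in order, and the function names here are made up:

```python
# Server-side update sketch against the official HN API.
API = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id: int) -> str:
    # The server would GET f"{API}/maxitem.json" for the newest ID,
    # then fetch each new item from this URL.
    return f"{API}/item/{item_id}.json"

def new_item_ids(last_seen: int, max_item: int) -> range:
    # Item IDs are sequential, so a daily delta is just an ID range.
    return range(last_seen + 1, max_item + 1)

print(len(new_item_ids(19_000_000, 19_000_500)), "new items to fetch")
```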