> I'm not sure how much space it would take up to store a list of urls that have been submitted to HN
I was recently wondering how much space this would take up myself. After a lot of searching, I found this reddit post, which links to an archive of Hacker News. It contains data from late 2006 until mid-2018 and totals just over 2 gigabytes. The dumps contain all comments, job postings, polls, poll options, and stories.
I did some super quick analysis of the 2018-05 archive (the latest this source provides). It contains 237,646 total items, and only 32,473 of those are stories. That's only ~14%. Assuming stories have made up the same share of the data for the entire dataset, that's only about 280 megabytes for the whole 2006 to 2018 set.
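The back-of-envelope arithmetic above can be checked in a few lines (the item counts are from the 2018-05 archive; the 2 GiB figure is the rough total quoted for the full dump):

```python
# Estimate the stories-only share of the ~2 GiB HN archive.
total_items = 237_646
stories = 32_473
story_fraction = stories / total_items  # ≈ 0.137

archive_mib = 2 * 1024  # "just over 2 gigabytes", taken as 2 GiB in MiB
stories_only_mib = archive_mib * story_fraction
print(f"{story_fraction:.1%}")        # ~13.7%
print(f"{stories_only_mib:.0f} MiB")  # ~280 MiB
```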
That data can be shrunk further by removing extraneous information from each story. Mirroring the HN API, each item has the following fields: author username, id, date retrieved, score, time posted, title, type, url, whether it's dead, how many descendants it has, and which items are its kids. I didn't attempt to reduce the data to only contain links, but I imagine doing so would significantly reduce the size.
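A minimal sketch of that reduction, assuming each archived item is a dict shaped like the HN API's /v0/item responses (the sample items here are made up):

```python
# Keep only the url field of items that are stories and actually link out.
def extract_urls(items):
    return [item["url"]
            for item in items
            if item.get("type") == "story" and item.get("url")]

# Hypothetical sample items mirroring the HN API's field names.
items = [
    {"id": 1, "type": "story", "by": "alice", "url": "http://example.com/a"},
    {"id": 2, "type": "comment", "by": "bob"},
    {"id": 3, "type": "story", "by": "carol", "title": "Ask HN: ..."},  # text post, no url
]
print(extract_urls(items))  # ['http://example.com/a']
```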
Once you've reduced the data to a list of urls, I imagine it can be shrunk even more by removing duplicate links.
Depending on the average size of the urls, it's not unreasonable to think that hashing each url would result in a smaller set of data.
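For example, truncating a SHA-256 digest to 8 bytes gives a fixed-size stand-in for each url; this is a win whenever urls average more than 8 bytes (the example urls are hypothetical):

```python
import hashlib

# Replace each url with the first 8 bytes of its SHA-256 digest.
def url_hash(url: str) -> bytes:
    return hashlib.sha256(url.encode("utf-8")).digest()[:8]

urls = [
    "https://news.ycombinator.com/",
    "https://example.com/some/fairly/long/article-path",
]
hashed = {url_hash(u) for u in urls}  # a set also drops duplicates
print(sum(len(u) for u in urls), "bytes of urls ->", 8 * len(hashed), "bytes of hashes")
```

Storing the hashes in a set deduplicates for free, since identical urls hash to identical digests.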
On top of that, there's wonderful text compression, but I don't have numbers on how much that would reduce the size of the data.
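A toy demonstration of that last point using Python's stdlib xz bindings; these synthetic urls share long prefixes, so a real url list would compress somewhat less well:

```python
import lzma

# xz-compress a newline-separated list of (synthetic, repetitive) urls.
urls = "\n".join(f"https://example.com/post/{i}" for i in range(10_000))
raw = urls.encode("utf-8")
packed = lzma.compress(raw)
print(len(raw), "->", len(packed), "bytes")
```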
I was curious so I downloaded a list of id-url pairs from here [0]. It's CSV-formatted and contains 1_960_207 entries (last updated 22 Feb 2019). It is 134MiB uncompressed and 35MiB compressed using xz, so definitely storable in a web extension.
Since IDs are integers smaller than 10_000_000, they can be stored in 3 bytes, and a 64-bit hash function is enough (using this approximation [1] with k=2_000_000 and N=2^64 gives p=1.08e-7), which comes to 22MB for 2 million entries. Stats on duplicates would be needed to know the impact of bundling identical hashes together. Definitely doable!
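Both numbers check out; here's the birthday-bound approximation and the 11-bytes-per-entry packing spelled out (`pack_entry` is a hypothetical helper, not from the dataset):

```python
import math

# Birthday bound: probability of any collision among k random 64-bit
# hashes, p ≈ 1 - exp(-k(k-1) / (2N)).
k, N = 2_000_000, 2**64
p = 1 - math.exp(-k * (k - 1) / (2 * N))
print(f"p ≈ {p:.2e}")  # ≈ 1.08e-07

# One record = 3-byte ID + 8-byte hash = 11 bytes.
def pack_entry(item_id: int, h: bytes) -> bytes:
    assert item_id < 2**24 and len(h) == 8
    return item_id.to_bytes(3, "big") + h

print(11 * k, "bytes for 2 million entries")  # 22,000,000 ≈ 22 MB
```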
Keeping it up to date would be harder; having a server query the API to collect and distribute day-by-day deltas to every extension user is probably the best option.
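A sketch of what the server's update loop could look like. The Firebase endpoints are the real official HN API; the sequential-ID diffing relies on HN assigning item IDs in order, and the function names here are made up:

```python
# Server-side update sketch against the official HN API.
API = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id: int) -> str:
    # The server would GET f"{API}/maxitem.json" for the newest ID,
    # then fetch each new item from this URL.
    return f"{API}/item/{item_id}.json"

def new_item_ids(last_seen: int, max_item: int) -> range:
    # Item IDs are sequential, so a daily delta is just an ID range.
    return range(last_seen + 1, max_item + 1)

print(len(new_item_ids(19_000_000, 19_000_500)), "new items to fetch")
```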