Shazam-like fingerprinting for text. The complete LLM outputs wouldn't need to b...

chaxor · on June 2, 2023

This has been done for a very long time. Blockchains are definitely not required (this isn't just the usual HN hate for blockchain; it genuinely doesn't make sense here). Fingerprinting by shingling (overlapping windows of text) with some normalization steps is pretty typical in plagiarism and similarity detection. A big database of docid-shingleid pairs, along with weights based on their frequency, is often a very simple and fast way to do this analysis. The hard part is getting OpenAI/Anthropic/etc. to do it on their data and provide a service for it, and there are obviously a lot of unwanted consequences, specifically the storing of all user data (even if the shingles and docids are stored as hashes, it's still information).
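
To make that concrete, here's a rough sketch of the idea in Python, not anyone's production system. Everything named here (`normalize`, `shingle_ids`, `ShingleIndex`, the 3-word window, the truncated SHA-1 ids, the log-based rarity weight) is my own assumption for illustration; real systems pick their own normalization, window size, and weighting.

```python
import hashlib
import math
import re
from collections import defaultdict


def normalize(text):
    # Assumed normalization: lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def shingle_ids(text, k=3):
    # Hash every overlapping k-word window into a short shingle id.
    words = normalize(text)
    ids = set()
    for i in range(max(len(words) - k + 1, 0)):
        window = " ".join(words[i:i + k])
        ids.add(hashlib.sha1(window.encode("utf-8")).hexdigest()[:16])
    return ids


class ShingleIndex:
    """In-memory stand-in for the docid-shingleid table (hypothetical)."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)  # shingle id -> set of doc ids
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for sid in shingle_ids(text, self.k):
            self.postings[sid].add(doc_id)

    def query(self, text):
        # Score each stored doc by its shared shingles, with rarer
        # shingles counting for more (an assumed weighting scheme).
        scores = defaultdict(float)
        for sid in shingle_ids(text, self.k):
            docs = self.postings.get(sid)
            if not docs:
                continue
            weight = math.log(1 + self.doc_count / len(docs))
            for doc_id in docs:
                scores[doc_id] += weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = ShingleIndex(k=3)
index.add("a", "The quick brown fox jumps over the lazy dog.")
index.add("b", "Completely unrelated sentence about databases and hashing.")
matches = index.query("the QUICK brown fox, jumps!")
```

Here `matches` ranks doc "a" first and doesn't include "b" at all, since the query shares no 3-word window with it. Note the privacy point still applies: even though only hashes are kept, the postings table tells you which documents share which phrases.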