I don't think they are generations, but rather samples from The Pile that are semantically close to the input.
Actually, as far as I can tell, the RETRO architecture itself isn't trained in this article. It focuses more on how to build the retrieval system: a fast KNN index over all of The Pile.
This is great for speed, and maybe the window size could also be increased since the model is so small, but what about the quality of the generated text? Does quality drop with a 20x smaller model?
How many chunks do you retrieve? The paper shows best results at k=1 and then at k>50.
I saw the graph. Bits-per-byte isn't clear to me (for example, lower loss doesn't always mean better accuracy when you train a classifier). What I need to know is whether the generated text has the same "literary" qualities, and whether it can compete with GPT-3 in few-shot mode on as many tasks.
In other words, can we rely on it instead of GPT-3 in realistic scenarios, or is it only good at achieving low BpB?
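For what it's worth, bits-per-byte is just the model's cross-entropy loss re-normalized so that models with different tokenizers are comparable: convert the average per-token loss from nats to bits, then divide by the average number of UTF-8 bytes per token. A minimal sketch (the specific numbers below are made up for illustration, not taken from the paper):

```python
import math

def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert average cross-entropy (nats/token) to bits per UTF-8 byte."""
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token / bytes_per_token

# Hypothetical numbers: 2.0 nats/token with a tokenizer averaging 4 bytes/token
print(round(bits_per_byte(2.0, 4.0), 4))  # ≈ 0.7213
```

So it measures compression of raw text, which is exactly why it doesn't directly answer questions about "literary" quality or few-shot task performance.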
I imagine that GPT-3's ability to complete prompts it hasn't seen will make it quite a bit more versatile than using the Pile. It's the interpolation between the data points that makes deep learning so powerful.
This obviously still has plenty of use cases: you could pull a batch of similar text to calculate statistics on, or perhaps augment existing data for training something else.
One realistic use case would be to simply present a search engine interface enabling you to find "interesting" text snippets alongside metadata like book author/title matching your description, perhaps for fiction enthusiasts or what have you.
I guess an interesting way to translate this technique to text-to-image would be to retrieve an image from a database that matches the text query (via CLIP) and feed that plus noise into a diffusion model that only does a few denoising iterations (and maybe no CLIP guidance). That would be a lot faster than from-scratch diffusion.
(Another way could be to redo the architecture to include an “inspired by this image” input, which is queried from an image server at inference time.) Anyone have other ideas?
This was one of the motivations for `clip-retrieval`, a faiss index over the CLIP embeddings (CLIP ViT-L/14 to be precise) for all the captions/images in the LAION5B-Aesthetic dataset.
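The core of such an index is just nearest-neighbor search in the shared CLIP embedding space: embed the text query, compare it against the precomputed image embeddings, and return the top hits. Here's a minimal numpy sketch of the lookup; at LAION scale, faiss replaces this brute-force dot product with an approximate index, and the embeddings below are random stand-ins rather than real CLIP outputs:

```python
import numpy as np

def top_k(query_emb: np.ndarray, image_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most similar rows by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    ims = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = ims @ q                 # cosine similarity per image
    return np.argsort(-sims)[:k]   # best-first

# Stand-in data: 1000 fake 768-d "CLIP" embeddings and a query near row 42
rng = np.random.default_rng(0)
image_embs = rng.standard_normal((1000, 768)).astype(np.float32)
query = image_embs[42] + 0.01 * rng.standard_normal(768).astype(np.float32)
print(top_k(query, image_embs)[0])  # -> 42, the near-duplicate
```

The search itself has no semantics; everything interesting happens in the embedding model, which is why literal-description queries can miss images the model never associated with that phrasing.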
I tried "a man with shopping bags stopping a tank" and was hoping to get the Tiananmen Tank Man, but I'm having no luck with variations either.
EDIT: it does contain a blurry picture of the tank man and some LEGO re-enactments when I query "tiananmen tank man", but I was hoping it would more intelligently deduce the picture from the description.
"Please don't complain about tangential annoyances—things like article or website formats, name collisions, or back-button breakage. They're too common to be interesting."
OK, I didn't know this was against the rules; I see it on here often. Small quibble: difficulty actually reading the submission doesn't seem completely tangential.
On macOS (display: 15.4-inch, 2880 × 1800), it's really difficult to read. I set the font to `400 1.2rem/1.5 "Fira Sans", sans-serif` and the color to #111 in dev tools, which is way better for readability.
(Sidenote: is Fira Sans a font installed by default on Linux systems? I'm on macOS and don't have it, and I don't see a font embed anywhere in your source code. That might be the issue: falling back to 'sans-serif' at weight 200 is way too faint.)