Great work and love the detailed breakdown. This is kind of tangential, but it reminded me of this work: https://arxiv.org/pdf/2310.12973 (Frozen Transformers in Language Models are Effective Visual Encoder Layers).
The paper puts forward an interesting hypothesis: these LLM-derived transformer layers can "refine" any set of learned tokens, even ones from a different modality. I wonder if what you're seeing here is related?