Great work and love the detailed breakdown. This is kind of tangential, but it reminded me of this work: https://arxiv.org/pdf/2310.12973 (Frozen Transformers in Language Models are Effective Visual Encoder Layers).
The paper puts forward an interesting hypothesis: these LLM-derived transformer layers can "refine" any set of learned tokens, even ones from a different modality. I wonder if what you're seeing here is related?