Very cool work! We spend a lot of time thinking about "robust representations" in the video space.

Are there any alternatives to JEPA right now for speech encoding that couples meaning and sound? Curious to learn more about the journey from the problem space to the solution space (JEPA).

For context, in our domain video-JEPA hasn't proved as helpful as one would have hoped. It's decent at high-level semantics (e.g. action detection) but doesn't capture enough "detail" (intentionally so) to be used as a powerful enough encoder (or regularizer). That might just be because the research models are too small or haven't yet been trained on sufficiently large volumes of data.