Representation-Alignment-Survey

Representation Potentials of Foundation Models for Multimodal Alignment: A Survey (EMNLP 2025)

Awesome Representation Alignment Paper Reading

This is the repository for the survey paper: Representation Potentials of Foundation Models for Multimodal Alignment: A Survey. The collection will be continuously updated, so star (🌟) the repository and stay tuned. Suggestions and comments are welcome (jianglinlu@outlook.com).

Figure 1. CKA scores between different models, computed on the MS-COCO and NoCaps datasets.
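For readers who want to reproduce scores like those in Figure 1, here is a minimal NumPy sketch of linear CKA (Kornblith et al., ICML 2019). The feature matrices and shapes below are illustrative placeholders, not the actual models or datasets used in the survey:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices X (n x d1) and Y (n x d2) extracted on the same n inputs."""
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style numerator and normalizers
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

# Toy usage: stand-ins for features from two foundation models
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))     # e.g. vision-model features
Y = rng.standard_normal((100, 32))     # e.g. language-model features
print(linear_cka(X, X))                # self-similarity is 1
print(linear_cka(X, Y))                # independent features score low
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is what makes it suitable for comparing models with different architectures and feature dimensions.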

Contents

Foundation Models [Back to Top]

  1. On the Opportunities and Risks of Foundation Models Rishi Bommasani et al, arXiv 2022. [PDF]

Vision Foundation Models

  1. Deep Residual Learning for Image Recognition Kaiming He et al, CVPR 2016. [PDF]

  2. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Alexey Dosovitskiy et al, ICLR 2021. [PDF]

  3. A ConvNet for the 2020s Zhuang Liu et al, CVPR 2022. [PDF]

  4. ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders Sanghyun Woo et al, CVPR 2023. [PDF]

  5. DINOv2: Learning Robust Visual Features without Supervision Maxime Oquab et al, TMLR 2024. [PDF]

  6. DINOv3. Oriane Siméoni et al, arXiv 2025. [PDF]

  7. Segment Anything Alexander Kirillov et al, ICCV 2023. [PDF]

Large Language Models

  1. Language Models are Few-Shot Learners Tom B. Brown et al, NeurIPS 2020. [PDF]

  2. Scaling Laws for Neural Language Models Jared Kaplan et al, arXiv 2020. [PDF]

  3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin et al, NAACL 2019. [PDF]

  4. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer Colin Raffel et al, JMLR 2020. [PDF]

  5. Emergent Abilities of Large Language Models Jason Wei et al, TMLR 2022. [PDF]

  6. Qwen Technical Report Jinze Bai et al, arXiv 2023. [PDF]

  7. The Llama 3 Herd of Models Aaron Grattafiori et al, arXiv 2024. [PDF]

Speech Foundation Models

  1. wav2vec: Unsupervised Pre-training for Speech Recognition Steffen Schneider et al, arXiv 2019. [PDF]

  2. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations Alexei Baevski et al, NeurIPS 2020. [PDF]

  3. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units Wei-Ning Hsu et al, arXiv 2021. [PDF]

  4. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing Sanyuan Chen et al, arXiv 2022. [PDF]

  5. Robust Speech Recognition via Large-Scale Weak Supervision Alec Radford et al, ICML 2023. [PDF]

  6. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation Seamless Communication et al, arXiv 2023. [PDF]

Multimodal Foundation Models

  1. Learning Transferable Visual Models From Natural Language Supervision Alec Radford et al, ICML 2021. [PDF]

  2. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision Chao Jia et al, ICML 2021. [PDF]

  3. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Junnan Li et al, ICML 2022. [PDF]

  4. CoCa: Contrastive Captioners are Image-Text Foundation Models. Jiahui Yu et al, arXiv 2022. [PDF]

  5. Flamingo: a visual language model for few-shot learning. Jean-Baptiste Alayrac et al, NeurIPS 2022. [PDF]

  6. PaLI: A Jointly-Scaled Multilingual Language-Image Model. Xi Chen et al, ICLR 2023. [PDF]

  7. GPT-4 Technical Report. OpenAI, arXiv 2023. [PDF]

  8. Gemini: A Family of Highly Capable Multimodal Models. Gemini Team, arXiv 2023. [PDF]

  9. A Survey on Multimodal Large Language Models. Shukang Yin et al, National Science Review 2024. [PDF]

Alignment Metrics [Back to Top]

  1. Supervised Feature Selection via Dependence Estimation. Le Song et al, ICML 2007. [PDF]

  2. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. Maithra Raghu et al, NIPS 2017. [PDF]

  3. Representational models: A common framework for understanding encoding, pattern-component, and representational-similarity analysis. Jörn Diedrichsen et al, PLoS Computational Biology 2017. [PDF]

  4. Insights on Representational Similarity in Neural Networks with Canonical Correlation. Ari S. Morcos et al, NeurIPS 2018. [PDF]

  5. Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation. Liwei Wang et al, NeurIPS 2018. [PDF]

  6. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Laleh Haghverdi et al, Nature Biotechnology 2018. [PDF]

  7. Similarity of Neural Network Representations Revisited. Simon Kornblith et al, ICML 2019. [PDF]

  8. On the Cross-lingual Transferability of Monolingual Representations. Mikel Artetxe et al, ACL 2020. [PDF]

  9. Towards Understanding the Instability of Network Embedding. Chenxu Wang et al, TKDE 2020. [PDF]

  10. Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth. Thao Nguyen et al, ICLR 2021. [PDF]

  11. Using distance on the Riemannian manifold to compare representations in brain and in models. Mahdiyar Shahbazi et al, NeuroImage 2021. [PDF]

  12. Reliability of CKA as a Similarity Measure in Deep Learning. MohammadReza Davari et al, ICLR 2023. [PDF]

  13. Understanding the Inner Workings of Language Models Through Representation Dissimilarity. Davis Brown et al, EMNLP 2023. [PDF]

  14. What Representational Similarity Measures Imply about Decodable Information. Sarah E. Harvey et al, arXiv 2024. [PDF]

  15. Similarity of Neural Network Models: A Survey of Functional and Representational Measures. Max Klabunde et al, arXiv 2025. [PDF]
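Besides kernel-based measures such as CKA, several works in this list compare nearest-neighbor structure across representation spaces (e.g., the mutual-nearest-neighbor matching in entry 6). As an illustrative sketch, not a reference implementation of any single paper, a mutual k-NN alignment score can be computed as the average overlap of each sample's neighbor sets in the two spaces:

```python
import numpy as np

def mutual_knn_alignment(X, Y, k=10):
    """Average overlap of each sample's k-nearest-neighbor sets under
    cosine similarity in two representation spaces over the same n inputs."""
    def knn_sets(Z):
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # unit-normalize rows
        d = -Z @ Z.T                     # negative cosine similarity as distance
        np.fill_diagonal(d, np.inf)      # exclude self-matches
        return np.argsort(d, axis=1)[:, :k]
    nx, ny = knn_sets(X), knn_sets(Y)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlap))

# Toy usage with random stand-in features
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))
Y = rng.standard_normal((100, 32))
print(mutual_knn_alignment(X, X))   # identical spaces give 1.0
print(mutual_knn_alignment(X, Y))   # independent spaces score near k / (n - 1)
```

Unlike CKA, this score depends only on local neighborhood structure, so it is insensitive to any transformation of either space that preserves nearest neighbors.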

Representation Potentials of Foundation Models for Alignment [Back to Top]

Representation Alignment in Vision

  1. Understanding Image Representations by Measuring Their Equivariance and Equivalence. Karel Lenc et al, CVPR 2015. [PDF]

  2. Convergent Learning: Do different neural networks learn the same representations? Yixuan Li et al, ICLR 2016. [PDF]

  3. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. Maithra Raghu et al, NIPS 2017. [PDF]

  4. A Spline Theory of Deep Learning. Randall Balestriero et al, ICML 2018. [PDF]

  5. Insights on Representational Similarity in Neural Networks with Canonical Correlation. Ari S. Morcos et al, NeurIPS 2018. [PDF]

  6. Similarity of Neural Network Representations Revisited. Simon Kornblith et al, ICML 2019. [PDF]

  7. Similarity and Matching of Neural Network Representations. Adrián Csiszárik et al, NeurIPS 2021. [PDF]

  8. On Linear Identifiability of Learned Representations. Geoffrey Roeder et al, ICML 2021. [PDF]

  9. Do Self-Supervised and Supervised Methods Learn Similar Visual Representations? Tom George Grigg et al, arXiv 2021. [PDF]

  10. Revisiting Model Stitching to Compare Neural Representations. Yamini Bansal et al, NeurIPS 2021. [PDF]

  11. Emerging Properties in Self-Supervised Vision Transformers. Mathilde Caron et al, ICCV 2021. [PDF]

  12. Do Vision Transformers See Like Convolutional Neural Networks? Maithra Raghu et al, NeurIPS 2021. [PDF]

  13. Relative Representations Enable Zero-Shot Latent Space Communication. Luca Moschella et al, ICLR 2023. [PDF]

  14. Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations. Shashank Shekhar et al, arXiv 2023. [PDF]

  15. Rosetta Neurons: Mining the Common Units in a Model Zoo. Amil Dravid et al, ICCV 2023. [PDF]

  16. DINOv2: Learning Robust Visual Features without Supervision. Maxime Oquab et al, TMLR 2024. [PDF]

  17. ZipIt! Merging Models from Different Tasks without Training. George Stoica et al, ICLR 2024. [PDF]

  18. The Platonic Representation Hypothesis. Minyoung Huh et al, ICML 2024. [PDF]

  19. How Do the Architecture and Optimizer Affect Representation Learning? On the Training Dynamics of Representations in Deep Neural Networks. Yuval Sharon et al, arXiv 2025. [PDF]

  20. Dual Diffusion for Unified Image Generation and Understanding. Zijie Li et al, CVPR 2025. [PDF]

  21. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. Sihyun Yu et al, ICLR 2025. [PDF]

Representation Alignment in Language

  1. Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers. Jason Phang et al, arXiv 2021. [PDF]

  2. Tracing Representation Progression: Analyzing and Enhancing Layer-Wise Similarity. Jiachen Jiang et al, ICLR 2025. [PDF]

  3. The Linear Representation Hypothesis and the Geometry of Large Language Models. Kiho Park et al, ICML 2024. [PDF]

  4. Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders. Michael Lan et al, arXiv 2024. [PDF]

  5. Truth is Universal: Robust Detection of Lies in LLMs. Lennart Bürger et al, NeurIPS 2024. [PDF]

  6. Analyzing the Generalization and Reliability of Steering Vectors. Daniel Tan et al, NeurIPS 2024. [PDF]

  7. Cross-lingual Similarity of Multilingual Representations Revisited. Maksym Del et al, arXiv 2022. [PDF]

  8. Universal Neurons in GPT2 Language Models. Wes Gurnee et al, arXiv 2024. [PDF]

  9. Activation Space Interventions Can Be Transferred Between Large Language Models. Narmeen Oozeer et al, ICML 2025. [PDF]

  10. Transferring Features Across Language Models With Model Stitching. Alan Chen et al, arXiv 2025. [PDF]

  11. Update Your Transformer to the Latest Release: Re-Basin of Task Vectors. Filippo Rinaldi et al, ICML 2025. [PDF]

  12. Shared Global and Local Geometry of Language Model Embeddings. Andrew Lee et al, COLM 2025. [PDF]

  13. Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures. Junxuan Wang et al, ICLR 2025. [PDF]

  14. Emergence of a High-Dimensional Abstraction Phase in Language Transformers. Emily Cheng et al, ICLR 2025. [PDF]

Representation Alignment in Speech

  1. Insights on Neural Representations for End-to-End Speech Recognition. Anna Ollerenshaw et al, Interspeech 2021. [PDF]

  2. Similarity Analysis of Self-Supervised Speech Representations. Yu-An Chung et al, ICASSP 2021. [PDF]

  3. Comparative Layer-Wise Analysis of Self-Supervised Speech Models. Ankita Pasad et al, ICASSP 2023. [PDF]

  4. What Do Self-Supervised Speech Models Know About Words? Ankita Pasad et al, arXiv 2024. [PDF]

  5. What Do Speech Foundation Models Not Learn About Speech? Abdul Waheed et al, arXiv 2024. [PDF]

  6. How Redundant Is the Transformer Stack in Speech Representation Models? Teresa Dorszewski et al, ICASSP 2025. [PDF]

  7. Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0. Robin Huo et al, Interspeech 2025. [PDF]

Representation Alignment Across Modalities [Back to Top]

  1. Linearly Mapping from Image to Text Space. Jack Merullo et al, ICLR 2023. [PDF]

  2. Grounding Language Models to Images for Multimodal Inputs and Outputs. Jing Yu Koh et al, ICML 2023. [PDF]

  3. Do Vision and Language Encoders Represent the World Similarly? Mayug Maniparambil et al, CVPR 2024. [PDF]

  4. What Do Language Models Hear? Probing for Auditory Representations in Language Models. Jerry Ngo et al, ACL 2024. [PDF]

  5. How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations. Hyunji Lee et al, arXiv 2024. [PDF]

  6. The Platonic Representation Hypothesis. Minyoung Huh et al, ICML 2024. [PDF]

  7. Assessing and Learning Alignment of Unimodal Vision and Language Models. Le Zhang et al, CVPR 2025. [PDF]

  8. The Indra Representation Hypothesis for Multimodal Alignment. Jianglin Lu et al, NeurIPS 2025. [PDF]

Representation Alignment with Neuroscience [Back to Top]

  1. Evaluation of the Hierarchical Correspondence between the Human Brain and Artificial Neural Networks: A Review. Trung Quang Pham et al, Biology 2023. [PDF]

  2. Do Self-Supervised Speech and Language Models Extract Similar Representations as Human Brain? Peili Chen et al, ICASSP 2024. [PDF]

  3. Privileged representational axes in biological and artificial neural networks. Meenakshi Khosla et al, bioRxiv 2024. [PDF]

  4. Universality of representation in biological and artificial neural networks. Eghbal Hosseini et al, bioRxiv 2024. [PDF]

  5. High-level visual representations in the human brain are aligned with large language models. Adrien Doerig et al, Nature Machine Intelligence 2025. [PDF]

  6. Disentangling the Factors of Convergence between Brains and Computer Vision Models. Joséphine Raugel et al, arXiv 2025. [PDF]

  7. Brain-Model Evaluations Need the NeuroAI Turing Test. Jenelle Feather et al, arXiv 2025. [PDF]

Factors Driving Representation Potential for Alignment

  1. Scaling Laws for Neural Language Models. Jared Kaplan et al, arXiv 2020. [PDF]

  2. Inductive Biases and Variable Creation in Self-Attention Mechanisms. Benjamin L. Edelman et al, ICML 2022. [PDF]

  3. Multitask Prompted Training Enables Zero-Shot Task Generalization. Victor Sanh et al, ICLR 2022. [PDF]

  4. Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions. Satwik Bhattamishra et al, ACL 2023. [PDF]

  5. Large language models converge toward human-like concept organization. Mathias Lykke Gammelgaard et al, arXiv 2023. [PDF]

  6. Multilingual Diversity Improves Vision-Language Representations. Thao Nguyen et al, NeurIPS 2024. [PDF]

  7. Scaling Instruction-Finetuned Language Models. Hyung Won Chung et al, JMLR 2024. [PDF]

  8. Instruction Diversity Drives Generalization To Unseen Tasks. Dylan Zhang et al, arXiv 2024. [PDF]

  9. Objective drives the consistency of representational similarity across datasets. Laure Ciernik et al, ICML 2025. [PDF]

  10. Relational reasoning and inductive bias in transformers trained on a transitive inference task. Jesse Geerts et al, arXiv 2025. [PDF]

📝 Citation

If you find our survey useful, please consider citing:

@inproceedings{lu2025representation,
  title={Representation Potentials of Foundation Models for Multimodal Alignment: A Survey},
  author={Lu, Jianglin and Wang, Hailing and Xu, Yi and Wang, Yizhou and Yang, Kuo and Fu, Yun},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025}
}

@inproceedings{Jianglin2025,
  title={The Indra Representation Hypothesis for Multimodal Alignment},
  author={Lu, Jianglin and Wang, Hailing and Yang, Kuo and Zhang, Yitian and Jenni, Simon and Fu, Yun},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}