This is the repository for the survey paper: Representation Potentials of Foundation Models for Multimodal Alignment: A Survey. The collection will be continuously updated, so star (🌟) & stay tuned. Any suggestions and comments are welcome (jianglinlu@outlook.com).
Figure 1. CKA scores between different models computed on the MS-COCO and NoCaps datasets.
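For readers who want to compute scores like those in Figure 1, below is a minimal sketch of linear CKA between two models' feature matrices, following the HSIC-based formulation of Kornblith et al. (ICML 2019). The feature file names in the usage comment are hypothetical placeholders, not the exact extraction pipeline used for the figure.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n_samples, d_x) features from model A
    Y: (n_samples, d_y) features from model B (same samples, same order)
    Returns a scalar similarity in [0, 1].
    """
    # Center each feature dimension over the samples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # Linear CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    hsic_xy = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic_xy / (norm_x * norm_y)

# Hypothetical usage with pre-extracted features on the same images:
# feats_a = np.load("vit_mscoco_features.npy")
# feats_b = np.load("bert_mscoco_caption_features.npy")
# print(linear_cka(feats_a, feats_b))
```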
Deep Residual Learning for Image Recognition. Kaiming He et al, CVPR 2016. [PDF]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy et al, ICLR 2021. [PDF]
A ConvNet for the 2020s. Zhuang Liu et al, CVPR 2022. [PDF]
ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders. Sanghyun Woo et al, CVPR 2023. [PDF]
DINOv2: Learning Robust Visual Features without Supervision. Maxime Oquab et al, TMLR 2024. [PDF]
DINOv3. Oriane Siméoni et al, arXiv 2025. [PDF]
Segment Anything. Alexander Kirillov et al, ICCV 2023. [PDF]
Language Models are Few-Shot Learners. Tom B. Brown et al, NeurIPS 2020. [PDF]
Scaling Laws for Neural Language Models. Jared Kaplan et al, arXiv 2020. [PDF]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin et al, NAACL 2019. [PDF]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Colin Raffel et al, JMLR 2020. [PDF]
Emergent Abilities of Large Language Models. Jason Wei et al, TMLR 2022. [PDF]
Qwen Technical Report. Jinze Bai et al, arXiv 2023. [PDF]
The Llama 3 Herd of Models. Aaron Grattafiori et al, arXiv 2024. [PDF]
wav2vec: Unsupervised Pre-training for Speech Recognition. Steffen Schneider et al, arXiv 2019. [PDF]
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Alexei Baevski et al, NeurIPS 2020. [PDF]
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. Wei-Ning Hsu et al, arXiv 2021. [PDF]
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. Sanyuan Chen et al, arXiv 2022. [PDF]
Robust Speech Recognition via Large-Scale Weak Supervision. Alec Radford et al, ICML 2023. [PDF]
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. Seamless Communication et al, arXiv 2023. [PDF]
Learning Transferable Visual Models From Natural Language Supervision. Alec Radford et al, ICML 2021. [PDF]
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. Chao Jia et al, ICML 2021. [PDF]
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Junnan Li et al, ICML 2022. [PDF]
CoCa: Contrastive Captioners are Image-Text Foundation Models. Jiahui Yu et al, arXiv 2022. [PDF]
Flamingo: a Visual Language Model for Few-Shot Learning. Jean-Baptiste Alayrac et al, NeurIPS 2022. [PDF]
PaLI: A Jointly-Scaled Multilingual Language-Image Model. Xi Chen et al, ICLR 2023. [PDF]
GPT-4 Technical Report. OpenAI, arXiv 2023. [PDF]
Gemini: A Family of Highly Capable Multimodal Models. Gemini Team, arXiv 2023. [PDF]
A Survey on Multimodal Large Language Models. Shukang Yin et al, National Science Review 2024. [PDF]
Supervised Feature Selection via Dependence Estimation. Le Song et al, ICML 2007. [PDF]
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. Maithra Raghu et al, NIPS 2017. [PDF]
Representational models: A common framework for understanding encoding, pattern-component, and representational-similarity analysis. Jörn Diedrichsen et al, PLoS Computational Biology 2017. [PDF]
Insights on Representational Similarity in Neural Networks with Canonical Correlation. Ari S. Morcos et al, NeurIPS 2018. [PDF]
Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation. Liwei Wang et al, NeurIPS 2018. [PDF]
Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Laleh Haghverdi et al, Nature Biotechnology 2018. [PDF]
Similarity of Neural Network Representations Revisited. Simon Kornblith et al, ICML 2019. [PDF]
On the Cross-lingual Transferability of Monolingual Representations. Mikel Artetxe et al, ACL 2020. [PDF]
Towards Understanding the Instability of Network Embedding. Chenxu Wang et al, TKDE 2020. [PDF]
Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth. Thao Nguyen et al, ICLR 2021. [PDF]
Using distance on the Riemannian manifold to compare representations in brain and in models. Mahdiyar Shahbazi et al, NeuroImage 2021. [PDF]
Reliability of CKA as a Similarity Measure in Deep Learning. MohammadReza Davari et al, ICLR 2023. [PDF]
Understanding the Inner Workings of Language Models Through Representation Dissimilarity. Davis Brown et al, EMNLP 2023. [PDF]
What Representational Similarity Measures Imply about Decodable Information. Sarah E. Harvey et al, arXiv 2024. [PDF]
Similarity of Neural Network Models: A Survey of Functional and Representational Measures. Max Klabunde et al, arXiv 2025. [PDF]
Understanding Image Representations by Measuring Their Equivariance and Equivalence. Karel Lenc et al, CVPR 2015. [PDF]
Convergent Learning: Do different neural networks learn the same representations? Yixuan Li et al, ICLR 2016. [PDF]
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. Maithra Raghu et al, NIPS 2017. [PDF]
A Spline Theory of Deep Learning. Randall Balestriero et al, ICML 2018. [PDF]
Insights on Representational Similarity in Neural Networks with Canonical Correlation. Ari S. Morcos et al, NeurIPS 2018. [PDF]
Similarity of Neural Network Representations Revisited. Simon Kornblith et al, ICML 2019. [PDF]
Similarity and Matching of Neural Network Representations. Adrián Csiszárik et al, NeurIPS 2021. [PDF]
On Linear Identifiability of Learned Representations. Geoffrey Roeder et al, ICML 2021. [PDF]
Do Self-Supervised and Supervised Methods Learn Similar Visual Representations? Tom George Grigg et al, arXiv 2021. [PDF]
Revisiting Model Stitching to Compare Neural Representations. Yamini Bansal et al, NeurIPS 2021. [PDF]
Emerging Properties in Self-Supervised Vision Transformers. Mathilde Caron et al, ICCV 2021. [PDF]
Do Vision Transformers See Like Convolutional Neural Networks? Maithra Raghu et al, NeurIPS 2021. [PDF]
Relative Representations Enable Zero-Shot Latent Space Communication. Luca Moschella et al, ICLR 2023. [PDF]
Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations. Shashank Shekhar et al, arXiv 2023. [PDF]
Rosetta Neurons: Mining the Common Units in a Model Zoo. Amil Dravid et al, ICCV 2023. [PDF]
DINOv2: Learning Robust Visual Features without Supervision. Maxime Oquab et al, TMLR 2024. [PDF]
ZipIt! Merging Models from Different Tasks without Training. George Stoica et al, ICLR 2024. [PDF]
The Platonic Representation Hypothesis. Minyoung Huh et al, ICML 2024. [PDF]
How Do the Architecture and Optimizer Affect Representation Learning? On the Training Dynamics of Representations in Deep Neural Networks. Yuval Sharon et al, arXiv 2025. [PDF]
Dual Diffusion for Unified Image Generation and Understanding. Zijie Li et al, CVPR 2025. [PDF]
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. Sihyun Yu et al, ICLR 2025. [PDF]
Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers. Jason Phang et al, arXiv 2021. [PDF]
Tracing Representation Progression: Analyzing and Enhancing Layer-Wise Similarity. Jiachen Jiang et al, ICLR 2025. [PDF]
The Linear Representation Hypothesis and the Geometry of Large Language Models. Kiho Park et al, ICML 2024. [PDF]
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders. Michael Lan et al, arXiv 2024. [PDF]
Truth is Universal: Robust Detection of Lies in LLMs. Lennart Bürger et al, NeurIPS 2024. [PDF]
Analyzing the Generalization and Reliability of Steering Vectors. Daniel Tan et al, NeurIPS 2024. [PDF]
Cross-lingual Similarity of Multilingual Representations Revisited. Maksym Del et al, arXiv 2022. [PDF]
Universal Neurons in GPT2 Language Models. Wes Gurnee et al, arXiv 2024. [PDF]
Activation Space Interventions Can Be Transferred Between Large Language Models. Narmeen Oozeer et al, ICML 2025. [PDF]
Transferring Features Across Language Models With Model Stitching. Alan Chen et al, arXiv 2025. [PDF]
Update Your Transformer to the Latest Release: Re-Basin of Task Vectors. Filippo Rinaldi et al, ICML 2025. [PDF]
Shared Global and Local Geometry of Language Model Embeddings. Andrew Lee et al, COLM 2025. [PDF]
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures. Junxuan Wang et al, ICLR 2025. [PDF]
Emergence of a High-Dimensional Abstraction Phase in Language Transformers. Emily Cheng et al, ICLR 2025. [PDF]
Insights on Neural Representations for End-to-End Speech Recognition. Anna Ollerenshaw et al, Interspeech 2021. [PDF]
Similarity Analysis of Self-Supervised Speech Representations. Yu-An Chung et al, ICASSP 2021. [PDF]
Comparative Layer-Wise Analysis of Self-Supervised Speech Models. Ankita Pasad et al, ICASSP 2023. [PDF]
What Do Self-Supervised Speech Models Know About Words? Ankita Pasad et al, arXiv 2024. [PDF]
What Do Speech Foundation Models Not Learn About Speech? Abdul Waheed et al, arXiv 2024. [PDF]
How Redundant Is the Transformer Stack in Speech Representation Models? Teresa Dorszewski et al, ICASSP 2025. [PDF]
Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0. Robin Huo et al, Interspeech 2025. [PDF]
Linearly Mapping from Image to Text Space. Jack Merullo et al, ICLR 2023. [PDF]
Grounding Language Models to Images for Multimodal Inputs and Outputs. Jing Yu Koh et al, ICML 2023. [PDF]
Do Vision and Language Encoders Represent the World Similarly? Mayug Maniparambil et al, CVPR 2024. [PDF]
What Do Language Models Hear? Probing for Auditory Representations in Language Models. Jerry Ngo et al, ACL 2024. [PDF]
How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations. Hyunji Lee et al, arXiv 2024. [PDF]
The Platonic Representation Hypothesis. Minyoung Huh et al, ICML 2024. [PDF]
Assessing and Learning Alignment of Unimodal Vision and Language Models. Le Zhang et al, CVPR 2025. [PDF]
The Indra Representation Hypothesis for Multimodal Alignment. Jianglin Lu et al, NeurIPS 2025. [PDF]
Evaluation of the Hierarchical Correspondence between the Human Brain and Artificial Neural Networks: A Review. Trung Quang Pham et al, Biology 2023. [PDF]
Do Self-Supervised Speech and Language Models Extract Similar Representations as Human Brain? Peili Chen et al, ICASSP 2024. [PDF]
Privileged representational axes in biological and artificial neural networks. Meenakshi Khosla et al, bioRxiv 2024. [PDF]
Universality of representation in biological and artificial neural networks. Eghbal Hosseini et al, bioRxiv 2024. [PDF]
High-level visual representations in the human brain are aligned with large language models. Adrien Doerig et al, Nature Machine Intelligence 2025. [PDF]
Disentangling the Factors of Convergence between Brains and Computer Vision Models. Joséphine Raugel et al, arXiv 2025. [PDF]
Brain-Model Evaluations Need the NeuroAI Turing Test. Jenelle Feather et al, arXiv 2025. [PDF]
Scaling Laws for Neural Language Models. Jared Kaplan et al, arXiv 2020. [PDF]
Inductive Biases and Variable Creation in Self-Attention Mechanisms. Benjamin L. Edelman et al, ICML 2022. [PDF]
Multitask Prompted Training Enables Zero-Shot Task Generalization. Victor Sanh et al, ICLR 2022. [PDF]
Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions. Satwik Bhattamishra et al, ACL 2023. [PDF]
Large language models converge toward human-like concept organization. Mathias Lykke Gammelgaard et al, arXiv 2023. [PDF]
Multilingual Diversity Improves Vision-Language Representations. Thao Nguyen et al, NeurIPS 2024. [PDF]
Scaling Instruction-Finetuned Language Models. Hyung Won Chung et al, JMLR 2024. [PDF]
Instruction Diversity Drives Generalization To Unseen Tasks. Dylan Zhang et al, arXiv 2024. [PDF]
Objective drives the consistency of representational similarity across datasets. Laure Ciernik et al, ICML 2025. [PDF]
Relational reasoning and inductive bias in transformers trained on a transitive inference task. Jesse Geerts et al, arXiv 2025. [PDF]
Updating...
If you find our survey useful, please consider citing:
@inproceedings{lu2025representation,
  title={Representation Potentials of Foundation Models for Multimodal Alignment: A Survey},
  author={Lu, Jianglin and Wang, Hailing and Xu, Yi and Wang, Yizhou and Yang, Kuo and Fu, Yun},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025}
}

@inproceedings{Jianglin2025,
  title={The Indra Representation Hypothesis for Multimodal Alignment},
  author={Lu, Jianglin and Wang, Hailing and Yang, Kuo and Zhang, Yitian and Jenni, Simon and Fu, Yun},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}