Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers. We reveal several key insights about how these models process visual data: (i) VLMs use the internal representations of the query tokens (e.g., the representations of "describe the image") to store global image information; we demonstrate that these models generate surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally. (iii) Fine-grained visual attributes and object details are extracted directly from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose a novel quantitative evaluation to validate our observations, leveraging complex real-world visual scenes. Finally, we demonstrate the potential of our findings for enabling efficient visual processing in state-of-the-art VLMs.
The figure below illustrates the flow of visual information in Vision-Language Models (VLMs) through an attention knockout analysis:
This analysis highlights the importance of query tokens in encoding global visual information and emphasizes the mid-layers' pivotal role in visual information flow.
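As a rough illustration of what an attention knockout does, the sketch below cuts selected attention edges by setting their scores to negative infinity before the softmax, so the blocked keys receive zero weight. It is a toy single-head NumPy example, not the authors' implementation: the token layout (four image tokens, two query tokens, two generated tokens) and the names `attention_with_knockout` and `knockout_mask` are assumptions made for illustration.

```python
import numpy as np

def attention_with_knockout(q, k, v, blocked=None):
    """Single-head causal attention. blocked[i, j] = True cuts the edge
    from query position i to key position j (hypothetical helper)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position i may attend to positions 0..i.
    allowed = np.tril(np.ones((n, n), dtype=bool))
    if blocked is not None:
        allowed &= ~blocked
    scores = np.where(allowed, scores, -np.inf)
    # Numerically stable softmax; every row keeps its diagonal entry,
    # so no row is fully masked as long as the diagonal is not blocked.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def knockout_mask(n, src, dst):
    """Block attention from query positions `dst` to key positions `src`."""
    blocked = np.zeros((n, n), dtype=bool)
    blocked[np.ix_(dst, src)] = True
    return blocked

# Assumed toy layout: positions 0-3 image, 4-5 query, 6-7 generated.
rng = np.random.default_rng(0)
n, d = 8, 4
q, k, v = rng.normal(size=(3, n, d))
image_pos = np.arange(0, 4)
generated_pos = np.arange(6, 8)

# Knock out the direct path from generated tokens to image tokens.
blocked = knockout_mask(n, src=image_pos, dst=generated_pos)
out_ko, weights_ko = attention_with_knockout(q, k, v, blocked)
out_full, weights_full = attention_with_knockout(q, k, v)
```

In an actual VLM the same kind of mask would presumably be applied across all heads within a chosen window of layers, and the effect measured on the generated text; with the generated-to-image edges removed, any image information reaching the output has to pass through the query tokens.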
The figure below visualizes the alignment between generated token attention to image tokens and object locations.
Pseudo-ground-truth object masks were obtained using the Segment Anything Model (SAM) to validate the alignment. The peak of each attention map, marked with a white cross, consistently matches the location of the corresponding object in the image, demonstrating the model's ability to attend to specific visual elements.
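One simple way to check this kind of alignment is a pointing-style test: find the peak of the attention map and see whether it falls inside the object mask. The sketch below is a minimal version of that idea and is not necessarily the evaluation used in the paper. It assumes the attention over image tokens has already been reshaped to the patch grid and the mask resized to the same grid; the function names and the 4x4 toy data are hypothetical.

```python
import numpy as np

def attention_peak(attn_map):
    """Return the (row, col) index of the maximum of a 2D attention map."""
    row, col = np.unravel_index(np.argmax(attn_map), attn_map.shape)
    return int(row), int(col)

def peak_hits_mask(attn_map, mask):
    """True if the attention peak lies inside the binary object mask."""
    row, col = attention_peak(attn_map)
    return bool(mask[row, col])

# Toy 4x4 patch grid: the object occupies the central 2x2 block.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

# Toy attention map with a low uniform background and one strong peak.
attn = np.full((4, 4), 0.02)
attn[2, 1] = 0.70

peak = attention_peak(attn)
hit = peak_hits_mask(attn, mask)
```

Averaging `peak_hits_mask` over many generated object tokens and their masks would give a single hit rate, which is one plausible way to summarize how often the attention peak lands on the right object.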