What’s in the Image? A Deep-Dive into the Vision of Vision Language Models

Weizmann Institute of Science
*Indicates Equal Contribution

Abstract

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers. We reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of "describe the image") is utilized by VLMs to store global image information; we demonstrate that these models generate surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally. (iii) Fine-grained visual attributes and object details are extracted directly from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose novel quantitative evaluations to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.


Visual Information Knockout

Our analysis reveals that VLMs compress high-level image information into query text tokens, enabling the model to generate descriptive responses even when direct access to image tokens is blocked. Through knockout experiments, we show that visual information is encoded and retrieved indirectly via the query tokens, highlighting their pivotal role as global image descriptors.
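
To make the knockout setup concrete, here is a minimal sketch of how such attention knockouts could be implemented as boolean masks over the token sequence. The token index ranges and the point at which the mask is combined with the causal mask are assumptions for illustration, not the paper's released code.

    import torch

    def knockout_mask(seq_len: int, blocked_rows: torch.Tensor,
                      blocked_cols: torch.Tensor) -> torch.Tensor:
        """Boolean mask (True = attention allowed) that prevents the tokens at
        `blocked_rows` from attending to the tokens at `blocked_cols`.
        It is meant to be AND-ed with the model's usual causal mask."""
        allowed = torch.ones(seq_len, seq_len, dtype=torch.bool)
        allowed[blocked_rows[:, None], blocked_cols] = False
        return allowed

    # Hypothetical index layout: image patch tokens, then the text query,
    # then positions reserved for generated tokens.
    image_idx = torch.arange(5, 581)    # assumed positions of the image tokens
    query_idx = torch.arange(581, 600)  # e.g., "describe the image"
    gen_idx   = torch.arange(600, 700)  # generated-token positions

    # Image-to-generated knockout: generated tokens cannot read image tokens directly.
    mask_img2gen = knockout_mask(700, gen_idx, image_idx)
    # Image-to-query knockout: query tokens cannot read image tokens.
    mask_img2query = knockout_mask(700, query_idx, image_idx)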

LLM-as-a-judge for Knockout Evaluation

Automatically counting the number of identified or hallucinated objects in free-text paragraphs is challenging.
Therefore, we propose an LLM-as-a-judge evaluation protocol, which allows us to automatically count correctly identified and hallucinated objects in the responses generated under different knockout configurations.
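
As a minimal sketch of such a protocol, the snippet below asks a judge model to count identified and hallucinated objects and converts the counts into an F1 score (used in the knockout evaluation below). It assumes an OpenAI-style chat API as the judge backend; the judge model and prompt wording are illustrative, not the paper's exact protocol.

    import json
    from openai import OpenAI  # any sufficiently capable LLM could serve as the judge

    client = OpenAI()

    JUDGE_PROMPT = """You are given a list of ground-truth objects present in an image
    and a model-generated description of that image. Return JSON with two integer fields:
      "identified"   - how many ground-truth objects the description mentions,
      "hallucinated" - how many mentioned objects are NOT in the ground-truth list.
    Ground-truth objects: {objects}
    Description: {description}"""

    def judge_f1(objects, description):
        reply = client.chat.completions.create(
            model="gpt-4o",  # illustrative choice of judge model
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                objects=", ".join(objects), description=description)}],
            response_format={"type": "json_object"},
        )
        counts = json.loads(reply.choices[0].message.content)
        tp, fp = counts["identified"], counts["hallucinated"]
        fn = len(objects) - tp
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        return 2 * precision * recall / max(precision + recall, 1e-8)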

Analyzing Visual Information Flow in Vision-Language Models

The figure below illustrates the flow of visual information in Vision-Language Models (VLMs) through an attention knockout analysis:

  • (a) Baseline (No Knockout): The VLM employs only causal masking, allowing query and generated tokens to access image tokens.
  • To analyze the flow of visual information, three attention knockout strategies were employed:
    • (b) Image-to-generated knockout: Image tokens influence generated tokens only indirectly via query tokens.
    • (c) Image-to-query knockout: Query tokens are prevented from accessing image information, isolating them from the visual context.
    • (d) Image-to-others knockout: Image tokens are blocked from attending to all other tokens.
  • (e) Evaluation: The performance of the model under each knockout configuration (applied at all layers) was measured. In the Image-to-generated configuration, the model achieved an F1 score of 0.4, demonstrating that query tokens successfully encode and relay global visual information. In contrast, the Image-to-query configuration led to a complete failure, highlighting the essential role of query tokens as global image descriptors.
  • (f) Knockout from a given layer onward: The analysis was extended by applying the knockouts only from a given layer onward. F1 scores rise sharply when the knockout starts around the mid-layers, underscoring their critical role (a rough sketch of such a sweep is shown after this section).

This analysis highlights the importance of query tokens in encoding global visual information and emphasizes the mid-layers' pivotal role in visual information flow.
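
A layer-wise sweep like the one in panel (f) could look roughly like this; `generate_with_knockout` is a hypothetical helper that applies the boolean knockout mask inside the attention modules from `start_layer` onward (e.g., via forward hooks), and `judge_f1` is the LLM-as-a-judge scorer sketched above.

    def knockout_layer_sweep(model, inputs, mask, objects, n_layers,
                             generate_with_knockout, judge_f1):
        """Apply the knockout only from `start_layer` onward and record F1 per layer.
        `n_layers` is the depth of the language backbone."""
        f1_per_layer = {}
        for start_layer in range(n_layers):
            # Layers below start_layer keep full attention; later layers use the mask.
            text = generate_with_knockout(model, inputs, mask, start_layer)
            f1_per_layer[start_layer] = judge_f1(objects, text)
        return f1_per_layer  # a jump around the middle layers mirrors panel (f)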



Details localized in the mid-layers

The figure below visualizes the alignment between the attention of generated tokens over the image tokens and the locations of the objects they refer to.

Pseudo-ground-truth object masks obtained with SAM were used to validate this alignment. The peak of each attention map, marked with a white cross, consistently matches the location of the corresponding object in the image, demonstrating the model's ability to attend to specific visual elements.
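
A simple way to score this alignment is to check whether the attention peak lands inside the object's SAM mask, as in the sketch below; the 24x24 patch grid and the aggregation of attention into a single map per generated token are assumptions for illustration.

    import numpy as np

    def attention_peak_in_mask(attn_to_image: np.ndarray, object_mask: np.ndarray,
                               grid: int = 24) -> bool:
        """Return True if the attention peak of a generated token falls inside the
        object's pseudo-ground-truth SAM mask.

        attn_to_image: attention weights from one generated token to the image tokens,
                       flattened over the patch grid (length grid * grid).
        object_mask:   boolean SAM mask at the original image resolution.
        """
        patch_attn = attn_to_image.reshape(grid, grid)
        peak_row, peak_col = np.unravel_index(patch_attn.argmax(), patch_attn.shape)
        # Map the peak patch centre back to pixel coordinates of the mask.
        h, w = object_mask.shape
        py = int((peak_row + 0.5) * h / grid)
        px = int((peak_col + 0.5) * w / grid)
        return bool(object_mask[py, px])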

Image re-prompting

Our analysis reveals that VLMs compress visual information into a small subset of highly attended image tokens. This enables a compressed context consisting of the top-K% most-attended image tokens together with the query tokens, which facilitates efficient image re-prompting: the model can answer additional questions without re-processing the full image, retaining 96% of the original performance while using only about 5% of the image tokens.
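
The sketch below shows one way such a compressed context could be assembled from the attention statistics; the index layout, the choice of attention statistic, and how the kept tokens are fed back to the model are assumptions, not the released implementation.

    import torch

    def build_compressed_context(image_embeds: torch.Tensor, query_embeds: torch.Tensor,
                                 attn_to_image: torch.Tensor, k_percent: float = 5.0):
        """Keep only the top-K% most-attended image tokens plus the query tokens.

        image_embeds:  (n_image, d) image-token hidden states.
        query_embeds:  (n_query, d) query-token hidden states.
        attn_to_image: (n_image,) attention mass each image token received from the
                       generated tokens, aggregated over heads and layers (assumed here).
        """
        n_keep = max(1, int(image_embeds.shape[0] * k_percent / 100))
        top_idx = attn_to_image.topk(n_keep).indices.sort().values  # keep spatial order
        kept_image = image_embeds[top_idx]
        # Compressed context: ~K% of the image tokens followed by the query tokens.
        return torch.cat([kept_image, query_embeds], dim=0)

Follow-up questions can then be answered against this compressed context instead of re-encoding the full image.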



The tables summarize the image re-prompting evaluation across 10 perception tasks from the MME benchmark. Metrics include accuracy (ACC), ACC+ (the percentage of images for which both associated questions are answered correctly), and the number of tokens used for re-prompting. The naive baseline (i.e., full access to all image tokens) achieves slightly higher accuracy, but the compressed context with K=5% retains 96% of the performance while using 12x fewer tokens, highlighting its efficiency for image re-prompting.

Results of our method on the MME benchmark
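
As a back-of-the-envelope illustration of where a roughly 12x reduction can come from (assuming a LLaVA-style 576-token image grid and a query of about 20 tokens; both counts are assumptions for illustration):

    n_image, n_query, k = 576, 20, 0.05        # assumed token counts
    full_context = n_image + n_query           # naive re-prompting: 596 tokens
    compressed = int(n_image * k) + n_query    # top-5% image tokens + query: 48 tokens
    print(full_context / compressed)           # ~12x fewer tokens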

Acknowledgements

The authors would like to thank Mor Geva Pipek, Yossi Gandelsman, and Boaz Nadler for their valuable feedback. This project was supported by an ERC starting grant OmniVideo (10111768). Dr Bagon received funding under the MBZUAI-WIS Joint Program for AI Research.

BibTeX


@misc{kaduri2024_vision_of_vlms,
      title={What's in the Image? A Deep-Dive into the Vision of Vision Language Models},
      author={Omri Kaduri and Shai Bagon and Tali Dekel},
      year={2024},
      eprint={2411.17491},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17491},
}