What’s in the Image? A Deep-Dive into the Vision of Vision Language Models

Weizmann Institute of Science
*Indicates Equal Contribution

Abstract

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers. We reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of "describe the image") is utilized by VLMs to store global image information; we demonstrate that these models generate surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally. (iii) Fine-grained visual attributes and object details are extracted directly from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose novel quantitative evaluations to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.


Visual Information Knockout

Our analysis reveals that VLMs compress high-level image information into query text tokens, enabling the model to generate descriptive responses even when direct access to image tokens is blocked. Through knockout experiments, we show that visual information is encoded and retrieved indirectly via the query tokens, highlighting their pivotal role as global image descriptors.
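
To make the knockout setup concrete, here is a minimal sketch of how such attention knockouts could be implemented as boolean masks over the token sequence. The token index ranges and the point at which the mask is combined with the causal mask are assumptions for illustration, not the paper's released code.

    import torch

    def knockout_mask(seq_len: int, blocked_rows: torch.Tensor,
                      blocked_cols: torch.Tensor) -> torch.Tensor:
        """Boolean mask (True = attention allowed) that prevents the tokens at
        `blocked_rows` from attending to the tokens at `blocked_cols`.
        It is meant to be AND-ed with the model's usual causal mask."""
        allowed = torch.ones(seq_len, seq_len, dtype=torch.bool)
        allowed[blocked_rows[:, None], blocked_cols] = False
        return allowed

    # Hypothetical index layout: image patch tokens, then the text query,
    # then positions reserved for generated tokens.
    image_idx = torch.arange(5, 581)    # assumed positions of the image tokens
    query_idx = torch.arange(581, 600)  # e.g., "describe the image"
    gen_idx   = torch.arange(600, 700)  # generated-token positions

    # Image-to-generated knockout: generated tokens cannot read image tokens directly.
    mask_img2gen = knockout_mask(700, gen_idx, image_idx)
    # Image-to-query knockout: query tokens cannot read image tokens.
    mask_img2query = knockout_mask(700, query_idx, image_idx)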

LLM-as-a-judge for Knockout Evaluation

Automatically counting the number of identified or hallucinated objects in free-text paragraphs is challenging.
Therefore, we propose an LLM-as-a-judge evaluation protocol, which allows us to automatically count correctly identified and hallucinated objects in the responses generated under different knockout configurations.
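
As a minimal sketch of such a protocol, the snippet below asks a judge model to count identified and hallucinated objects and converts the counts into an F1 score (used in the knockout evaluation below). It assumes an OpenAI-style chat API as the judge backend; the judge model and prompt wording are illustrative, not the paper's exact protocol.

    import json
    from openai import OpenAI  # any sufficiently capable LLM could serve as the judge

    client = OpenAI()

    JUDGE_PROMPT = """You are given a list of ground-truth objects present in an image
    and a model-generated description of that image. Return JSON with two integer fields:
      "identified"   - how many ground-truth objects the description mentions,
      "hallucinated" - how many mentioned objects are NOT in the ground-truth list.
    Ground-truth objects: {objects}
    Description: {description}"""

    def judge_f1(objects, description):
        reply = client.chat.completions.create(
            model="gpt-4o",  # illustrative choice of judge model
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                objects=", ".join(objects), description=description)}],
            response_format={"type": "json_object"},
        )
        counts = json.loads(reply.choices[0].message.content)
        tp, fp = counts["identified"], counts["hallucinated"]
        fn = len(objects) - tp
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        return 2 * precision * recall / max(precision + recall, 1e-8)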

Analyzing Visual Information Flow in Vision-Language Models

The figure below illustrates the flow of visual information in Vision-Language Models (VLMs) through an attention knockout analysis:

  • (a) Baseline (No Knockout): The VLM employs only causal masking, allowing query and generated tokens to access image tokens.
  • To analyze the flow of visual information, three attention knockout strategies were employed:
    • (b) Image-to-generated knockout: Image tokens influence generated tokens only indirectly via query tokens.
    • (c) Image-to-query knockout: Query tokens are prevented from accessing image information, isolating them from the visual context.
    • (d) Image-to-others knockout: Image tokens are blocked from attending to all other tokens.
  • (e) Evaluation: The performance of the model under each knockout configuration (applied at all layers) was measured. In the Image-to-generated configuration, the model achieved an F1 score of 0.4, demonstrating that query tokens successfully encode and relay global visual information. In contrast, the Image-to-query configuration led to a complete failure, highlighting the essential role of query tokens as global image descriptors.
  • (f) Knockout from a given layer onward: The analysis was extended by applying the knockouts only from a given layer onward. F1 scores rise sharply when the knockout starts around the mid-layers, underscoring their critical role (a rough sketch of such a sweep is shown after this section).

This analysis highlights the importance of query tokens in encoding global visual information and emphasizes the mid-layers' pivotal role in visual information flow.
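
A layer-wise sweep like the one in panel (f) could look roughly like this; `generate_with_knockout` is a hypothetical helper that applies the boolean knockout mask inside the attention modules from `start_layer` onward (e.g., via forward hooks), and `judge_f1` is the LLM-as-a-judge scorer sketched above.

    def knockout_layer_sweep(model, inputs, mask, objects, n_layers,
                             generate_with_knockout, judge_f1):
        """Apply the knockout only from `start_layer` onward and record F1 per layer.
        `n_layers` is the depth of the language backbone."""
        f1_per_layer = {}
        for start_layer in range(n_layers):
            # Layers below start_layer keep full attention; later layers use the mask.
            text = generate_with_knockout(model, inputs, mask, start_layer)
            f1_per_layer[start_layer] = judge_f1(objects, text)
        return f1_per_layer  # a jump around the middle layers mirrors panel (f)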



Details localized in the mid-layers

The figure below visualizes the alignment between the attention of generated tokens over the image tokens and the locations of the objects they refer to.

Pseudo-ground-truth object masks obtained with SAM were used to validate this alignment. The peak of each attention map, marked with a white cross, consistently matches the location of the corresponding object in the image, demonstrating the model's ability to attend to specific visual elements.
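
A simple way to score this alignment is to check whether the attention peak lands inside the object's SAM mask, as in the sketch below; the 24x24 patch grid and the aggregation of attention into a single map per generated token are assumptions for illustration.

    import numpy as np

    def attention_peak_in_mask(attn_to_image: np.ndarray, object_mask: np.ndarray,
                               grid: int = 24) -> bool:
        """Return True if the attention peak of a generated token falls inside the
        object's pseudo-ground-truth SAM mask.

        attn_to_image: attention weights from one generated token to the image tokens,
                       flattened over the patch grid (length grid * grid).
        object_mask:   boolean SAM mask at the original image resolution.
        """
        patch_attn = attn_to_image.reshape(grid, grid)
        peak_row, peak_col = np.unravel_index(patch_attn.argmax(), patch_attn.shape)
        # Map the peak patch centre back to pixel coordinates of the mask.
        h, w = object_mask.shape
        py = int((peak_row + 0.5) * h / grid)
        px = int((peak_col + 0.5) * w / grid)
        return bool(object_mask[py, px])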

Image re-prompting

Our analysis reveals that VLMs compress visual information into a small subset of highly attended image tokens. This enables a compressed context consisting of the top-K% most-attended image tokens together with the query tokens, which facilitates efficient image re-prompting: the model can answer additional questions without re-processing the full image, retaining 96% of the original performance while using only about 5% of the image tokens.
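
The sketch below shows one way such a compressed context could be assembled from the attention statistics; the index layout, the choice of attention statistic, and how the kept tokens are fed back to the model are assumptions, not the released implementation.

    import torch

    def build_compressed_context(image_embeds: torch.Tensor, query_embeds: torch.Tensor,
                                 attn_to_image: torch.Tensor, k_percent: float = 5.0):
        """Keep only the top-K% most-attended image tokens plus the query tokens.

        image_embeds:  (n_image, d) image-token hidden states.
        query_embeds:  (n_query, d) query-token hidden states.
        attn_to_image: (n_image,) attention mass each image token received from the
                       generated tokens, aggregated over heads and layers (assumed here).
        """
        n_keep = max(1, int(image_embeds.shape[0] * k_percent / 100))
        top_idx = attn_to_image.topk(n_keep).indices.sort().values  # keep spatial order
        kept_image = image_embeds[top_idx]
        # Compressed context: ~K% of the image tokens followed by the query tokens.
        return torch.cat([kept_image, query_embeds], dim=0)

Follow-up questions can then be answered against this compressed context instead of re-encoding the full image.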



The tables summarize the image re-prompting evaluation across 10 perception tasks from the MME benchmark. Metrics include accuracy (ACC), ACC+ (the percentage of images for which both associated questions are answered correctly), and the number of tokens used for re-prompting. The naive baseline (i.e., full access to all image tokens) achieves slightly higher accuracy, but the compressed context with K=5% retains 96% of the performance while using 12x fewer tokens, highlighting its efficiency for image re-prompting.

Results of our method on the MME benchmark
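
As a back-of-the-envelope illustration of where a roughly 12x reduction can come from (assuming a LLaVA-style 576-token image grid and a query of about 20 tokens; both counts are assumptions for illustration):

    n_image, n_query, k = 576, 20, 0.05        # assumed token counts
    full_context = n_image + n_query           # naive re-prompting: 596 tokens
    compressed = int(n_image * k) + n_query    # top-5% image tokens + query: 48 tokens
    print(full_context / compressed)           # ~12x fewer tokens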

Acknowledgements

The authors would like to thank Mor Geva Pipek, Yossi Gandelsman, and Boaz Nadler for their valuable feedback. This project was supported by an ERC starting grant OmniVideo (10111768). Dr Bagon received funding under the MBZUAI-WIS Joint Program for AI Research.

BibTeX


@misc{kaduri2024_vision_of_vlms,
      title={What's in the Image? A Deep-Dive into the Vision of Vision Language Models},
      author={Omri Kaduri and Shai Bagon and Tali Dekel},
      year={2024},
      eprint={2411.17491},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17491},
}