arxiv:2604.12012

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Published on Apr 13 · Submitted by bingyi on Apr 20

Abstract

AI-generated summary: Enhanced vision-language models achieve superior dense patch-text alignment through improved pretraining techniques, including patch-level distillation, modified masked image objectives, and optimized caption sampling strategies.

Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/ .
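The patch-level distillation mentioned in the abstract can be illustrated with a minimal sketch: the student's per-patch embeddings are pulled toward a frozen teacher's per-patch embeddings. This is only an illustration of the general technique under assumed shapes and a cosine objective; the function name, encoder calls, and loss form are hypothetical, not taken from the released TIPSv2 code.

import torch
import torch.nn.functional as F

def patch_distillation_loss(student_patches, teacher_patches):
    # Both tensors assumed to be (batch, num_patches, dim).
    # A per-patch cosine objective is one common choice for feature
    # distillation; the paper's exact loss may differ.
    student = F.normalize(student_patches, dim=-1)
    teacher = F.normalize(teacher_patches.detach(), dim=-1)  # teacher gives no gradients
    return (1.0 - (student * teacher).sum(dim=-1)).mean()    # minimize 1 - cosine per patch

# Usage sketch with hypothetical encoders:
# with torch.no_grad():
#     t_patches = teacher_vit(images)   # (B, N, D), frozen teacher
# s_patches = student_vit(images)       # (B, N, D), trainable student
# loss = patch_distillation_loss(s_patches, t_patches)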

Community

Paper author · Paper submitter

Google DeepMind released TIPSv2, a foundational visual encoder unlocking spatially-aware representations, with strong overall results and significant gains on patch-text alignment. 🔥 💪

It all starts with a puzzling finding: their smaller, distilled models were surprisingly outperforming the massive pretrained models they were distilled from on patch-text alignment. 🤔

Investigating this deeply led to an improved pretraining recipe that fundamentally upgrades the vision-language encoder. Here are the three key changes TIPSv2 introduces (a rough code sketch follows after the list):

  • iBOT++: extends the patch-level self-supervised loss to all tokens (not just the masked ones), yielding dramatically stronger dense alignment.
  • Head-only EMA: applies EMA updates only to the projection head, drastically cutting memory requirements and training cost while retaining high performance.
  • Multi-granularity captions: combines PaliGemma and Gemini descriptions for richer, more robust text supervision.
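
A minimal sketch of how these three pieces could look in code, under stated assumptions: the iBOT-style patch objective is averaged over all tokens rather than only masked positions, the EMA update touches only the projection head, and each image randomly draws either a short or a long synthetic caption. Module names, shapes, the temperature, momentum, and sampling probability are hypothetical placeholders, not the released implementation.

import random
import torch
import torch.nn.functional as F

def all_token_patch_loss(student_logits, teacher_logits, temp=0.1):
    # iBOT-style cross-entropy between student and teacher patch distributions,
    # averaged over *all* tokens instead of only the masked ones.
    # Assumed shapes: (batch, num_tokens, num_prototypes).
    targets = F.softmax(teacher_logits.detach() / temp, dim=-1)   # teacher targets, no gradient
    log_probs = F.log_softmax(student_logits / temp, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()              # mean over batch and tokens

@torch.no_grad()
def head_only_ema_update(student_head, teacher_head, momentum=0.996):
    # EMA applied only to the projection-head parameters; in this sketch the
    # rest of the teacher is frozen/shared, which is where the memory and
    # compute savings would come from.
    for p_t, p_s in zip(teacher_head.parameters(), student_head.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def sample_caption(short_captions, long_captions, p_long=0.5):
    # Draw either a short (PaliGemma-style) or a long (Gemini-style) synthetic
    # caption so the text tower sees descriptions at different granularities.
    pool = long_captions if random.random() < p_long else short_captions
    return random.choice(pool)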

Combining these components, TIPSv2 demonstrates strong performance across 9 tasks and 20 datasets, generally on par with or better than recent vision encoder models, with particularly strong gains in zero-shot segmentation. 📈


