SigLIP 2: A better multilingual vision language encoder


Aritra Roy Gosthipaty, merve, Pavel Iakubovskii · 2025-02-21 · via Hugging Face - Blog

TL;DR

Today Google releases a new and better family of multilingual vision-language encoders, SigLIP 2. The authors have extended the training objective of SigLIP (sigmoid loss) with additional objectives for improved semantic understanding, localization, and dense features.

SigLIP 2 models outperform the older SigLIP ones at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs).

A cherry on top is the dynamic resolution (naflex) variant. This is useful for downstream tasks sensitive to aspect ratio and resolution.

Here is a list of all the models released:

Introduction

Vision encoders are simple: they take an image and encode it into a representation, and that representation is used for downstream tasks such as classification, object detection, and image segmentation. Researchers are always in pursuit of visual representations that are dense, locality-aware, and semantically rich.

CLIP and ALIGN were among the first examples of image encoders and text encoders aligned together through joint training. This approach opened new ways to train vision models. SigLIP took it further, replacing CLIP's softmax-based contrastive loss with a pairwise sigmoid loss for even better encoders.
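To make the sigmoid loss concrete, here is a minimal NumPy sketch in the spirit of SigLIP. It treats every image-text pair in a batch as an independent binary decision (matching or not matching). The function names and the default `temperature` and `bias` values are illustrative assumptions, not the authors' training code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def sigmoid_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of aligned (image, text) embeddings.

    Row i of image_emb is assumed to belong with row i of text_emb.
    temperature and bias are learnable in practice; fixed here for illustration.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = temperature * img @ txt.T + bias      # (batch, batch)
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0                 # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / n

# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings paired with the wrong partner.
emb = np.eye(4)
print(sigmoid_loss(emb, emb) < sigmoid_loss(emb, np.roll(emb, 1, axis=0)))
```

Because each pair is scored independently, the loss needs no normalization across the batch, unlike a softmax over all candidates.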

The takeaway? With smarter training objectives, we keep building vision encoders that are more structured, fine-grained, and powerful. SigLIP 2 is just that: a set of interesting and smart training objectives applied on top of SigLIP's to provide better and stronger vision-language encoders.

We will try something new with this blog post. Rather than stating what is new and where to find it, we will go through a little exercise together. We start off with SigLIP and then brainstorm a series of questions (prefixed with 🤔) and answers (a new heading) to gradually cover all the updates in SigLIP 2. Sounds good?

We will begin our journey with the vision encoder where the patch size is 16, and the image resolution is 256. We have four variants to start our training:

🤔 Question 1: What is a (low effort) auxiliary training objective that we can use to learn better visual representations (in terms of location awareness and sense of locality)?

Add a decoder (it’s that simple)

Let’s add a decoder to the mix. Now we have an image encoder, a text encoder, and a text decoder. The text decoder will have three objectives:

Predict a holistic image caption
Predict bounding box coordinates given captions describing specific image regions
Predict region-specific caption given bounding box coordinates

The decoder provides an additional signal to the vision encoder, making it location-aware. This marks the first improvement to the training recipe in SigLIP 2.

🤔 Question 2: How do we improve fine-grained local semantics of the image representation?

Self-distillation with Global-Local loss and Masked Prediction

To improve fine-grained local semantics in the image representation, we introduce two key training objectives: Global-Local Loss and Masked Prediction Loss. Taking inspiration from the self-supervised learning literature, we use self-distillation, where the same model acts as both teacher and student. At each iteration, the teacher's parameters are an exponential moving average of the student's parameters.

Global-Local Loss: The student network gets a partial (local) view of the training image, and is trained to match the teacher’s representation, derived from the full image.
Masked Prediction Loss: 50% of the embedded image patches in the student network are masked with mask tokens. The student needs to match the features of the teacher at masked locations.

These objectives teach the vision encoder to be spatially aware and improve its local semantics. The authors add these losses only after 80% of the training has been completed with the sigmoid and decoder losses. This saves compute (the additional losses are expensive) and avoids negatively affecting the encoders.
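The mechanics described above can be sketched in a few lines of NumPy. This is a toy illustration only: the helper names, the `momentum` value, and the dictionary-of-arrays parameter format are assumptions made for the example, and the actual training setup may differ.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter toward the student's value, in place."""
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher

def mask_patches(patch_embeddings, mask_token, ratio=0.5, rng=None):
    """Replace a random fraction of patch embeddings with a mask token.

    Returns the masked copy and a boolean array marking the masked positions,
    which is where the student would be asked to match the teacher's features.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patch_embeddings.shape[0]
    num_masked = int(round(ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    masked = patch_embeddings.copy()
    masked[mask] = mask_token
    return masked, mask

def use_distillation(step, total_steps, start_fraction=0.8):
    """Enable the extra losses only for the last part of training."""
    return step >= start_fraction * total_steps

teacher = {"w": np.zeros(3)}
ema_update(teacher, {"w": np.ones(3)}, momentum=0.9)
patches = np.ones((256, 8))                      # 256 patches, 8-dim toy embeddings
masked, mask = mask_patches(patches, np.zeros(8), rng=np.random.default_rng(0))
print(teacher["w"], int(mask.sum()), use_distillation(900, 1000))
```

A momentum close to 1 keeps the teacher changing slowly, which is what makes it a stable target for the student.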

🤔 Question 3: How do we adapt models to different resolutions?

Adapting to different resolutions

Image models can be very sensitive to varying resolutions and aspect ratios. Here we can leverage two distinct methodologies to adapt these models to different resolutions and patch sizes.

Fixed resolution variant: Taking the checkpoints at 95% of training, we can resize the positional embeddings and the patch embeddings and then continue training at the requested (potentially larger) resolution.
Dynamic resolution variant: Taking inspiration from FlexiViT, which uses inputs with different sequence lengths, and NaViT, which adheres to the native aspect ratios, we can create NaFlex variants. This is interesting because we can use a single model for OCR (little aspect ratio distortion) and document understanding (appropriate resolution).
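To see why aspect ratio matters, the following sketch compares the patch grid of a fixed-resolution model with a grid that follows the image's native aspect ratio under a token budget. The function `aspect_preserving_grid` is a hypothetical simplification written for this example, not the NaFlex preprocessing used in the released models.

```python
import math

def fixed_grid(resolution, patch_size=16):
    """Fixed-resolution models resize every image to a square, e.g. 256 / 16 = 16 per side."""
    side = resolution // patch_size
    return side, side

def aspect_preserving_grid(height, width, patch_size=16, max_num_patches=256):
    """Pick a (rows, cols) patch grid that keeps the aspect ratio within a token budget.

    The scale is chosen so that the scaled image area equals the budget in patches;
    flooring each side then guarantees rows * cols <= max_num_patches.
    """
    scale = math.sqrt(max_num_patches * patch_size**2 / (height * width))
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))
    return rows, cols

# A wide 512 x 1024 image: the fixed variant squashes it to a square grid,
# while the aspect-preserving grid keeps it roughly twice as wide as tall.
print(fixed_grid(256))
print(aspect_preserving_grid(512, 1024))
```

Both grids stay within the same sequence-length budget, but only the second one avoids distorting the image, which is the property the dynamic resolution variants are after.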

Models with the -naflex suffix are the dynamic resolution variants. While the fixed-resolution models can be used out of the box with the existing SiglipModel class, you would need to use Siglip2Model to use the naflex variants. We handle this automatically when you use the pipeline API!

This brings us to the end of the evolution from SigLIP to SigLIP 2. In the next sections we will look at applications with SigLIP 2.

Run inference with transformers

Running inference on the models is pretty straightforward. You can copy and paste the code below and run inference on a free-tier Colab notebook 🚀

To run inference on SigLIP 2, please install transformers from main or from this stable branch: pip install git+https://github.com/huggingface/transformers@v4.49.0-SigLIP-2

Zero-shot Classification

Here we use the handy pipeline API to showcase zero-shot classification capabilities for SigLIP 2.

from transformers import pipeline

ckpt = "google/siglip2-so400m-patch14-384"
pipe = pipeline(model=ckpt, task="zero-shot-image-classification")

inputs = {
    "images": [
        "https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg", # bear
        "https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000776.jpg", # teddy bear
    ],
    "texts": [
        "bear looking into the camera",
        "bear looking away from the camera",
        "a bunch of teddy bears",
        "two teddy bears",
        "three teddy bears"
    ],
}

outputs = pipe(inputs["images"], candidate_labels=inputs["texts"])

Let’s visualize the outputs.


Figure: Zero-shot classification scores visualized
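If you prefer to inspect the scores as text, here is a small sketch of how the pipeline results can be summarized. It assumes that, for a list of images, the pipeline returns one list per image containing dictionaries with `label` and `score` keys. The values in `mock_outputs` are hypothetical placeholders that only mimic that structure, not real model scores.

```python
def top_prediction(predictions):
    """Return the (label, score) pair with the highest score for one image."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]

# Hypothetical placeholder values, shaped like the pipeline output for two images.
mock_outputs = [
    [
        {"label": "bear looking into the camera", "score": 0.50},
        {"label": "two teddy bears", "score": 0.10},
    ],
    [
        {"label": "a bunch of teddy bears", "score": 0.40},
        {"label": "bear looking away from the camera", "score": 0.05},
    ],
]

for index, predictions in enumerate(mock_outputs):
    label, score = top_prediction(predictions)
    print(f"image {index}: {label} ({score:.2f})")
```

With real results, replace `mock_outputs` with the `outputs` variable from the snippet above.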

Encode images for downstream tasks

You can also encode images using the following:

import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

ckpt = "google/siglip2-so400m-patch14-384"
# device_map="auto" places the weights on the available device(s)
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
# Preprocess the image and move the tensors to the model's device
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape) # torch.Size([1, 1152])
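One simple downstream use of such embeddings is image retrieval by cosine similarity. The sketch below uses random NumPy vectors as stand-ins for real 1152-dimensional SigLIP 2 embeddings, and the helper names are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(queries, gallery, eps=1e-12):
    """Cosine similarity between every query row and every gallery row."""
    q = queries / np.maximum(np.linalg.norm(queries, axis=-1, keepdims=True), eps)
    g = gallery / np.maximum(np.linalg.norm(gallery, axis=-1, keepdims=True), eps)
    return q @ g.T

def nearest_neighbors(queries, gallery, k=3):
    """Indices of the k most similar gallery items for each query, best first."""
    similarities = cosine_similarity_matrix(queries, gallery)
    return np.argsort(-similarities, axis=-1)[:, :k]

# Stand-in data: 10 "gallery" embeddings and a query that is a
# slightly perturbed copy of gallery item 4.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(10, 1152))
query = gallery[[4]] + 0.01 * rng.normal(size=(1, 1152))
print(nearest_neighbors(query, gallery, k=1))
```

With real embeddings, convert the tensor first, for example with `image_embeddings.float().cpu().numpy()`, before passing it to these helpers.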

Comparing SigLIP 1 with SigLIP 2

Looking at the table of all the SigLIP 2 models released, we see two distinct changes from SigLIP:

SigLIP 2 has new variants (naflex) for dynamic resolution.
SigLIP 2 adds a giant (1B) series.

The evaluation table of SigLIP 2 demonstrates its superiority over SigLIP.

Here is a demo where one can compare the zero-shot classification results of SigLIP 1 and SigLIP 2.

Using the encoder for VLMs

Vision encoders aligned to textual information have become increasingly vital in the development of Vision Language Models (VLMs). A common approach to building VLMs involves combining a pretrained vision encoder with a pretrained LLM, and training them together using multimodal data across a diverse set of vision-language tasks.
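One common way to connect the two components, sketched here under stated assumptions, is to project the vision encoder's patch features into the LLM's embedding space and prepend them to the text token embeddings. The linear projection, the dimensions, and the function name below are illustrative choices, and random arrays stand in for real features; specific VLMs may wire this differently.

```python
import numpy as np

def build_multimodal_sequence(image_tokens, text_embeddings, projection):
    """Project image tokens to the LLM width and place them before the text tokens."""
    projected = image_tokens @ projection            # (num_patches, llm_dim)
    return np.concatenate([projected, text_embeddings], axis=0)

rng = np.random.default_rng(0)
vision_dim, llm_dim = 1152, 2048                     # hypothetical dimensions
image_tokens = rng.normal(size=(256, vision_dim))    # stand-in for patch features
text_embeddings = rng.normal(size=(12, llm_dim))     # stand-in for 12 text tokens
projection = 0.02 * rng.normal(size=(vision_dim, llm_dim))

sequence = build_multimodal_sequence(image_tokens, text_embeddings, projection)
print(sequence.shape)  # (268, 2048)
```

The LLM then processes the combined sequence, so the quality of the image tokens directly shapes what the language model can "see".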

One standout example of a VLM leveraging the SigLIP family of vision encoders is PaliGemma. One can dive deeper into PaliGemma's capabilities in this PaliGemma blog post. Building on this foundation, the recently introduced PaliGemma 2 takes it a step further by integrating SigLIP with the advanced Gemma 2 LLM. It would be really exciting to swap out SigLIP for SigLIP 2 in a PaliGemma-like setting and see how that model fares.

Acknowledgements

We would like to thank Michael Tschannen (first author of SigLIP 2), Vaibhav Srivastav and Sayak Paul for feedback on this blog post. A huge shout out to the Google team for releasing this amazing, and open, model family.

In no particular order we would like to thank Pavel, Ross, Pablo, Pedro, Lysandre and the rest of the Hugging Face team for their immense support and contribution towards this project.
