A Deepdive into Aya Vision: Advancing the Frontier of Multilingual Multimodality

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap

Saurabh Dash, Yiyang Nan, Arash Ahmadian, John Dang · 2025-03-04 · via Hugging Face - Blog

Back to Articles

With the release of the Aya Vision family, our new 8B and 32B parameter vision-language models (VLMs), we are addressing one of the biggest challenges in AI: bringing multilingual performance to multimodal models.

Aya Vision is Cohere For AI's latest open-weight multilingual and multimodal model family, designed to be a strong foundation for language and vision understanding across 23 languages. It builds on the success of Aya Expanse, state-of-the-art multilingual language models, and extends it using a combination of advanced techniques. These include synthetic annotations, scaling up multilingual data through translation and rephrasing, and multimodal model merging – key methods that improve both language and vision understanding in a multilingual setting.

As a result, our models perform well in a variety of tasks, including image captioning, visual question answering, text generation, and translating both text and images into clear, natural-language text. We evaluated Aya Vision models on a set of datasets, including our new open-ended vision-language benchmark AyaVisionBench and a multilingual version of Wild Vision Bench (mWildVision) that is translated into 23 languages, which we release both of them for research.

In pair-wise comparison, Aya Vision 32B outperforms models more than 2x of its size, such as Llama-3.2 90B Vision, Molmo 72B, and Qwen2.5-VL 72B by win rates ranging from 50% to 64% on AyaVisionBench and 52% to 72% on mWildVision average across 23 languages.

Our compact and more efficient model Aya Vision 8B achieves the best performance in multilingual multimodal in its parameter class, outperforming leading models such as Qwen2.5-VL 7B, Pixtral 12B, Gemini Flash 1.5 8B, Llama-3.2 11B Vision, Molmo-D 7B, and Pangea 7B by up to 79% win-rates on AyaVisionBench and 81% on mWildBench.

We release both 8B and 32B models as open weights for the research community to further accelerate multilingual multimodal progress. In this blog post, we share the key technical details behind Aya Vision models

Aya Vision Architecture and Training

For a high-performance vision-language model, it is important to process images with arbitrary resolutions, especially high-resolution images. To enable this capability in Aya Vision, we dynamically resize and split any higher-resolution images into multiple tiles to generate rich image features from the image encoder. In Aya Vision models, we use the recently released SigLIP2-patch14-384 model as the initialization for the vision encoder.

While dynamic resizing enables processing high-resolution images, it also leads to a larger number of image tokens passing through the vision-language connector and LLM decoder. To improve latency and throughput, we use a downsampling method called Pixel Shuffle, to compress the number of image tokens by 4x. After downsampling, image tokens are aligned to the language model input embeddings through a vision-language connector and passed to an LLM decoder.

For the text decoder, we use our multilingual language models. For Aya Vision 8B, we use an LLM that is initialized from Cohere Command R7B for improved instruction following and world knowledge and further post-trained using the Aya Expanse recipe consisting of diverse multilingual data, model merging, and preference training. For Aya Vision 32B, we initialize the language model from Aya Expanse 32B based on its state-of-the-art multilingual performance.

Training process

We trained Aya Vision models in 2 stages – vision-language alignment and supervised fine-tuning (SFT). In the vision-language alignment stage, only the vision-language connector is trained, while the vision encoder and the language model weights are kept frozen. This enables rudimentary vision-language understanding by mapping the image encoder features to the language model embedding space. In the SFT stage, we train both the connector and the language model on a diverse set of multimodal tasks in 23 languages.

Multimodal Data Enhancement and Expanding Language Coverage

One of the biggest challenges in developing a multilingual vision-language model is ensuring strong performance across underrepresented languages. To address this, we first gather synthetic annotations using a diverse pool of high-quality datasets in English, which lay the basis for our multilingual multimodal annotation. Following the synthetic annotations of English datasets, we translated a large volume of the data into 23 languages. To avoid translation artefacts and maintain fluent textual characteristics with high precision in answers, we then rephrased translated prompt/generation pairs by matching them with the original high-quality synthetic samples, expanding language coverage where real-world datasets are scarce. This improves both linguistic fluency and alignment between vision and text, allowing Aya Vision to exhibit superior image understanding in multiple languages.

Our 8B model, when only supervised fine-tuned with original academic datasets, reaches a 40.9% win rate across 23 languages in AyaVisionBench against Pangea 7B, which is a multilingual VLM, whereas synthetic annotations and scaling up the multilingual data lead to a 58.1% win rate with a gain of 17.2%. This significant improvement showcases the impact of significant investment in multilingual data coverage.

Multimodal Model Merging

A state-of-the-art vision-language model should excel not only in image understanding but also in conversational context, where the model is expected to generate a high-quality response to both image and text inputs. To address this, inspired by our previous research on model merging, a technique that combines multiple trained models, we merge the base language model with the fine-tuned vision-language model.

Model merging enhances the generative capabilities of our final model that leads to a 70% win rates across 23 languages on AyaVisionBench against Pangea 7B, improving the multimodal win rate by 11.9% compared to the model before merging.

Multimodal model merging also enables our Aya Vision models to excel in text-only tasks as measured in mArenaHard datasets compared with the other leading vision-language models.


Overview of the training pipeline for Aya Vision

Scaling up to 32B

Finally, we scale our recipe from 8B to 32B, resulting in the state-of-the-art open-weight multilingual vision-language model – Aya Vision 32B which shows significant improvements in win rates due to the stronger initialization of the text-backbone, and outperforms models more than 2x of its size, such as Llama-3.2 90B Vision, Molmo 72B, and Qwen2.5-VL 72B by win rates ranging from 49% to 63% on AyaVisionBench and 52% to 72% on mWildVision average across 23 languages.

Aya Vision Benchmark – a multilingual evaluation data

Together with Aya Vision models, we also release a high-quality multilingual vision-language benchmark called AyaVisionBench, constructed based on real-world applications, covering 23 languages and 9 distinct task categories, with 135 image-question pairs per language.

We make this evaluation set available to the research community to push forward multilingual multimodal evaluations. This dataset is designed to assess a model’s ability to perform a diverse range of vision-language tasks, including captioning, chart and figure understanding, identifying differences between two images, general visual question answering, OCR, document understanding, text transcription, reasoning involving logic and math, and converting screenshots to code. By incorporating multiple languages and task types, the dataset provides a broad and challenging evaluation framework for assessing cross-lingual and multimodal understanding.

To create this dataset, we first selected images from the Cauldron held-out test set, a large collection derived from 50 high-quality datasets, ensuring they had not been seen during training. For each image, we then generated a corresponding question that explicitly required visual context for an answer. These questions were synthetically generated and subsequently refined through a two-stage verification process. First, human annotators reviewed and validated each question to ensure it was clear, relevant, and truly dependent on the image. This rigorous selection and validation process ensures that the dataset serves as a robust benchmark for evaluating vision-language models in multilingual and real-world settings.

Designed for real-world applications

Communication happens in many forms and in many languages. With our leading research and development, we’ve released a model that facilitates connection, whether in text or visual, in 23 different languages today.

Aya Vision has a wide range of practical applications, where one notable example is its availability on WhatsApp, one of the most broadly used communications platforms in the world. This allows a massive audience of global citizens who speak a multitude of languages to utilize the capabilities of Aya Vision on a platform they use to communicate every single day.

Getting Started with Aya

To get started:

Download weights and datasets from the Aya Vision collection on Hugging Face.
Try Aya Vision using our Hugging Face Space or text it on Whatsapp
Build on Aya using our colab example.

Learn more about our ongoing efforts around multilingual.

Acknowledgments

This work wouldn’t have been possible without the core Aya Vision technical team:

Saurabh Dash, Oliver Nan, John Dang, Arash Ahmadian Dehkordi, Shivalika Singh, Alejandro Salamanca, Bharat Venkitesh, Vlad Shmyhlo, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Madeline Smith, Marzieh Fadaee, Manoj Govindassamy, Sudip Roy, Matthias Gallé, Beyza Ermis, Ahmet Üstün, Sara Hooker.

It also wouldn’t have been possible without the wider Cohere For AI and Cohere team who supported in many different ways. Special thanks to Sungjin Hong, Michael Kozakov, Pierre Richemond, Brittawnya Prince, Jim Payne, Kyle Lastovica, Jeff Colen, Jenna Cook, Viraat Aryabumi, Trent Fowler, Linus Chui, Meor Amer, Lucas Fayoux, Kyle Lastovica, Billy Trend, Acyr Locatelli, Morgan Norman, Florian Strub, Jon Ander Campos, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang.

Special thank you to Hugging Face for helping make this come together: Yoni Gozlan, Arthur Zucker, Pedro Cuenca, Aritra Roy Gosthipaty, Merve Noyan, Vaibhav Srivastav.

References

[1] Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
[2] Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
[3] WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
[4] SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
[5] What matters when building vision-language models?
[6] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
[7] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hugging Face - Blog

Aya Vision Architecture and Training

Training process

Multimodal Data Enhancement and Expanding Language Coverage

Multimodal Model Merging

Scaling up to 32B

Aya Vision Benchmark – a multilingual evaluation data

Designed for real-world applications

Getting Started with Aya

Acknowledgments

References