惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Recent Commits to openclaw:main
Recent Commits to openclaw:main
博客园 - 叶小钗
Stack Overflow Blog
Stack Overflow Blog
S
SegmentFault 最新的问题
D
DataBreaches.Net
S
Securelist
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
T
Threatpost
C
Cyber Attacks, Cyber Crime and Cyber Security
The Hacker News
The Hacker News
Jina AI
Jina AI
T
Threat Research - Cisco Blogs
GbyAI
GbyAI
Microsoft Azure Blog
Microsoft Azure Blog
WordPress大学
WordPress大学
Engineering at Meta
Engineering at Meta
T
The Exploit Database - CXSecurity.com
A
Arctic Wolf
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
C
Cisco Blogs
PCI Perspectives
PCI Perspectives
Project Zero
Project Zero
G
Google Developers Blog
宝玉的分享
宝玉的分享
H
Heimdal Security Blog
美团技术团队
Schneier on Security
Schneier on Security
C
CERT Recently Published Vulnerability Notes
Martin Fowler
Martin Fowler
博客园 - 司徒正美
博客园 - 三生石上(FineUI控件)
Help Net Security
Help Net Security
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
Google DeepMind News
Google DeepMind News
C
Check Point Blog
Hacker News: Ask HN
Hacker News: Ask HN
L
LINUX DO - 最新话题
O
OpenAI News
Hacker News - Newest:
Hacker News - Newest: "LLM"
N
Netflix TechBlog - Medium
S
Security Affairs
小众软件
小众软件
MongoDB | Blog
MongoDB | Blog
Blog — PlanetScale
Blog — PlanetScale
V
V2EX - 技术
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
F
Fortinet All Blogs
G
GRAHAM CLULEY
云风的 BLOG
云风的 BLOG
S
Secure Thoughts

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap
Transformers v5: Simple model definitions powering the AI ecosystem
Lysandre, Arthur Zucker, Cyril Vallez, Vaibhav Srivastav · 2025-12-01 · via Hugging Face - Blog

Back to Articles

Transformers standardizing model definitions

Transformers' version v4.0.0rc-1, the initial release candidate for version 4, was released on November 19th, 2020. Five years later, we now release v5.0.0rc-0.

Today, as we launch v5, Transformers is installed more than 3 million times each day via pip - up from 20,000/day in v4 🤯. Altogether, it has now surpassed 1.2 billion installs!

The ecosystem has expanded from 40 model architectures in v4 to over 400 today, and the community has contributed more than 750,000 model checkpoints on the Hub compatible with Transformers, up from roughly 1,000 at the time of v4.

This growth is powered by the evolution of the field and the now mainstream access to AI. As a leading model-definition library in the ecosystem, we need to continuously evolve and adapt the library to continue being relevant. Reinvention is key for longevity in AI.

We’re fortunate to collaborate with many libraries and apps built on transformers, in no specific order: llama.cpp, MLX, onnxruntime, Jan, LMStudio, vLLM, SGLang, Unsloth, LlamaFactory, dLLM, MaxText, TensorRT, Argmax, among many other friends.

For v5, we wanted to work on several notable aspects: simplicity, training, inference, and production. We detail the work that went into them in this post.

Simplicity

The first focus of the team was on simplicity. Working on transformers, we see the code as the product. We want our model integrations to be clean, so that the ecosystem may depend on our model definitions and understand what’s really happening under the hood, how models differ from each other, and the key features of each new model. Simplicity results in wider standardization, generality, and wider support.

Model Additions

Transformers is the backbone of hundreds of thousands of projects, Unsloth included. We build on Transformers to help people fine-tune and train models efficiently, whether that’s BERT, text-to-speech (TTS), or others; to run fast inference for reinforcement learning (RL) even when models aren’t yet supported in other libraries. We're excited for Transformers v5 and are super happy to be working with the Hugging Face team!

-- Michael Han at Unsloth

Transformers, at the core, remains a model architecture toolkit. We aim to have all recent architectures and to be the “source of truth” for model definitions. We’ve been adding between 1 - 3 new models every week for 5 years, shown in the timeline below:

We’ve worked on improving that model-addition process.

Modular Approach

Over the past year, we’ve heavily pushed our modular design as a significant step forward. This allows for easier maintenance, faster integration, and better collaboration across the community.

We give a deeper overview in our Maintain the Unmaintainable blog post. For brevity, we aim to achieve a much easier model contribution process, as well as a lower maintenance burden. One metric we can highlight is that the number of lines of code to contribute (and review), drop significantly when modular is used:

Transformers standardizing model definitions

While we respect the “One model, one file” philosophy, we continue introducing some abstractions making the management of common helpers simpler. The prime example of this is the introduction of the AttentionInterface, which offers a centralized abstraction for attention methods. The eager method will remain in the modeling file; others, such as FA1/2/3, FlexAttention, or SDPA, are moved to the interface.

Over the past couple of years, the increasing amount of 0-day support for new model architectures and standardization of attention handling has helped to simplify our support for post-training modern LLMs.

-- Wing Lian, Axolotl

Tooling for Model Conversion

We’re building tooling to help us identify which existing model architecture a new model resembles. This feature uses machine learning to find code similarities between independent modeling files. Going further, we aim to automate the conversion process by opening a draft PR for the model to be integrated into our transformers format. This process reduces manual effort and ensures consistency.

Code Reduction

Streamlining Modeling & Tokenization/Processing Files

We’ve significantly refactored the modeling and tokenization files. Modeling files have been greatly improved thanks to the modular approach mentioned above, on top of standardization across models. Standardization contributes to abstracting most of the tools that don’t make up a model, so that the modeling code only contains the relevant parts for a model’s forward/backward passes.

Alongside this work, we’re simplifying the tokenization and processing files: going forward, we’ll only focus on the tokenizers backend, removing the concept of “Fast” and “Slow” tokenizers.

We'll use tokenizers as our main tokenization backend, just as we do for PyTorch-based models. We’ll offer alternatives for Sentencepiece or MistralCommon backed tokenizers, which will be non-default but will be supported. Image processors will now only exist with their fast variant, which depends on the torchvision backend.

Finally, we’re sunsetting our Flax/TensorFlow support in favor of focusing on PyTorch as the sole backend; however, we're also working with partners in the Jax ecosystem to ensure we have compatibility between our models and this ecosystem.

With its v5 release, transformers is going all in on PyTorch. Transformers acts as a source of truth and foundation for modeling across the field; we've been working with the team to ensure good performance across the stack.

We're excited to continue pushing for this in the future across training, inference, and deployment.

-- Matt White, Executive Director, PyTorch Foundation. GM of AI, Linux Foundation

Training

Training remains a big focus of the team as we head into v5: whereas previously we would focus heavily on fine-tuning rather than pre-training/full-training at scale, we’ve recently done significant work to improve our support for the latter as well.

Pre-training at scale

Supporting pre-training meant reworking the initialization of our models, ensuring that they worked at scale with different parallelism paradigms, and shipping support for optimized kernels for both the forward and backward passes.

Going forward, we’re excited to have extended compatibility with torchtitan, megatron, nanotron, as well as any other pre-training tool that is interested in collaborating with us.

Fine-tuning & Post-training

We continue collaborating closely with all fine-tuning tools in the Python ecosystem. We aim to continue providing model implementations compatible with Unsloth, Axolotl, LlamaFactory, TRL and others in the PyTorch ecosystem; but we are also working with tools such as MaxText, in the JAX ecosystem, to have good interoperability between their frameworks and transformers.

All fine-tuning and post-training tools can now rely on transformers for model definitions; further enabling Agentic use-cases through OpenEnv or the Prime Environment Hub.

Inference

We’re putting a significant focus on inference for v5, with several paradigm changes: the introduction of specialized kernels, cleaner defaults, new APIs, support for optimized inference engines.

Similarly to training, we’ve been putting some effort in packaging kernels so that they’re automatically used in case your hardware and software permits it. If you haven’t heard of kernels before, we recommend taking a look at this doc.

Alongside this effort, we ship two new APIs dedicated to inference:

  • We ship support for continuous batching and paged attention mechanisms. This has now been used internally for some time, and we’re working on finalizing the rough edges and writing usage guides.
  • We introduce transformers serve as the new transformers-specific serving system, which deploys an OpenAI API-compatible server.

We see this as a major step forward for use-cases such as evaluation, where a great number of inference requests are done simultaneously. We don’t aim to do specialized optimizations like the dedicated inference engines (vLLM, SGLang, TensorRT LLM). Instead, we aim to be perfectly inter-compatible with these, as detailed in the next section.

The Transformers backend in vLLM has been very enabling to get more architectures, like BERT and other encoders, available to more users. We've been working with the Transformers team to ensure many models are available across modalities with the best performance possible. This is just the start of our collaboration: we're happy to see the Transformers team will have this as a focus going into version 5.

-- Simon Mo, Harry Mellor at vLLM

Standardization is key to accelerating AI innovation. Transformers v5 empowers the SGLang team to spend less time on model reimplementation and more time on kernel optimization. We look forward to building a more efficient and unified AI ecosystem together!

-- Chenyang Zhao at SGLang

Production & Local

Recently, we've been working hand in hand with the most popular inference engines for them to use transformers as a backend. The value added is significant: as soon as a model is added to transformers, it becomes available in these inference engines, while taking advantage of the strengths each engine provides: inference optimizations, specialized kernels, dynamic batching, etc.

We've also been working very closely with ONNXRuntime, llama.cpp and MLX so that the implementations between transformers and these modeling libraries have great interoperability. For example, thanks to a significant community effort, it's now very easy to load GGUF files in transformers for further fine-tuning. Conversely, transformers models can be easily converted to GGUF files for use with llama.cpp.

The Transformers framework is the go-to place for reference AI model implementations. The framework plays a crucial role in enabling modern AI across the entire stack. The team and the community behind the project truly understand and embrace the spirit of the open-source development and collaboration.

-- Georgi Gerganov, ggml-org

The same is true for MLX, where the transformers' safetensors files are directly compatible with MLX's models.

It’s hard to overstate the importance of Transformers (and datasets, tokenizers, etc) to the open-source and overall AI ecosystem. I can’t count the number of times I’ve personally used Transformers as a source-of-truth.

-- Awni Hannun, MLX

Finally, we’re pushing the boundaries of local inference and are working hand-in-hand with the executorch team to get the transformers models to be available on-device. We’re expanding the coverage to multimodal models (vision, audio) through optimum.

Quantization

Quantization is quickly emerging as the standard for state-of-the-art model development. Many SOTA models are now released in low-precision formats such as 8-bit and 4-bit (e.g., gpt-oss, Kimi-K2, Deepseek-r1), hardware is increasingly optimized for low-precision workloads, and the community is actively sharing high-quality quantized checkpoints. In v5, we're making quantization a central focus of Transformers support, ensuring full compatibility with all major features, and delivering a reliable framework for training and inference.

We introduce a significant change to the way we load weights in our models; and with this, we move to quantization being a first-class citizen.

Our collaboration with the Transformers team was highly productive, marked by their proactive code reviews, feedback, and technical expertise. Their support was crucial in integrating TorchAO, expanding quantization features, and improving documentation for broader adoption in the V5.

-- Jerry Zhang at TorchAO

We're excited that v5 has made quantization a first-class citizen. It provides the foundation for bitsandbytes to better support key features like TP and MoEs, and also makes it easier to integrate new quantization methods.

-- Matthew Douglas & Titus von Koeller, bitsandbytes

Conclusion

The overarching theme of this version 5 release is “interoperability”. All refactors, performance improvements, and standardization are aligned with this theme. v5 plays nicely and end-to-end with the growing ecosystem: train a model with Unsloth/Axolotl/LlamaFactory/MaxText deploy it with vLLM/SGLang, and export it to llama.cpp/executorch/MLX to run locally!

Version 5 is undeniably an accomplishment of the past five years by a very large number of people in our community. We also see it as a promise, and as a beacon of the direction we want to go.

We took it as an opportunity to clean up the toolkit and isolate what mattered; we now have a clean slate on top of which to build. Thanks to the many changes from the community and team, improvements in performance, usability, and readability, will be simpler to ship.

Now that v5.0.0's first RC is out there, we'll be eagerly awaiting your feedback. Please check our release notes for all the technical details, and we'll be awaiting your feedback in our GitHub issues!