GaLore: Advancing Large Model Training on Consumer-grade Hardware

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap

Titus von Koeller, Jiawei Zhao, Matthew Douglas, Yaowei Zheng, Y · 2024-03-20 · via Hugging Face - Blog

Back to Articles

The integration of GaLore into the training of large language models (LLMs) marks a significant advancement in the field of deep learning, particularly in terms of memory efficiency and the democratization of AI research. By allowing for the training of billion-parameter models on consumer-grade hardware, reducing memory footprint in optimizer states, and leveraging advanced projection matrix techniques, GaLore opens new horizons for researchers and practitioners with limited access to high-end computational resources.

Scaling LLMs with Consumer-Grade Hardware

The capability of GaLore to facilitate the training of models with up to 7 billion parameters, such as those based on the Llama architecture, on consumer GPUs like the NVIDIA RTX 4090, is groundbreaking. This is achieved by significantly reducing the memory requirements traditionally associated with optimizer states and gradients during the training process. The approach leverages the inherent low-rank structure of gradients in deep neural networks, applying a projection that reduces the dimensionality of the data that needs to be stored and manipulated.

Memory Efficiency in Optimizer States

The optimizer state, especially in adaptive optimization algorithms like Adam, represents a significant portion of the memory footprint during model training. GaLore addresses this by projecting the gradients into a lower-dimensional subspace before they are processed by the optimizer. This not only reduces the memory required to store these states but also maintains the effectiveness of the optimization process.

The memory savings are substantial, with the authors reporting “more than 82.5% reduction in memory for storing optimizer states during training”, making it feasible to train larger models or use larger batch sizes within the same memory constraints. When combined with 8-bit precision optimizers, these savings can be even more pronounced.

Subspace Switching and Advanced Projection Techniques

A critical component of GaLore's effectiveness is its dynamic subspace switching mechanism, which allows the model to navigate through different low-rank subspaces throughout the training process. This ensures that the model is not confined to a limited portion of the parameter space, thus preserving the capacity for full-parameter learning. The decision on when and how to switch subspaces is pivotal, with the frequency of these switches being a balance between maintaining a consistent optimization trajectory and adapting to the evolving landscape of the gradient's low-rank structure.

The ability to dynamically adjust these projections in response to changes in the gradient structure is a potent tool in the GaLore arsenal, allowing for more nuanced control over the memory-optimization trade-offs inherent in training large models.

Combining GaLore with 8-bit Optimizers

The combination of GaLore with 8-bit precision optimizers represents a synergy that maximizes memory efficiency while maintaining the integrity and performance of the training process. 8-bit optimizers reduce the memory footprint by quantizing the optimizer states. When used in conjunction with GaLore's projection mechanism, the result is a highly memory-efficient training regime that does not compromise on model accuracy or convergence speed.

This combination is particularly effective in scenarios where memory is a critical bottleneck, such as training large models on consumer-grade hardware or deploying models in memory-constrained environments. It enables the use of more complex models and larger datasets within the same hardware constraints, pushing the boundaries of what can be achieved with limited resources.

Implementation Details

Integrating 8-bit optimizers with GaLore for training large language models (LLMs) involves quantizing the gradients, weights, and optimizer states to 8-bit representations. This quantization process significantly reduces the memory footprint, enabling the training of larger models or the use of larger batch sizes within the same memory constraints. The algorithmic details of this integration involve several key steps, some of which would benefit significantly from native CUDA implementation for efficiency gains. GaLore opens new possibilities to integrate these techniques even more tightly with quantization and specialized parameterization of the matrices, which can lead to further reductions in memory usage. We are currently exploring this direction in the bitsandbytes library.

Algorithmic Overview of 8-bit Optimization with GaLore

Gradient Projection: GaLore projects the full-precision gradients into a low-rank subspace using projection matrices. This step reduces the dimensionality of the gradients, which are then quantized to 8-bit format.

Quantization: The projected gradients, along with the model weights and optimizer states (such as the moving averages in Adam), are quantized from 32-bit floating-point to 8-bit integer representations. This involves scaling the floating-point values to the 8-bit range and rounding them to the nearest integer.

Optimizer Update: The 8-bit quantized gradients are used to update the model weights. This step involves de-quantizing the gradients back to floating-point format, applying the optimizer's update rule (e.g., Adam's moment update and parameter adjustment), and then quantizing the updated optimizer states back to 8-bit for storage.

De-quantization and Weight Update: The 8-bit quantized weights undergo de-quantization to a floating-point representation for processing, albeit retaining the 8-bit precision inherent to their quantized form due to the limited range of values. This step is needed because standard operations in frameworks like PyTorch do not support 8-bit integers, and such integer weights cannot accommodate gradients. While this approach does not inherently enhance accuracy, it facilitates the practical application and gradient computation of quantized weights within the constraints of current deep learning libraries. Note that after de-quantization and before applying the weight update, GaLore employs one more projection that projects de-quantized low-rank updates back to the original space.

Use it with Hugging Face Transformers

To use GaLore optimizers with the Hugging Face transformers library, you first need to update it to a version that supports GaLore optimizers, by either installing the latest update, i.e. pip install transformers>=4.39.0 or installing transformers from source.

Then install the galore-torch library with pip install galore-torch. Below is a full working example of GaLore with transformers, for pretraining Mistral-7B on the imdb dataset:

import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"]
)

model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model, 
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()

TrainingArguments: Simply pass a valid optim_target_modules (it supports a single string, regex, or a list of strings or regexes) as well as, for optim, a valid GaLore optimizer, such as galore_adamw, galore_adamw_8bit, galore_adafactor – and you’re good to go!

Layer-wise Updates

Another important point to mention are the layer-wise optimizers (i.e. updating weights one layer at a time). Typically, the optimizer performs a single weight update for all layers after backpropagation. This is done by storing the entire weight gradients in memory. By adopting layer-wise weight updates, we can further reduce the memory footprint during training. Under the hood, this is implemented with PyTorch post-accumulation hooks on the layers the users want to update.

To use this feature, simply append _layerwise to the optimizer names, for example galore_adamw_layerwise.

Conclusion

GaLore, with its innovative approach to leveraging the low-rank structure of gradients, represents a significant step forward in the memory-efficient training of LLMs. By enabling the training of billion-parameter models on consumer-grade hardware, reducing the memory footprint of optimizer states through projection techniques, and allowing for dynamic subspace switching, GaLore democratizes access to large-scale model training. The compatibility of GaLore with 8-bit precision optimizers further enhances its utility, offering a pathway to training larger and more complex models without the need for specialized computational resources. This opens up new possibilities for research and application in AI, making it an exciting time for practitioners and researchers alike.

Resources

Please refer to the original paper. Twitter references: 1 2 3. The paper also draws comparisons between GaLore and ReLoRA, which might be of interest to some readers. For readers with questions that remain unanswered, especially after review of the paper, or who would like to constructively discuss the results, please feel free to join the author’s Slack community. For those interested in further releases along these lines, please follow Jiawei Zhao and Titus von Koeller (for information on the latest bitsandbytes releases) as well as Younes Belkada for the latest and greatest infos on quantization-related topics within and around the Hugging Face ecosystem.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hugging Face - Blog

Scaling LLMs with Consumer-Grade Hardware

Memory Efficiency in Optimizer States

Subspace Switching and Advanced Projection Techniques

Combining GaLore with 8-bit Optimizers

Implementation Details

Algorithmic Overview of 8-bit Optimization with GaLore

Use it with Hugging Face Transformers

Layer-wise Updates

Conclusion

Resources