
























In this blog post, we present the key highlights of, and the rationale behind, the Falcon-Edge series - a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture.
Drawing from our experience with BitNet, Falcon-Edge introduces and validates a new pre-training paradigm that delivers a full-scope output from a single training process, simultaneously yielding both non-quantized and quantized model variants. This comprehensive approach produces a non-BitNet model in bfloat16 format, the native BitNet model, and a pre-quantized BitNet variant specifically engineered for effortless fine-tuning, enabling users and developers to precisely tailor these models to their specific applications and needs.
Available now in two sizes—1 Billion and 3 Billion parameters—each size comes in both base and instruction-tuned models. Discover the Falcon-Edge series on our dedicated Hugging Face collection.
Large Language Models (LLMs), by design, are inherently large and resource-intensive. As demand grows to deploy these models efficiently on edge devices, research into model compression has accelerated. Recent efforts, such as those by DeepSeek and Llama 4, explore training with reduced precision formats—down to FP8—to improve deployment scalability. On the other hand, many state-of-the-art methods emphasize post-training quantization. In contrast to these approaches, BitNet introduces a fundamentally different paradigm: unlike reduced-precision training which still relies on floating-point formats, and post-training quantization which adjusts weights after full-precision training, BitNet operates with the lowest possible precision — ternary weights ({-1, 0, 1}) — directly during training, enabling an end-to-end ultra-efficient model design.
These ternary weights are paving the way for a “matmul-free” LLM design that is notably faster and remarkably memory-efficient in practice. The primary challenge of this innovative approach is the necessity for pre-training BitNet models, which can be computationally demanding and costly for typical users.
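To make the "matmul-free" idea concrete, here is a minimal pure-Python sketch (not the kernels used by any BitNet runtime) of a matrix-vector product with ternary weights. The function names are illustrative assumptions. Because every weight is -1, 0, or 1, each output is built from additions and subtractions only:

```python
def ternary_dot(weights, activations):
    """Dot product where every weight is -1, 0, or +1.

    No multiplication is needed: +1 adds the activation,
    -1 subtracts it, and 0 skips it entirely.
    """
    total = 0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x
        elif w == -1:
            total -= x
        elif w != 0:
            raise ValueError(f"weight {w!r} is not ternary")
    return total


def ternary_matvec(weight_rows, activations):
    """Apply a ternary weight matrix (a list of rows) to an activation vector."""
    return [ternary_dot(row, activations) for row in weight_rows]


W = [[1, 0, -1, 1],
     [0, -1, -1, 0]]
x = [3, 5, 2, 7]
y = ternary_matvec(W, x)
print(y)  # [8, -7]
```

Real implementations vectorize this over packed weights, but the arithmetic they exploit is the same.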
Building on what our center has learned about pre-training data strategies, we pre-train our models on an internal data mixture for approximately 1.5 trillion tokens. We use the classic WSD (warmup-stable-decay) learning rate scheduler for pre-training.
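For readers unfamiliar with WSD (commonly expanded as warmup-stable-decay), the sketch below shows the general shape of such a schedule. The post does not give the actual hyperparameters, so the peak rate, minimum rate, phase fractions, and the linear decay shape are all placeholder assumptions:

```python
def wsd_learning_rate(step, total_steps, peak_lr=1e-3, min_lr=1e-5,
                      warmup_fraction=0.01, decay_fraction=0.1):
    """Warmup-stable-decay schedule (all default values are illustrative).

    1. Warmup: the rate rises linearly up to peak_lr.
    2. Stable: the rate stays at peak_lr for most of training.
    3. Decay: the rate falls linearly from peak_lr to min_lr.
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    decay_steps = max(1, int(total_steps * decay_fraction))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Sample the schedule at a few points of a 1000-step run.
for step in (0, 9, 500, 900, 950, 1000):
    print(step, wsd_learning_rate(step, total_steps=1000))
```

One practical appeal of this shape is that the long constant phase allows a run to be extended or branched before the final decay is applied.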
We evaluate our models (base and instruct versions) on the former Hugging Face leaderboard v2 benchmark and report the normalized results below in comparison with other models of similar size:
Additional results (leaderboard v1) comparing our instruction-tuned models with Microsoft’s new BitNet model:
Falcon-Edge performs on par with or better than models of comparable size on the leaderboard v2 tasks, demonstrating that it is possible to train powerful BitNet models on desired domains while remaining competitive on other tasks.
If we look closer at the formula of the BitNet linear layer for inference (in terms of Python code):
import torch.nn as nn
import torch.nn.functional as F

def activation_norm_quant(x):
    # Per-token absmax scaling of the activations into the int8 range
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)
    y = (x * scale).round().clamp_(-128, 127)
    return y, scale

class BitLinear(nn.Linear):
    # self.dtype, self.weight_scale and unpack_weights are assumed to be defined elsewhere
    def post_quant_process(self, input, input_scale, weight_scale):
        # Undo both the activation scaling and the weight scaling
        out = input / (input_scale * weight_scale)
        return out

    def forward(self, input):
        w = self.weight
        w_quant = unpack_weights(w, dtype=self.dtype)
        input_quant, input_scale = activation_norm_quant(input)
        y = F.linear(input_quant.to(self.dtype), w_quant)
        y = self.post_quant_process(y, input_scale, self.weight_scale)
        if self.bias is not None:
            y += self.bias.view(1, -1).expand_as(y)
        return y
The activation_norm_quant function quantizes the activations to int8 format; the activations are then recovered in half precision by dividing by x_scale. Since the model has been trained with fake 8-bit activation quantization, we argue that the following approximation holds:
x_quant, x_scale = activation_norm_quant(x)
x_approx = x_quant / x_scale  # x_approx ~= x
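The size of this approximation error can be checked on a toy example. The following is a pure-Python sketch of the same absmax int8 scheme applied to a single row of activations. The helper names are illustrative and no PyTorch is required. Without clamping, the round-trip error of each element is at most half a quantization step, that is 0.5 / x_scale:

```python
def activation_quant_int8(row):
    """Absmax int8 quantization of one row of activations."""
    max_abs = max(max(abs(v) for v in row), 1e-5)
    scale = 127.0 / max_abs
    quantized = [max(-128, min(127, round(v * scale))) for v in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to the original range."""
    return [q / scale for q in quantized]


x = [0.5, -1.27, 0.03, 1.0]
x_quant, x_scale = activation_quant_int8(x)
x_approx = dequantize(x_quant, x_scale)
max_error = max(abs(a - b) for a, b in zip(x, x_approx))

print(x_quant)  # [50, -127, 3, 100]
print(max_error <= 0.5 / x_scale + 1e-12)  # True
```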
Therefore, instead of quantizing the model post-training, injecting the weight scale after quantizing the weights should lead to a good enough “approximation” of the non-BitNet version of the model:
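def _weight_quant(w):
    # Absmean scaling, then rounding to the ternary set {-1, 0, 1}
    scale = 1.0 / w.abs().mean().clamp_(min=1e-05)
    u = (w * scale).round().clamp_(-1, 1)
    return u, scale

# state_dict holds the pre-quantized weights; _is_param_to_not_quantize is
# assumed to be defined elsewhere and to flag parameters kept in full precision
state_dict_quant = {}
for param_name, param_value in state_dict.items():
    if _is_param_to_not_quantize(param_name):
        continue
    param_value, param_scale = _weight_quant(param_value)
    # Inject the weight scale so the ternary weights are stored dequantized
    param_value = param_value / param_scale
    state_dict_quant[param_name] = param_value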
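A small worked example may help show what this scale injection produces. The sketch below re-implements the same absmean quantization in plain Python on a made-up list of six weights. The helper names and values are illustrative assumptions, not the released checkpoints. After dividing by the scale, every weight takes one of only three values: -1/scale, 0, or 1/scale.

```python
def weight_quant_ternary(weights):
    """Absmean ternary quantization of a flat list of weights."""
    mean_abs = max(sum(abs(w) for w in weights) / len(weights), 1e-5)
    scale = 1.0 / mean_abs
    ternary = [max(-1, min(1, round(w * scale))) for w in weights]
    return ternary, scale


def inject_scale(ternary, scale):
    """Store the ternary weights back in floating point (dequantized)."""
    return [t / scale for t in ternary]


w = [0.9, -0.05, 0.4, -1.2, 0.1, -0.35]  # mean absolute value is 0.5
w_ternary, w_scale = weight_quant_ternary(w)
w_dequant = inject_scale(w_ternary, w_scale)

print(w_ternary)  # [1, 0, 1, -1, 0, -1]
print(sorted(set(round(v, 6) for v in w_dequant)))  # [-0.5, 0.0, 0.5]
```

Since the floating-point tensor carries exactly the values the ternary layer would use, a standard linear layer can consume it directly, which is the intuition behind the bfloat16 variant.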
We confirm this by running end-to-end evaluations on the bfloat16 variant of our 1B and 3B base models and below are the results:
The bfloat16 counterparts of the models can be loaded directly via Hugging Face transformers by passing revision="bfloat16" in the from_pretrained function:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="bfloat16"
)
To the best of our knowledge, except for the most recent release from Microsoft, previous BitNet releases only provided the final quantized model, making it usable only for inference. Similarly to the release from Microsoft, we propose to extend the accessibility of research and application of BitNet models by releasing their pre-quantized weights. That way, users can either fine-tune on their target domain or continue pre-training from the BitNet checkpoint, as long as nn.Linear layers are replaced by BitnetLinear layers and the model is quantized to BitNet format after training. Since the released weights are the pre-quantized weights, performing text generation without replacing the nn.Linear layers with BitnetLinear layers will produce gibberish output.
The pre-quantized weights can be downloaded via Hugging Face's transformers library by specifying the revision argument to be prequantized:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="prequantized"
)
This way, we hope to help foster an ecosystem around the first powerful 1-bit fine-tunes by the community. We provide the community with the tools to get started easily and fine-tune their own versions of powerful BitNet models by packaging all the utility methods needed for fine-tuning on the pre-quantized weights in a Python package called onebitllms, which we cover in the next section.
onebitllms - a lightweight Python package for 1-bit LLM training
In this release, we also introduce onebitllms - a lightweight Python package that can be plugged into your favorite LLM fine-tuning tools in order to fine-tune any pre-quantized BitNet model. At the time of writing, onebitllms exposes these main functionalities:
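- Utility methods to convert a pre-quantized model checkpoint into BitNet training format, for use with fine-tuning frameworks such as the trl library.
- Utility methods to quantize the trained checkpoint to BitNet format as well as to bfloat16 format.
- Bare BitnetLinear layers and triton kernels that can be injected into and used in your pre-training framework.
Currently, only full fine-tuning is supported through this framework. Although the model sizes in this release are relatively small, supporting Parameter-Efficient Fine-Tuning (PEFT) methods for BitNet models remains an exciting and impactful open question for upcoming BitNet models.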
To get started, simply install the package directly through pip or from source, and take a look at the examples/ folder inside the source code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer
from onebitllms import replace_linear_with_bitnet_linear, quantize_to_1bit

model_id = "tiiuae/Falcon-E-1B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="prequantized"
)
model = replace_linear_with_bitnet_linear(model)
trainer = SFTTrainer(
    model,
    ...
)
trainer.train()
quantize_to_1bit(output_directory)
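The final quantization step leaves each weight with one of three values, so it can be stored far more compactly than a 16-bit float. How onebitllms lays out the quantized weights is not described in this post. The sketch below shows one plausible scheme, assumed purely for illustration, that packs four ternary values into each byte using 2 bits per value. The functions pack_ternary and unpack_ternary are hypothetical helpers, not part of the onebitllms API.

```python
def pack_ternary(values):
    """Pack ternary values (-1, 0, 1) into bytes, four values per byte."""
    packed = bytearray()
    for start in range(0, len(values), 4):
        byte = 0
        for offset, v in enumerate(values[start:start + 4]):
            if v not in (-1, 0, 1):
                raise ValueError(f"value {v!r} is not ternary")
            # Map -1, 0, 1 to the 2-bit codes 0, 1, 2
            byte |= (v + 1) << (2 * offset)
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary values from packed bytes."""
    values = []
    for byte in packed:
        for offset in range(4):
            values.append(((byte >> (2 * offset)) & 0b11) - 1)
    return values[:count]


weights = [1, 0, -1, 1, 0]
packed = pack_ternary(weights)
print(len(packed))  # 2
print(unpack_ternary(packed, len(weights)) == weights)  # True
```

At 2 bits per weight this is 8 times smaller than bfloat16 storage. The information-theoretic minimum for three states is log2(3), about 1.58 bits, which is where the "1.58-bit" name comes from.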
With this package, we hope to accelerate research and development around ternary format LLMs, and hope to see many derivations of Falcon-Edge and other future powerful BitNet models developed by the community.
We believe this release opens up multiple interesting directions - among all the possible follow up directions, we currently think that the following open questions will make BitNet models much more impactful in the near future:
- Following bitnet.cpp, we hope that this release will convince the research community to focus on developing powerful BitNet inference kernels for faster inference on GPUs, thus making BitNet models faster than native models on GPUs.
- Closing the performance gap between the BitNet checkpoint and its bfloat16 counterpart, thus making it fully performance degradation-free.
- The onebitllms package can serve as foundational work for creating the first multi-modal BitNet VLM (Vision Language Model), etc.
If you find this work useful for your research and work, please consider citing our work, as well as citing all the foundational work behind BitNet models:
@misc{tiionebitllms,
title = {Falcon-E, a series of powerful, universal and fine-tunable 1.58bit language models.},
author = {Falcon-LLM Team},
month = {May},
url = {https://falcon-lm.github.io/blog/falcon-edge},
year = {2025}
}
@misc{ma2025bitnetb1582b4ttechnical,
title={BitNet b1.58 2B4T Technical Report},
author={Shuming Ma and Hongyu Wang and Shaohan Huang and Xingxing Zhang and Ying Hu and Ting Song and Yan Xia and Furu Wei},
year={2025},
eprint={2504.12285},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.12285},
}
@misc{wang2025bitnetcppefficientedgeinference,
title={Bitnet.cpp: Efficient Edge Inference for Ternary LLMs},
author={Jinheng Wang and Hansong Zhou and Ting Song and Shijie Cao and Yan Xia and Ting Cao and Jianyu Wei and Shuming Ma and Hongyu Wang and Furu Wei},
year={2025},
eprint={2502.11880},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.11880},
}
@misc{mekkouri2024extremequantization,
title={1.58-Bit LLM: A New Era of Extreme Quantization},
author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
year={2024},
}
@misc{ma2024era1bitllmslarge,
title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
author={Shuming Ma and Hongyu Wang and Lingxiao Ma and Lei Wang and Wenhui Wang and Shaohan Huang and Li Dong and Ruiping Wang and Jilong Xue and Furu Wei},
year={2024},
eprint={2402.17764},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2402.17764},
}
@misc{wang2023bitnetscaling1bittransformers,
title={BitNet: Scaling 1-bit Transformers for Large Language Models},
author={Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Huaijie Wang and Lingxiao Ma and Fan Yang and Ruiping Wang and Yi Wu and Furu Wei},
year={2023},
eprint={2310.11453},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2310.11453},
}