GitHub - cnygaard/glq: E8 lattice codebook quantization for LLM weights — 2/3/4 bpw with fused Triton inference kernel

Post-training weight quantization for LLMs using E8 lattice codebooks.

GLQ encodes each 8-weight group as a 16-bit index into a 65,536-entry E8 lattice codebook. A Randomized Hadamard Transform (RHT) decorrelates the Hessian so that Euclidean nearest-neighbour search is near-optimal under the proxy loss. The result: 2–8 bpw weights with quality comparable to QuIP# / better than GPTQ, and a fused CUDA kernel that matmuls directly against the compressed indices without materializing the weight matrix.

Quickstart

Run a pre-quantized model

pip install glq         # requires PyTorch ≥ 2.0

Python ≥ 3.10. Triton ships with PyTorch on CUDA and is used automatically. The CUDA C extension JIT-builds on first run (~30 s); CPU falls back to dequantize-then-matmul.

import glq.hf_integration  # registers GLQ with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw",
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw")
print(tok.decode(model.generate(
    **tok("The capital of France is", return_tensors="pt").to(model.device),
    max_new_tokens=20,
)[0], skip_special_tokens=True))

import glq.hf_integration registers quant_method="glq" with HF Transformers; from_pretrained then swaps nn.Linear for E8RHTLinear and uses the fused CUDA C kernel on inference. CPU falls back to a naive dequantize-then-matmul.

Available pre-quantized checkpoints

Repo	Base model	bpw	License	VRAM¹	Tok/s² (b1 / b32)
`xv0y5ncu/SmolLM2-135M-Instruct-GLQ-4bpw`	SmolLM2-135M-Instruct	4.0	Apache 2.0	0.18	152 / 4205
`xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw`	SmolLM2-360M-Instruct	4.0	Apache 2.0	0.33	135 / 2990
`xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw`	SmolLM3-3B	3.5 (mixed)	Apache 2.0	2.4	35 / 654
`xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw`	Gemma-4-E4B-it	4.0	Apache 2.0	5.8	33 / 600
`xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw`	Devstral-Small 24B	4.0	Apache 2.0	~20.5	6.6 / —
`xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-4bpw`	Nemotron-3-Nano-30B (Mamba-MoE)	4.0	Nemotron	—	—

_{¹ VRAM = resident weight footprint at load (vLLM's Model loading took … GiB)
on a g6e.xlarge (NVIDIA L40S) — the figure that decides whether a model fits a
24/32 GB card; it tracks the bpw budget. Devstral ≈ 20.5 GiB is its HF-transformers
load; Nemotron-30B not measured here.

² Tok/s = total decode throughput, weight-only GLQ, vLLM 0.20.2, short context
(256 generated tokens), same hardware. b1 = single-stream, b32 = 32 concurrent
sequences — a high-batch sample near the throughput knee, not a hard maximum.
Devstral-24B is HF-transformers single-stream (vLLM v1 deadlocks; no batched figure;
see CUDA-graph decode wrapper). Nemotron-3-Nano-30B is a
Mamba-MoE (vLLM-unsupported here, compute-bound) — not benchmarked. E8-KV compression
leaves these short-context numbers unchanged; its payoff is a ~4× smaller KV cache →
more context / concurrency in the same VRAM (see KV cache compression).}

Quantize your own model

pip install 'glq[quantize]'    # adds transformers, datasets, etc.

glq-quantize \
    --model HuggingFaceTB/SmolLM2-360M \
    --output ./smollm2-glq-4bpw \
    --bpw 4 \
    --nsamples 128 \
    --device cuda

Other bit-widths: pass --bpw 2 through --bpw 8 (fractional like 2.5 also works). glq-quantize --help lists every flag. For models that don't fit in system RAM use --streaming (loads one layer at a time from safetensors).

For mixed-precision allocation, run a two-pass flow: a profile pass writes a per-layer bpw_allocation.json, then a quantize pass applies it. See examples/quantize_mixed_precision.md.

Docker image (NVIDIA GPU)

A prebuilt CUDA image ships everything needed to run GLQ models — glq, PyTorch, vLLM, transformers, and lm-eval on CUDA 12.8:

ghcr.io/cnygaard/glq-env:0.5.0     # also :latest, :0.5

Prerequisite — GPU access in Docker. You need an NVIDIA GPU plus the NVIDIA Container Toolkit installed on the host; that's what makes the --gpus all flag pass the GPU into the container. Verify it works:

docker run --rm --gpus all ghcr.io/cnygaard/glq-env:0.5.0 nvidia-smi

If that prints your GPU table, you're set. (No toolkit → --gpus errors with "could not select device driver".)

Produce output. Mount a host directory for the model cache (the image's HF_HOME is /cache/hf, so models persist across runs instead of re-downloading), then generate:

docker run --rm --gpus all \
    -v "$HOME/.cache/huggingface:/cache/hf" \
    ghcr.io/cnygaard/glq-env:0.5.0 \
    python -c '
import glq.hf_integration, torch                      # registers GLQ with HF
from transformers import AutoModelForCausalLM, AutoTokenizer
mid = "xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid, device_map="cuda", torch_dtype=torch.float16)
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0],
                 skip_special_tokens=True))
'

Expected output:

The capital of France is Paris. It is located in the north of the country.

The first run downloads the model into the mounted cache; later runs reuse it. Swap mid for any GLQ checkpoint (see Available pre-quantized checkpoints).

Flag reference:

Flag	Why
`--gpus all`	binds all host GPUs into the container (needs the NVIDIA Container Toolkit). Use `--gpus '"device=0"'` to pick one.
`-v "$HOME/.cache/huggingface:/cache/hf"`	persists downloaded weights on the host (`HF_HOME=/cache/hf` inside) so they survive `--rm`.
`--rm`	remove the container when it exits (drop it to keep the container around).

Serving (vLLM) & an interactive shell. The image bundles vLLM, so you can serve an OpenAI-compatible endpoint — publish the port and mount the cache:

# Plain chat — the model ships its own chat template, so nothing extra needed:
docker run --rm --gpus all -p 8000:8000 \
    -v "$HOME/.cache/huggingface:/cache/hf" \
    ghcr.io/cnygaard/glq-env:latest \
    vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw --max-model-len 64000

Tool-calling + thinking. Gemma-4's tool template is not in the model (its bundled chat_template.jinja is plain chat) and not in the vLLM pip wheel, so fetch it from vLLM's examples/ first, then mount it:

curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/v0.20.2/examples/tool_chat_template_gemma4.jinja \
    -o tool_chat_template_gemma4.jinja

docker run --rm --gpus all -p 8000:8000 \
    -v "$HOME/.cache/huggingface:/cache/hf" \
    -v "$PWD/tool_chat_template_gemma4.jinja:/work/tool.jinja:ro" \
    ghcr.io/cnygaard/glq-env:latest \
    vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw \
        --max-model-len 64000 \
        --enable-auto-tool-choice \
        --tool-call-parser gemma4 \
        --reasoning-parser gemma4 \
        --chat-template /work/tool.jinja \
        --default-chat-template-kwargs '{"enable_thinking": true}'

Pass --chat-template the in-container mount path (/work/tool.jinja), not the host path. --default-chat-template-kwargs '{"enable_thinking": true}' defaults Gemma-4 reasoning on. The gemma4 parsers and all of these flags are accepted by the image's bundled vLLM 0.20.2 and the model loads; note that startup runs a multi-minute torch.compile + CUDA-graph capture before the endpoint is ready. See the vLLM Gemma-4 recipe for the full tool-calling / reasoning reference.

Image vLLM note: in-image vLLM serving needs an image built from the v0.5.2 Dockerfile fix or later (vLLM now resolves its own matching-CUDA torch). The published :0.5.0 / :0.5.1 snapshots predate that fix and hit ImportError: libcudart.so.13 under --gpus all; on those, use the HF generate path above (works), or pip install glq + your own vLLM. The pip package is unaffected on all versions.

For the long-context E8 KV-cache flags (GLQ_KV_*), pass them with -e and see E8 lattice cache / Inline-dequant E8 KV. The image's default command is a shell (docker run --rm -it --gpus all ghcr.io/cnygaard/glq-env:latest) if you'd rather poke around interactively.

Results

SmolLM3-3B at matched 4.5 bpw vs GPTQ

Blackwell RTX PRO 6000, 128 calibration samples, lm-evaluation-harness limit=200/task (GSM8K n=500, MMLU 50/subtask). GLQ 4.5 bpw uses two-pass mixed allocation (91 layers @ 4 bpw + 161 @ 5 bpw, avg 4.64 bpw).

Task	bf16	GLQ 4.5 bpw	GPTQ W4 g128
ARC-challenge (acc_n)	0.490	0.475	0.420
ARC-easy (acc_n)	0.745	0.735	0.695
HellaSwag (acc_n)	0.660	0.660	0.675
MMLU (acc)	0.617	0.603	0.589
TruthfulQA mc2	0.529	0.545	0.515
WinoGrande	0.655	0.660	0.670
WikiText-2 ppl ↓	10.67	10.90	11.33
GSM8K flex (n=500)	0.722	0.738	0.688
IFEval prompt-strict	0.310	0.310	0.285
IFEval prompt-loose	0.325	0.330	0.295
IFEval inst-strict	0.478	0.472	0.453
IFEval inst-loose	0.494	0.491	0.469

GLQ beats GPTQ on 10/12 metrics. WikiText-2 ppl gap to bf16: +2.2 % (GLQ) vs +6.2 % (GPTQ). GSM8K flex matches bf16; GPTQ drops 0.034.

Small models: SmolLM2-360M-Instruct at 4 bpw

GPTQ requires a group-size dividing the hidden dim; SmolLM2-360M's hidden=960 is not divisible by 128, forcing group_size=64 (~4.5 eff bpw) and losing quality. GLQ has no group-size constraint.

Method	bpw	5-task avg	% of bf16
bf16	16.0	0.557	100 %
GLQ 4-bit	4.0	0.555	99.6 %
GPTQ W4 (g64)	~4.5	0.486	87.2 %

5-task = ARC-e, HellaSwag, PIQA, WinoGrande, LAMBADA; 128 calibration samples; L40S. GPTQ's LAMBADA collapses to 0.346; GLQ preserves 0.508.

Throughput: SmolLM3-3B on vLLM

GLQ runs at near-bf16 throughput because compressed weights cut DRAM bandwidth enough to roughly offset the dequantization cost.

Method	bpw	Single req	Batch=5	vs bf16
bf16	16.0	39.4 tok/s	184 tok/s	100 %
GLQ 3.5bpw	3.5	37.1 tok/s	173 tok/s	94 %
GPTQ W4 (g128)	~4.5	34.6 tok/s	172 tok/s	88 %

vLLM 0.18.1, L40S.

How it works

E8 lattice codebook. 65,536 vectors from the first seven shells of the E8 lattice in 8 dimensions. Each 8-weight group of the weight matrix is encoded as one 16-bit index into this codebook (so the primary stage is 2 bpw). For 3–8 bpw, additional 8-bit (256-entry) or 16-bit (E8) residual codebooks refine the primary's reconstruction error.
Randomized Hadamard Transform. Random sign flips followed by Fast Walsh-Hadamard Transform rotate both weights and Hessian. After RHT the Hessian is approximately diagonal, so plain Euclidean nearest-neighbour in the codebook is near-optimal under the Hessian-weighted proxy loss.
LDLQ error feedback. Block-LDL decomposition of the Hessian drives a sequential sweep — GPTQ-style, but over 8-D blocks instead of scalar columns. Each block's quantization error propagates forward to correct downstream blocks.
Fused inference kernels. Custom CUDA C and Triton kernels read codebook indices from HBM, gather the 8-D vectors from the L2-cached 1 MB codebook, and accumulate the matmul directly — the dense weight matrix is never materialized. GPU memory savings scale with the compression ratio.

KV cache compression

GLQ ships two KV cache compressors. Either is opt-in — default behaviour is unchanged.

INT8 cache (HF transformers)

Per-channel absmax INT8 plus a small fp16 residual window for recent tokens — KIVI-style. Halves the KV memory at long context.

import glq.hf_integration
from glq.kv_cache import GLQQuantizedCache

cache = GLQQuantizedCache(model.config)
output = model.generate(**inputs, max_new_tokens=200,
                         past_key_values=cache)

Requires transformers >= 4.45. No external dependencies.

E8 lattice cache (vLLM, v0.3.0+)

Drops vLLM's paged KV cache to ~25 % of fp16 footprint using the same E8 lattice quantizer used for weights. Two fused Triton kernels (read-side dequant-gather, write-side scatter) keep decode within ~20 % of un-fused throughput.

Measured on Gemma-4-E4B-it, RTX PRO 6000 Blackwell, vLLM 0.20:

	fp16 baseline	E8 lattice
KV cache capacity @ 27.9 GiB	303,984 tokens	1,221,232 (4.02×) at `e8_relaxed:1`
mmlu_pro n=240 accuracy	71.25 %	71.25 % (bit-identical) at `e8_relaxed:2`
NIAH passkey @ ctx=16k / 32k / 64k / 130k	—	40/40 at `e8_relaxed:2` (full 128k window)
`cudaLaunchKernel` per decode	110,659	71,619 (−35 %) at `e8_relaxed:2`

Activation:

GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
vllm serve google/gemma-4-E4B-it

The envs above use the workspace path: GLQ pre-decompresses the referenced K/V into a scratch buffer, then calls vLLM's stock attention. Because that buffer is built with a data-dependent block_table.unique(), glq auto-forces cudagraph_mode=PIECEWISE for this path (you'll see [glq_vllm] E8 KV active → cudagraph_mode forced ... to PIECEWISE at startup; --enforce-eager is no longer required as of v0.3.5). Weight-only GLQ still uses the default FULL_AND_PIECEWISE. The v0.5 inline-dequant path below lifts the PIECEWISE restriction and is the recommended path for long-context / KV-bound serving.

Validated end-to-end on Gemma-4-E4B-it / Gemma-4-31B-it on vLLM 0.20.x.

Inline-dequant E8 KV (default in v0.5.1)

The workspace path above pre-decompresses K/V into a scratch buffer that vLLM's attention then re-reads — pure overhead, since each K/V vector is read exactly once. The inline-dequant path instead dequantizes the compressed E8 K/V inside a forked Triton attention kernel (an 8-point FHT butterfly for the inverse Hadamard, plus flash-decoding KV-split for long-context occupancy). There is no workspace, and — because the read/write hooks are host-sync-clean — the FULL CUDA graph captures the whole decode, eliminating the per-token eager-dispatch overhead that dominated E8-KV decode.

As of v0.5.1 this is the default for the E8-KV path — the standard bundle is all you need (no extra flag):

GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
vllm serve xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw

Opt out with GLQ_KV_E8_INLINE_DEQUANT_V3=0 (reverts to the 65 K workspace path) or GLQ_KV_E8_FORCE_PIECEWISE=1 (keeps inline but disables the FULL decode graph).

Decode throughput, SmolLM3-3B-GLQ-3.5bpw, RTX PRO 6000 Blackwell, vLLM 0.20.2 — inline vs the pre-v0.5 E8-KV path (workspace, PIECEWISE):

	E8 KV before v0.5	inline (v0.5)
decode B=1	~15 tok/s	38 (2.5×)
decode B=4	~37	127 (3.4×)
decode @ ctx=16k, B=1	~15	36 (2.4×)

The speedup is the FULL-graph capture the inline path unlocks; it brings E8-KV decode to roughly weight-only parity. On Gemma-4-E4B-it (large heads, already compute-bound) decode is roughly unchanged, but quality and long-context behaviour match.

Quality is neutral. On SmolLM3 the inline-FULL path is bit-identical to PIECEWISE (MMLU-Pro n=120 and NIAH-16k match exactly). On Gemma-4 it lands within vLLM's own run-to-run greedy non-determinism — MMLU-Pro n=120, thinking, 16384-token budget: PIECEWISE 0.742 vs inline-FULL 0.750 (a smaller gap than two PIECEWISE runs differ from each other), NIAH-16k 10/10 both.

Scope. It covers the 4 bpw KV recipe (e8_relaxed:2); other recipes automatically fall back to the workspace path. It requires the Triton attention backend (auto-forced when E8 KV is active). Validated across the consumer GPU lineup — A10G (sm_86, 3090-class, 24 GB), L40S (sm_89, 4090-class), and RTX PRO 6000 Blackwell (sm_120, 5090-class): the kernels compile and NIAH-16k + MMLU are correct on all three, and FULL-vs-PIECEWISE is quality-neutral on Blackwell (the consumer-card runs are shorter FULL-only smokes). Opt out per above.

Advanced

CUDA-graph decode wrapper

The B=1 autoregressive decode path is Python-dispatch-bound in eager mode. CUDAGraphWrapper captures the fixed-shape decode and replays it; benchmarks below are on SmolLM3-3B 3.5bpw, L40S.

Mode	GLQ 3.5 bpw	bf16
Eager	25 tok/s	40
CUDA graph	37 tok/s	40

from glq.cuda_graph import CUDAGraphWrapper
wrapper = CUDAGraphWrapper(model)
logits = wrapper(input_ids)   # first call captures; replays after

The wrapper falls back to eager for variable shapes (prefill, batch>1, extra kwargs). For 24B models the matmul is compute-bound at B=1, so graphs don't help (Devstral-24B GLQ 4 bpw: 6.6 tok/s eager vs 6.4 graphed).

Tuning vLLM CUDA-graph capture sizes (v0.3.4+)

vLLM 0.20 captures both FULL model-forward graphs (single replay per fixed shape) and PIECEWISE subgraphs split at attention. The default capture set is derived from max_num_seqs * 2, so a single-sequence harness only gets FULL captures for [1, 2]. For batched serving, raise the list explicitly:

from vllm import LLM
llm = LLM(model="xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw",
          compilation_config={
              "cudagraph_capture_sizes": [1, 2, 4, 8, 16],
          })

Measured impact on Gemma-4-E4B-it-GLQ-4bpw, RTX PRO 6000 Blackwell, 256-token decode:

Mode	B=1 tok/s	B=4 tok/s (total)
Eager	14.4	35.0
Piecewise + default capture `[1, 2]`	39.4	132.7
Piecewise + capture `[1, 2, 4, 8, 16]`	40.0	157.3 (+18.5 %)

At B=1 the FULL graph was already captured (no change). At B=4 the extended list keeps the FULL graph active where the default degenerated to PIECEWISE-only, recovering ~6 tok/s per sequence.

Cost: ~10-20 MB VRAM per captured shape on 3B / E4B models (vLLM prints the total at "Graph capturing finished in N s, took X GiB"). On 24-31B models budget ~100-200 MB per shape. Capture time is ~1 s per shape, one-time at LLM init.

Bit widths

bpw	Primary	Residual stages
2	16 b	—
3	16 b	+ 8 b
4	16 b	+ 16 b
5	16 b	+ 16 b + 8 b
6	16 b	+ 16 b + 16 b
7	16 b	+ 16 b + 16 b + 8 b
8	16 b	+ 16 b + 16 b + 16 b

One global scale per layer; no group-size parameter. Non-power-of-2 hidden sizes use block-diagonal FHT (v0.2.9+) — e.g. 2688 is decomposed as 2048 + 512 + 128 so on-disk storage matches the nominal rate exactly.

Serving with sglang

A fork of sglang with GLQ support lives at cnygaard/sglang on the glq-quantization branch. It registers "glq" as a quantization method and reuses the existing glq.inference_kernel CUDA extension as a runtime dependency.

git clone -b glq-quantization https://github.com/cnygaard/sglang
cd sglang/python && pip install -e .

python -m sglang.launch_server \
    --model xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw \
    --tokenizer-path HuggingFaceTB/SmolLM2-360M-Instruct \
    --quantization glq \
    --attention-backend triton --sampling-backend pytorch

Requires the triton attention backend (flashinfer returns wrong logprobs in echo/prefill mode). Default CUDA-graph capture is supported (v0.3.2+). If you hit a graph-break in a model architecture we haven't tested, pass --disable-piecewise-cuda-graph as a fallback.

Devstral-24B tokenizer

transformers 5.x auto-routes Mistral/Devstral models through mistral_common, which rejects the standard tokenizer.json. Use PreTrainedTokenizerFast explicitly:

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

path = snapshot_download("xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw")
tok = PreTrainedTokenizerFast(tokenizer_file=f"{path}/tokenizer.json")
tok.pad_token, tok.eos_token, tok.bos_token = "<pad>", "</s>", "<s>"
model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw",
    device_map="cuda", dtype="float16",
)

examples/inference_hf.py includes a load_tokenizer() helper that handles this automatically.

transformers compatibility

For models ≤ 1B parameters use transformers >= 5.0. Transformers 4.57.x has a weight-loading bug that produces garbage output for small GLQ models. Larger models (3B+) work with both 4.x and 5.x.

Inference kernels

glq/inference_kernel.py + glq/csrc/glq_cuda.cu provide CUDA C and Triton kernels that compute Y = X @ dequant(W)^T without materializing the weight matrix. Each kernel iterates over N/8 codebook blocks per output row, gathers 8-D vectors from the L2-cached codebook, and accumulates the matmul directly against indices.

Path	When	Notes
CUDA C Tensor Core	B ≥ 2 (prefill)	inline PTX `mma.sync` against codebook-loaded registers; 3-5× faster than Triton
CUDA C split-K matvec	B = 1 (decode)	4 rows/warp + `__shfl_xor_sync` reduction; 2.7× faster than Triton
CUDA C shared-mem FHT	RHT step	double-buffered butterfly; 1.6-3× faster than Triton
Triton fallback	no `ninja`, or `n_pad > 32 768`	always available

Bit-exact determinism. Every kernel uses a scratch-buffer + fixed- order reduction instead of atomicAdd across k-splits, so running the same prompt at B=1 decode or B=8 prefill produces identical logits across runs — required for reproducible lm-eval scoring and on-policy RL rollouts.

Direct kernel access:

from glq.inference_kernel import glq_dequant_matmul
y = glq_dequant_matmul(x, Qidxs, codebook, Wscale,
                       Qidxs2=Qidxs2, codebook2=codebook2,
                       inv_resid_scale=inv_rs)  # 3/4 bpw two-stage

Architecture

glq/
  codebook.py          # E8ShellCodebook: enumeration, encode/decode
  hadamard.py          # Fast Walsh-Hadamard Transform
  rht.py               # Randomized Hadamard Transform
  ldlq.py              # Block-LDL quantization with error feedback
  quantize_model.py    # Full model pipeline + CLI
  quantized_linear.py  # E8RHTLinear: drop-in nn.Linear replacement
  inference_kernel.py  # Triton kernels + CUDA dispatch
  csrc/glq_cuda.cu     # CUDA C kernels (split-K matvec, TC, FHT)
  hf_integration.py    # HuggingFace Transformers integration
  kv_cache.py          # INT8 quantized KV cache
  cuda_graph.py        # B=1 decode wrapper
glq_vllm/              # vLLM integration: weight + KV cache (v0.3.0+)

Acknowledgments

Inspired by QuIP# (Tseng et al., 2024).

E8 lattice: Korkin & Zolotarev (1872); Gosset (1900); Conway & Sloane, Sphere Packings, Lattices and Groups; Viazovska (2016) — sphere-packing optimality in 8 dimensions.
Block-feedback quantization: GPTQ (Frantar et al., 2022).
INT8 KV cache: KIVI (Liu et al., 2024).

License

Apache 2.0

推荐订阅源

Hacker News - Newest: "LLM"