惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

The Hacker News
The Hacker News
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
雷峰网
雷峰网
人人都是产品经理
人人都是产品经理
Recent Announcements
Recent Announcements
D
DataBreaches.Net
P
Proofpoint News Feed
V
Visual Studio Blog
J
Java Code Geeks
Recorded Future
Recorded Future
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
F
Full Disclosure
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
The GitHub Blog
The GitHub Blog
Engineering at Meta
Engineering at Meta
C
Cybersecurity and Infrastructure Security Agency CISA
V
Vulnerabilities – Threatpost
罗磊的独立博客
Jina AI
Jina AI
博客园 - 【当耐特】
C
CERT Recently Published Vulnerability Notes
G
GRAHAM CLULEY
Y
Y Combinator Blog
L
LangChain Blog
L
LINUX DO - 热门话题
宝玉的分享
宝玉的分享
月光博客
月光博客
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
H
Help Net Security
云风的 BLOG
云风的 BLOG
C
CXSECURITY Database RSS Feed - CXSecurity.com
博客园_首页
A
About on SuperTechFans
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
Latest news
Latest news
T
Threatpost
T
Tenable Blog
有赞技术团队
有赞技术团队
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
Stack Overflow Blog
Stack Overflow Blog
C
Cisco Blogs
C
Check Point Blog
T
Tor Project blog
T
Threat Research - Cisco Blogs
T
The Exploit Database - CXSecurity.com
S
Schneier on Security
美团技术团队
I
Intezer
S
Securelist
AWS News Blog
AWS News Blog

Hacker News: Show HN

PurrrrrFocus: Pomodoro Timer App - App Store Workflow Engine — Multi-Step Orchestration for Bun RapidPhoto: Pro Photo Editor App - App Store GitHub - DheerG/swarms: Achieve extraordinary results with claude code across a variety of tasks SPICE simulation → oscilloscope → verification with Claude Code — Lucas Gerads Show HN: VCoding – A 5 MB native Windows IDE with no dynamic dependencies Show HN: LLMs don't hallucinate because they're bad at math, it's the format GitHub - Agent-FM/agentfm-core: AgentFM is a peer-to-peer network that turns everyday computers into a decentralized AI supercomputer. AgentFM lets you run massive AI workloads directly across a global mesh of idle CPUs and GPUs. Show HN: Tracking Top US Science Olympiad Alumni over Last 25 Years GitHub - Potarix/agent-hub: One place to talk to all your agents Show HN: Runtime security for AI agents(injection,tool abuse, data exfiltration) GitHub - dubeyKartikay/lazyspotify: Terminal Spotify client for macOS and Linux GitHub - the-banana-tool/king-louie: Easy to use GUI Personal AI Assistant. Win/Linux/Mac. Show HN I made my vacation rental bookable by AI agents–no Airbnb, 0% commission GitHub - basteez/jsf-autoreload: maven plugin to enable hot reload on jsf projects uvm32/hosts/host-gdbstub at main · ringtailsoftware/uvm32 GitHub - labsai/EDDI: Config-driven engine that turns JSON into production-grade AI agents. Multi-agent orchestration, 12+ LLM providers, MCP/A2A protocols, RAG, persistent memory, and enterprise compliance (EU AI Act, GDPR, HIPAA). Built on Quarkus. GitHub - glitchnsec/fortyone-oss: AI Executive Assistant Platform Quickstart | Alien GitHub - muxshed/shed: One stream in, or many. Every destination, simultaneously. No cloud middleman, no per-channel fees, no limits. GitHub - ocrbase-hq/ocrbase: 📄 PDF/IMG ->.MD/JSON Document OCR API for PaddleOCR and GLMOCR. Self-hostable. GitHub - impactjo/home-memory: MCP server that lets your AI assistant remember everything about your home. GitHub - Sets88/dbcls: DbCls is a powerful terminal database client that supports various databases GitHub - neptun2000/heor-agent-mcp GitHub - SeanFDZ/macmind: Single-layer transformer in HyperTalk for the classic Macintosh RollQuation: Math Puzzles - Apps on Google Play GitHub - dropbox/witchcraft Show HN: Agent-cache – Multi-tier LLM/tool/session caching for Valkey and Redis GitHub - opentalon/opentalon: OpenTalon is an open-source platform built from the ground up in Go as a robust alternative to OpenClaw LinkedIn™ 职位抓取工具 - Chrome 应用商店 GitHub - EdoardoBambini/Agent-Armor-Iaga: AI agents are getting tool access — shell, file system, databases, APIs, secrets. But **nobody is governing what they actually do with it**. Frameworks like LangChain, CrewAI, AutoGen, and Claude Code give agents the power to execute. Agent Armor gives you the power to control, audit, and approve every single action before it happens. HN Vibes — Week 15, Apr 7–13 2026 GitHub - chojs23/ec: Easy terminal-native 3-way git mergetool vim-like workflow GitHub - SethPyle376/hiraeth: Local AWS emulator focused on fast integration testing, with SQS support, SQLite-backed state, and a debug-friendly web UI. GitHub - JakOb-dotcom/cloud-sandbox-security-analysis: Technical analysis and Proof of Concept (PoC) regarding environment variable exfiltration in containerized cloud sandboxes via side-channel data leaks. Springboards - Flint Alpha Show HN: A simpler coding agent harness GitHub - audiodude/sudomake-friends GitHub - 256thFission/mini-mythos: OSS clone of Anthropic’s Mythos harness to locate C/C++ memory vulnerabilities Show HN: OpenParallax: OS-level privilege separation for AI agent execution Hacker News Sorted - Chrome 应用商店 Show HN: How to Install Docker on Ubuntu 24.04 LTS: Complete 2026 Guide GitHub - himanshudongre/smriti GitHub - sverrirsig/claude-control: macOS desktop dashboard for monitoring and managing multiple Claude Code sessions GitHub - ory/dockertest: Write better integration tests! Dockertest helps you boot up ephermal docker images for your Go tests with minimal work. Chiral - Chrome 应用商店 Show HN: Two Claudes collaborating through shared memory on a $100 mini-PC GitHub - pmichaillat/latex-cv: Minimalist LaTeX template for academic CVs GitHub - oguzbilgic/posse: A web UI for Anthropic Managed Agents. GitHub - sshiraz/depsly: Dependency risk analysis tool for npm packages ABI Add safari/agent-harness — Safari browser automation via safari-mcp by achiya-automation · Pull Request #212 · HKUDS/CLI-Anything GitHub - Halfblood-Prince/trustcheck: Verify PyPI package attestations and improve Python supply-chain security GitHub - oguzbilgic/kern-ai: Agents that do the work and show it. GitHub - bruits/satteri: High-performance Markdown and MDX processing for the JavaScript ecosystem GitHub - tylergibbs1/feedstock: High-performance web crawler and scraper for TypeScript, powered by Bun and Playwright GitHub - Grimm67123/grimmbot: The self-improving sandboxed and open-source AI agent. With persistent memory and scheduling. GitHub - whitevanillaskies/whitebloom: Local whiteboard that blooms. GitHub - hwdsl2/docker-whisper: Docker image for a self-hosted Whisper speech-to-text server with speaker diarization and OpenAI-compatible transcription and translation APIs. Powered by faster-whisper. Supports all Whisper models, NVIDIA GPU (CUDA) acceleration, JSON/SRT/VTT output, SSE streaming, offline mode, and multi-arch (amd64, arm64). GitHub - yisding/reviewwiggum GitHub - MarwanAlsoltany/serrors: Structured errors for Go: sentinel hierarchies, typed data, custom formatting, and slog integration. GitHub - soatok/age-php GitHub - Luthiraa/markitme GitHub - stagas/rtdiff: realtime git diff gui and AI-assisted commits GitHub - tombedor/excalicharts GitHub - wh1le/excalidraw-edit: Open and edit .excalidraw files from the terminal. Offline, auto-saves to disk. MalExt Sentry - Malicious Extension Scanner - Chrome 应用商店 GitHub - syi0808/asciianimesvg: Generate animated ASCII art SVGs from text. CLI, Rust library, WASM, and web editor. GitHub - zaina-ml/ml_forge: A visual-based graph node editor for training computer vision models. GitHub - anakin87/llm-rl-environments-lil-course: 🌱 A little course on Reinforcement Learning Environments for evaluating and training Language Models GitHub - takaakit/superpowers-uml: Superpowers-UML modifies Superpowers to ensure a software development workflow in which AI agents design through UML modeling. AdriByte Studio - Sviluppo Web e Soluzioni Digitali GitHub - chouligi/angel-copilot: Your personalized Angel Investment Advisor Show HN: MoodSense AI (ML and FastAPI and Gradio, Deployed on Hugging Face) Moodsense Ai - a Hugging Face Space by aman179102 GitHub - agenteractai/lodmem: Level Of Detail Context Management for Agents GitHub - ostefani/subnetlens: A fast, concurrent network scanner with a TUI and plain-text CLI, built in Go. It discovers live hosts on your network, scans their open ports, resolves hostnames, and fingerprints operating systems—delivered. Cyber Pulse: Agentic Intel - Apps on Google Play Whisper API: Self-Hostable Speech to Text Transcription The Agent-Web Protocol Stack: A Research Thesis GitHub - msmarkgu/RelayFreeLLM: A restful API designed to route user prompts to various AI model providers. Show HN: Provepy – A Python decorator that proves your code using Lean and LLMs Show HN: Pardonned.com – A searchable database of US Pardons GitHub - patrickdappollonio/dux: Dux is a terminal UI that lets you run multiple AI coding agents side by side, each in its own git worktree, with full companion terminals, macros, commit generation, and a command palette that knows more tricks than you do. kMC Crystal Simulator Show HN: HyperFlow – A self-improving agent framework built on LangGraph GitHub - stef41/vibescore: 🎵 Grade your vibe-coded project. One command, instant letter grade across security, quality, dependencies, and testing. GitHub - stef41/lmscan: 🔍 Detect AI-generated text and fingerprint which LLM wrote it. Open-source GPTZero alternative. Zero dependencies, works offline. imgur.com GitHub - visionscaper/collabmem: Enabling long-term collaboration with Agentic AI - building up episodic and world model memory over time with in-context awareness 在 Steam 上购买 FriedrichAI: Offline AI 立省 10% GitHub - atripati/ark: AI Runtime Kernel — a context operating system for AI agents. Eliminates tool bloat, loads only what’s needed, and gives LLMs their reasoning space back. GitHub - nowork-studio/toprank: Open-source Claude Code skills for SEO, SEM, Google Ads GitHub - tacomanator/sash: Lightweight macOS menu bar app for reliably cycling through windows of the current application. Appents | Social Media Management for Product-First Teams GitHub - pnhoang/youtube-spam-blocker: Automatically detects and hides spam messages in YouTube Live chat. Set rate limits, keyword filters, and block repeat offenders. GitHub - decisionnode/DecisionNode: CLI + Local MCP - A shared structured memory store across Claude Code, Cursor, Windsurf, Antigravity, and every MCP client. Semantically queryable. GitHub - AvaCodeSolutions/django-email-learning: An open source Django app for creating email-based learning platforms with IMAP integration and React frontend components. The $100K Gap in Kubernetes Security Tooling Function Calling Harness: From 6.75% to 100%
GitHub - cnygaard/glq: E8 lattice codebook quantization for LLM weights — 2/3/4 bpw with fused Triton inference kernel
acd · 2026-06-02 · via Hacker News: Show HN

Post-training weight quantization for LLMs using E8 lattice codebooks.

GLQ encodes each 8-weight group as a 16-bit index into a 65,536-entry E8 lattice codebook. A Randomized Hadamard Transform (RHT) decorrelates the Hessian so that Euclidean nearest-neighbour search is near-optimal under the proxy loss. The result: 2–8 bpw weights with quality comparable to QuIP# / better than GPTQ, and a fused CUDA kernel that matmuls directly against the compressed indices without materializing the weight matrix.

Quickstart

Run a pre-quantized model

pip install glq         # requires PyTorch ≥ 2.0

Python ≥ 3.10. Triton ships with PyTorch on CUDA and is used automatically. The CUDA C extension JIT-builds on first run (~30 s); CPU falls back to dequantize-then-matmul.

import glq.hf_integration  # registers GLQ with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw",
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw")
print(tok.decode(model.generate(
    **tok("The capital of France is", return_tensors="pt").to(model.device),
    max_new_tokens=20,
)[0], skip_special_tokens=True))

import glq.hf_integration registers quant_method="glq" with HF Transformers; from_pretrained then swaps nn.Linear for E8RHTLinear and uses the fused CUDA C kernel on inference. CPU falls back to a naive dequantize-then-matmul.

Available pre-quantized checkpoints

Repo Base model bpw License VRAM¹ Tok/s² (b1 / b32)
xv0y5ncu/SmolLM2-135M-Instruct-GLQ-4bpw SmolLM2-135M-Instruct 4.0 Apache 2.0 0.18 152 / 4205
xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw SmolLM2-360M-Instruct 4.0 Apache 2.0 0.33 135 / 2990
xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw SmolLM3-3B 3.5 (mixed) Apache 2.0 2.4 35 / 654
xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw Gemma-4-E4B-it 4.0 Apache 2.0 5.8 33 / 600
xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw Devstral-Small 24B 4.0 Apache 2.0 ~20.5 6.6 / —
xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-4bpw Nemotron-3-Nano-30B (Mamba-MoE) 4.0 Nemotron

¹ VRAM = resident weight footprint at load (vLLM's Model loading took … GiB) on a g6e.xlarge (NVIDIA L40S) — the figure that decides whether a model fits a 24/32 GB card; it tracks the bpw budget. Devstral ≈ 20.5 GiB is its HF-transformers load; Nemotron-30B not measured here.
² Tok/s = total decode throughput, weight-only GLQ, vLLM 0.20.2, short context (256 generated tokens), same hardware. b1 = single-stream, b32 = 32 concurrent sequences — a high-batch sample near the throughput knee, not a hard maximum. Devstral-24B is HF-transformers single-stream (vLLM v1 deadlocks; no batched figure; see CUDA-graph decode wrapper). Nemotron-3-Nano-30B is a Mamba-MoE (vLLM-unsupported here, compute-bound) — not benchmarked. E8-KV compression leaves these short-context numbers unchanged; its payoff is a ~4× smaller KV cache → more context / concurrency in the same VRAM (see KV cache compression).

Quantize your own model

pip install 'glq[quantize]'    # adds transformers, datasets, etc.

glq-quantize \
    --model HuggingFaceTB/SmolLM2-360M \
    --output ./smollm2-glq-4bpw \
    --bpw 4 \
    --nsamples 128 \
    --device cuda

Other bit-widths: pass --bpw 2 through --bpw 8 (fractional like 2.5 also works). glq-quantize --help lists every flag. For models that don't fit in system RAM use --streaming (loads one layer at a time from safetensors).

For mixed-precision allocation, run a two-pass flow: a profile pass writes a per-layer bpw_allocation.json, then a quantize pass applies it. See examples/quantize_mixed_precision.md.

Docker image (NVIDIA GPU)

A prebuilt CUDA image ships everything needed to run GLQ models — glq, PyTorch, vLLM, transformers, and lm-eval on CUDA 12.8:

ghcr.io/cnygaard/glq-env:0.5.0     # also :latest, :0.5

Prerequisite — GPU access in Docker. You need an NVIDIA GPU plus the NVIDIA Container Toolkit installed on the host; that's what makes the --gpus all flag pass the GPU into the container. Verify it works:

docker run --rm --gpus all ghcr.io/cnygaard/glq-env:0.5.0 nvidia-smi

If that prints your GPU table, you're set. (No toolkit → --gpus errors with "could not select device driver".)

Produce output. Mount a host directory for the model cache (the image's HF_HOME is /cache/hf, so models persist across runs instead of re-downloading), then generate:

docker run --rm --gpus all \
    -v "$HOME/.cache/huggingface:/cache/hf" \
    ghcr.io/cnygaard/glq-env:0.5.0 \
    python -c '
import glq.hf_integration, torch                      # registers GLQ with HF
from transformers import AutoModelForCausalLM, AutoTokenizer
mid = "xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid, device_map="cuda", torch_dtype=torch.float16)
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0],
                 skip_special_tokens=True))
'

Expected output:

The capital of France is Paris. It is located in the north of the country.

The first run downloads the model into the mounted cache; later runs reuse it. Swap mid for any GLQ checkpoint (see Available pre-quantized checkpoints).

Flag reference:

Flag Why
--gpus all binds all host GPUs into the container (needs the NVIDIA Container Toolkit). Use --gpus '"device=0"' to pick one.
-v "$HOME/.cache/huggingface:/cache/hf" persists downloaded weights on the host (HF_HOME=/cache/hf inside) so they survive --rm.
--rm remove the container when it exits (drop it to keep the container around).

Serving (vLLM) & an interactive shell. The image bundles vLLM, so you can serve an OpenAI-compatible endpoint — publish the port and mount the cache:

# Plain chat — the model ships its own chat template, so nothing extra needed:
docker run --rm --gpus all -p 8000:8000 \
    -v "$HOME/.cache/huggingface:/cache/hf" \
    ghcr.io/cnygaard/glq-env:latest \
    vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw --max-model-len 64000

Tool-calling + thinking. Gemma-4's tool template is not in the model (its bundled chat_template.jinja is plain chat) and not in the vLLM pip wheel, so fetch it from vLLM's examples/ first, then mount it:

curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/v0.20.2/examples/tool_chat_template_gemma4.jinja \
    -o tool_chat_template_gemma4.jinja

docker run --rm --gpus all -p 8000:8000 \
    -v "$HOME/.cache/huggingface:/cache/hf" \
    -v "$PWD/tool_chat_template_gemma4.jinja:/work/tool.jinja:ro" \
    ghcr.io/cnygaard/glq-env:latest \
    vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw \
        --max-model-len 64000 \
        --enable-auto-tool-choice \
        --tool-call-parser gemma4 \
        --reasoning-parser gemma4 \
        --chat-template /work/tool.jinja \
        --default-chat-template-kwargs '{"enable_thinking": true}'

Pass --chat-template the in-container mount path (/work/tool.jinja), not the host path. --default-chat-template-kwargs '{"enable_thinking": true}' defaults Gemma-4 reasoning on. The gemma4 parsers and all of these flags are accepted by the image's bundled vLLM 0.20.2 and the model loads; note that startup runs a multi-minute torch.compile + CUDA-graph capture before the endpoint is ready. See the vLLM Gemma-4 recipe for the full tool-calling / reasoning reference.

Image vLLM note: in-image vLLM serving needs an image built from the v0.5.2 Dockerfile fix or later (vLLM now resolves its own matching-CUDA torch). The published :0.5.0 / :0.5.1 snapshots predate that fix and hit ImportError: libcudart.so.13 under --gpus all; on those, use the HF generate path above (works), or pip install glq + your own vLLM. The pip package is unaffected on all versions.

For the long-context E8 KV-cache flags (GLQ_KV_*), pass them with -e and see E8 lattice cache / Inline-dequant E8 KV. The image's default command is a shell (docker run --rm -it --gpus all ghcr.io/cnygaard/glq-env:latest) if you'd rather poke around interactively.

Results

SmolLM3-3B at matched 4.5 bpw vs GPTQ

Blackwell RTX PRO 6000, 128 calibration samples, lm-evaluation-harness limit=200/task (GSM8K n=500, MMLU 50/subtask). GLQ 4.5 bpw uses two-pass mixed allocation (91 layers @ 4 bpw + 161 @ 5 bpw, avg 4.64 bpw).

Task bf16 GLQ 4.5 bpw GPTQ W4 g128
ARC-challenge (acc_n) 0.490 0.475 0.420
ARC-easy (acc_n) 0.745 0.735 0.695
HellaSwag (acc_n) 0.660 0.660 0.675
MMLU (acc) 0.617 0.603 0.589
TruthfulQA mc2 0.529 0.545 0.515
WinoGrande 0.655 0.660 0.670
WikiText-2 ppl ↓ 10.67 10.90 11.33
GSM8K flex (n=500) 0.722 0.738 0.688
IFEval prompt-strict 0.310 0.310 0.285
IFEval prompt-loose 0.325 0.330 0.295
IFEval inst-strict 0.478 0.472 0.453
IFEval inst-loose 0.494 0.491 0.469

GLQ beats GPTQ on 10/12 metrics. WikiText-2 ppl gap to bf16: +2.2 % (GLQ) vs +6.2 % (GPTQ). GSM8K flex matches bf16; GPTQ drops 0.034.

Small models: SmolLM2-360M-Instruct at 4 bpw

GPTQ requires a group-size dividing the hidden dim; SmolLM2-360M's hidden=960 is not divisible by 128, forcing group_size=64 (~4.5 eff bpw) and losing quality. GLQ has no group-size constraint.

Method bpw 5-task avg % of bf16
bf16 16.0 0.557 100 %
GLQ 4-bit 4.0 0.555 99.6 %
GPTQ W4 (g64) ~4.5 0.486 87.2 %

5-task = ARC-e, HellaSwag, PIQA, WinoGrande, LAMBADA; 128 calibration samples; L40S. GPTQ's LAMBADA collapses to 0.346; GLQ preserves 0.508.

Throughput: SmolLM3-3B on vLLM

GLQ runs at near-bf16 throughput because compressed weights cut DRAM bandwidth enough to roughly offset the dequantization cost.

Method bpw Single req Batch=5 vs bf16
bf16 16.0 39.4 tok/s 184 tok/s 100 %
GLQ 3.5bpw 3.5 37.1 tok/s 173 tok/s 94 %
GPTQ W4 (g128) ~4.5 34.6 tok/s 172 tok/s 88 %

vLLM 0.18.1, L40S.

How it works

  1. E8 lattice codebook. 65,536 vectors from the first seven shells of the E8 lattice in 8 dimensions. Each 8-weight group of the weight matrix is encoded as one 16-bit index into this codebook (so the primary stage is 2 bpw). For 3–8 bpw, additional 8-bit (256-entry) or 16-bit (E8) residual codebooks refine the primary's reconstruction error.

  2. Randomized Hadamard Transform. Random sign flips followed by Fast Walsh-Hadamard Transform rotate both weights and Hessian. After RHT the Hessian is approximately diagonal, so plain Euclidean nearest-neighbour in the codebook is near-optimal under the Hessian-weighted proxy loss.

  3. LDLQ error feedback. Block-LDL decomposition of the Hessian drives a sequential sweep — GPTQ-style, but over 8-D blocks instead of scalar columns. Each block's quantization error propagates forward to correct downstream blocks.

  4. Fused inference kernels. Custom CUDA C and Triton kernels read codebook indices from HBM, gather the 8-D vectors from the L2-cached 1 MB codebook, and accumulate the matmul directly — the dense weight matrix is never materialized. GPU memory savings scale with the compression ratio.

KV cache compression

GLQ ships two KV cache compressors. Either is opt-in — default behaviour is unchanged.

INT8 cache (HF transformers)

Per-channel absmax INT8 plus a small fp16 residual window for recent tokens — KIVI-style. Halves the KV memory at long context.

import glq.hf_integration
from glq.kv_cache import GLQQuantizedCache

cache = GLQQuantizedCache(model.config)
output = model.generate(**inputs, max_new_tokens=200,
                         past_key_values=cache)

Requires transformers >= 4.45. No external dependencies.

E8 lattice cache (vLLM, v0.3.0+)

Drops vLLM's paged KV cache to ~25 % of fp16 footprint using the same E8 lattice quantizer used for weights. Two fused Triton kernels (read-side dequant-gather, write-side scatter) keep decode within ~20 % of un-fused throughput.

Measured on Gemma-4-E4B-it, RTX PRO 6000 Blackwell, vLLM 0.20:

fp16 baseline E8 lattice
KV cache capacity @ 27.9 GiB 303,984 tokens 1,221,232 (4.02×) at e8_relaxed:1
mmlu_pro n=240 accuracy 71.25 % 71.25 % (bit-identical) at e8_relaxed:2
NIAH passkey @ ctx=16k / 32k / 64k / 130k 40/40 at e8_relaxed:2 (full 128k window)
cudaLaunchKernel per decode 110,659 71,619 (−35 %) at e8_relaxed:2

Activation:

GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
vllm serve google/gemma-4-E4B-it

The envs above use the workspace path: GLQ pre-decompresses the referenced K/V into a scratch buffer, then calls vLLM's stock attention. Because that buffer is built with a data-dependent block_table.unique(), glq auto-forces cudagraph_mode=PIECEWISE for this path (you'll see [glq_vllm] E8 KV active → cudagraph_mode forced ... to PIECEWISE at startup; --enforce-eager is no longer required as of v0.3.5). Weight-only GLQ still uses the default FULL_AND_PIECEWISE. The v0.5 inline-dequant path below lifts the PIECEWISE restriction and is the recommended path for long-context / KV-bound serving.

Validated end-to-end on Gemma-4-E4B-it / Gemma-4-31B-it on vLLM 0.20.x.

Inline-dequant E8 KV (default in v0.5.1)

The workspace path above pre-decompresses K/V into a scratch buffer that vLLM's attention then re-reads — pure overhead, since each K/V vector is read exactly once. The inline-dequant path instead dequantizes the compressed E8 K/V inside a forked Triton attention kernel (an 8-point FHT butterfly for the inverse Hadamard, plus flash-decoding KV-split for long-context occupancy). There is no workspace, and — because the read/write hooks are host-sync-clean — the FULL CUDA graph captures the whole decode, eliminating the per-token eager-dispatch overhead that dominated E8-KV decode.

As of v0.5.1 this is the default for the E8-KV path — the standard bundle is all you need (no extra flag):

GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
vllm serve xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw

Opt out with GLQ_KV_E8_INLINE_DEQUANT_V3=0 (reverts to the 65 K workspace path) or GLQ_KV_E8_FORCE_PIECEWISE=1 (keeps inline but disables the FULL decode graph).

Decode throughput, SmolLM3-3B-GLQ-3.5bpw, RTX PRO 6000 Blackwell, vLLM 0.20.2 — inline vs the pre-v0.5 E8-KV path (workspace, PIECEWISE):

E8 KV before v0.5 inline (v0.5)
decode B=1 ~15 tok/s 38 (2.5×)
decode B=4 ~37 127 (3.4×)
decode @ ctx=16k, B=1 ~15 36 (2.4×)

The speedup is the FULL-graph capture the inline path unlocks; it brings E8-KV decode to roughly weight-only parity. On Gemma-4-E4B-it (large heads, already compute-bound) decode is roughly unchanged, but quality and long-context behaviour match.

Quality is neutral. On SmolLM3 the inline-FULL path is bit-identical to PIECEWISE (MMLU-Pro n=120 and NIAH-16k match exactly). On Gemma-4 it lands within vLLM's own run-to-run greedy non-determinism — MMLU-Pro n=120, thinking, 16384-token budget: PIECEWISE 0.742 vs inline-FULL 0.750 (a smaller gap than two PIECEWISE runs differ from each other), NIAH-16k 10/10 both.

Scope. It covers the 4 bpw KV recipe (e8_relaxed:2); other recipes automatically fall back to the workspace path. It requires the Triton attention backend (auto-forced when E8 KV is active). Validated across the consumer GPU lineup — A10G (sm_86, 3090-class, 24 GB), L40S (sm_89, 4090-class), and RTX PRO 6000 Blackwell (sm_120, 5090-class): the kernels compile and NIAH-16k + MMLU are correct on all three, and FULL-vs-PIECEWISE is quality-neutral on Blackwell (the consumer-card runs are shorter FULL-only smokes). Opt out per above.

Advanced

CUDA-graph decode wrapper

The B=1 autoregressive decode path is Python-dispatch-bound in eager mode. CUDAGraphWrapper captures the fixed-shape decode and replays it; benchmarks below are on SmolLM3-3B 3.5bpw, L40S.

Mode GLQ 3.5 bpw bf16
Eager 25 tok/s 40
CUDA graph 37 tok/s 40
from glq.cuda_graph import CUDAGraphWrapper
wrapper = CUDAGraphWrapper(model)
logits = wrapper(input_ids)   # first call captures; replays after

The wrapper falls back to eager for variable shapes (prefill, batch>1, extra kwargs). For 24B models the matmul is compute-bound at B=1, so graphs don't help (Devstral-24B GLQ 4 bpw: 6.6 tok/s eager vs 6.4 graphed).

Tuning vLLM CUDA-graph capture sizes (v0.3.4+)

vLLM 0.20 captures both FULL model-forward graphs (single replay per fixed shape) and PIECEWISE subgraphs split at attention. The default capture set is derived from max_num_seqs * 2, so a single-sequence harness only gets FULL captures for [1, 2]. For batched serving, raise the list explicitly:

from vllm import LLM
llm = LLM(model="xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw",
          compilation_config={
              "cudagraph_capture_sizes": [1, 2, 4, 8, 16],
          })

Measured impact on Gemma-4-E4B-it-GLQ-4bpw, RTX PRO 6000 Blackwell, 256-token decode:

Mode B=1 tok/s B=4 tok/s (total)
Eager 14.4 35.0
Piecewise + default capture [1, 2] 39.4 132.7
Piecewise + capture [1, 2, 4, 8, 16] 40.0 157.3 (+18.5 %)

At B=1 the FULL graph was already captured (no change). At B=4 the extended list keeps the FULL graph active where the default degenerated to PIECEWISE-only, recovering ~6 tok/s per sequence.

Cost: ~10-20 MB VRAM per captured shape on 3B / E4B models (vLLM prints the total at "Graph capturing finished in N s, took X GiB"). On 24-31B models budget ~100-200 MB per shape. Capture time is ~1 s per shape, one-time at LLM init.

Bit widths

bpw Primary Residual stages
2 16 b
3 16 b + 8 b
4 16 b + 16 b
5 16 b + 16 b + 8 b
6 16 b + 16 b + 16 b
7 16 b + 16 b + 16 b + 8 b
8 16 b + 16 b + 16 b + 16 b

One global scale per layer; no group-size parameter. Non-power-of-2 hidden sizes use block-diagonal FHT (v0.2.9+) — e.g. 2688 is decomposed as 2048 + 512 + 128 so on-disk storage matches the nominal rate exactly.

Serving with sglang

A fork of sglang with GLQ support lives at cnygaard/sglang on the glq-quantization branch. It registers "glq" as a quantization method and reuses the existing glq.inference_kernel CUDA extension as a runtime dependency.

git clone -b glq-quantization https://github.com/cnygaard/sglang
cd sglang/python && pip install -e .

python -m sglang.launch_server \
    --model xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw \
    --tokenizer-path HuggingFaceTB/SmolLM2-360M-Instruct \
    --quantization glq \
    --attention-backend triton --sampling-backend pytorch

Requires the triton attention backend (flashinfer returns wrong logprobs in echo/prefill mode). Default CUDA-graph capture is supported (v0.3.2+). If you hit a graph-break in a model architecture we haven't tested, pass --disable-piecewise-cuda-graph as a fallback.

Devstral-24B tokenizer

transformers 5.x auto-routes Mistral/Devstral models through mistral_common, which rejects the standard tokenizer.json. Use PreTrainedTokenizerFast explicitly:

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

path = snapshot_download("xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw")
tok = PreTrainedTokenizerFast(tokenizer_file=f"{path}/tokenizer.json")
tok.pad_token, tok.eos_token, tok.bos_token = "<pad>", "</s>", "<s>"
model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw",
    device_map="cuda", dtype="float16",
)

examples/inference_hf.py includes a load_tokenizer() helper that handles this automatically.

transformers compatibility

For models ≤ 1B parameters use transformers >= 5.0. Transformers 4.57.x has a weight-loading bug that produces garbage output for small GLQ models. Larger models (3B+) work with both 4.x and 5.x.

Inference kernels

glq/inference_kernel.py + glq/csrc/glq_cuda.cu provide CUDA C and Triton kernels that compute Y = X @ dequant(W)^T without materializing the weight matrix. Each kernel iterates over N/8 codebook blocks per output row, gathers 8-D vectors from the L2-cached codebook, and accumulates the matmul directly against indices.

Path When Notes
CUDA C Tensor Core B ≥ 2 (prefill) inline PTX mma.sync against codebook-loaded registers; 3-5× faster than Triton
CUDA C split-K matvec B = 1 (decode) 4 rows/warp + __shfl_xor_sync reduction; 2.7× faster than Triton
CUDA C shared-mem FHT RHT step double-buffered butterfly; 1.6-3× faster than Triton
Triton fallback no ninja, or n_pad > 32 768 always available

Bit-exact determinism. Every kernel uses a scratch-buffer + fixed- order reduction instead of atomicAdd across k-splits, so running the same prompt at B=1 decode or B=8 prefill produces identical logits across runs — required for reproducible lm-eval scoring and on-policy RL rollouts.

Direct kernel access:

from glq.inference_kernel import glq_dequant_matmul
y = glq_dequant_matmul(x, Qidxs, codebook, Wscale,
                       Qidxs2=Qidxs2, codebook2=codebook2,
                       inv_resid_scale=inv_rs)  # 3/4 bpw two-stage

Architecture

glq/
  codebook.py          # E8ShellCodebook: enumeration, encode/decode
  hadamard.py          # Fast Walsh-Hadamard Transform
  rht.py               # Randomized Hadamard Transform
  ldlq.py              # Block-LDL quantization with error feedback
  quantize_model.py    # Full model pipeline + CLI
  quantized_linear.py  # E8RHTLinear: drop-in nn.Linear replacement
  inference_kernel.py  # Triton kernels + CUDA dispatch
  csrc/glq_cuda.cu     # CUDA C kernels (split-K matvec, TC, FHT)
  hf_integration.py    # HuggingFace Transformers integration
  kv_cache.py          # INT8 quantized KV cache
  cuda_graph.py        # B=1 decode wrapper
glq_vllm/              # vLLM integration: weight + KV cache (v0.3.0+)

Acknowledgments

Inspired by QuIP# (Tseng et al., 2024).

  • E8 lattice: Korkin & Zolotarev (1872); Gosset (1900); Conway & Sloane, Sphere Packings, Lattices and Groups; Viazovska (2016) — sphere-packing optimality in 8 dimensions.
  • Block-feedback quantization: GPTQ (Frantar et al., 2022).
  • INT8 KV cache: KIVI (Liu et al., 2024).

License

Apache 2.0