GGUF & Modelfile: The Power User's Guide to Local LLMs
Lingdas1 · 2026-05-24 · via DEV Community


Beyond ollama pull — download any model from Hugging Face, quantize it, customize it, and import it into Ollama.

What's GGUF?

GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp and the tools built on it, including Ollama, which makes it the de facto standard for running LLMs locally. Think of it as the .mp3 of AI models:

  • Compact: quantized GGUF files are typically 70-85% smaller than the original float16 weights
  • Fast: optimized for CPU and GPU inference
  • Portable: one file contains the entire model
  • Metadata-rich: includes the tokenizer, chat template, and model config
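To make the "one file with metadata" point concrete, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes the version 2+ layout (4-byte magic, little-endian uint32 version, uint64 tensor count, uint64 metadata entry count); the function name and the sample values are made up for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumes format version 2 or later)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_entries": kv_count}

# Synthetic header so the example runs without a model file.
# For a real file: with open("model.gguf", "rb") as f: read_gguf_header(f.read(24))
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 25)
print(read_gguf_header(sample))
```

The tokenizer, chat template, and model config live in the metadata entries that follow this header; parsing those is left out to keep the sketch short.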

Every ollama pull downloads a GGUF file under the hood. But the real power move is downloading GGUF files directly from Hugging Face and importing them yourself.

Quantization Analogy (Steal This)

Quantization is like JPEG compression for AI models. A RAW photo is 50MB. A JPEG of the same photo is 5MB: 90% smaller, yet it looks nearly as good. Q4_K_M quantization does something similar to a model. By the rough figures used in this guide, the file is about 70% smaller than float16 while keeping about 96% of the output quality.
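The idea behind the analogy can be sketched in a few lines. This is a deliberately simplified symmetric 4-bit scheme with one scale per block, not the actual Q4_K_M algorithm (which uses a more elaborate block layout); the weights are invented sample values.

```python
def quantize_block(values, bits=4):
    """Round each float to a small signed integer code plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed codes
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the codes."""
    return [scale * c for c in codes]

weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.45, 0.02]  # illustrative values
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"largest rounding error: {max_error:.4f} (bound: {scale / 2:.4f})")
```

Each weight now costs 4 bits instead of 16, and the rounding error per weight is at most half the scale. That is the same trade the JPEG makes: fewer bits, small bounded loss.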


Step 1: Finding the Right GGUF File

The Golden Rule

Always look for Q4_K_M — it's the sweet spot of size vs quality for almost every model.

Where to Find GGUFs

| Source | URL | Best For |
| --- | --- | --- |
| Official provider | huggingface.co/Qwen etc. | Trustworthy, but often only Q8/Q6 |
| Unsloth | huggingface.co/unsloth | Best selection of quants (Q2-Q8) |
| Bartowski | huggingface.co/bartowski | Massive library, every quantization |
| MaziyarPanahi | huggingface.co/MaziyarPanahi | Merged models, niche architectures |

The GGUF Filename Decoder

Qwen2.5-14B-Q4_K_M.gguf
├── Qwen2.5   Model name
├── 14B       Size (parameter count)
└── Q4_K_M    Quantization
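The same decoding can be done programmatically. The sketch below assumes the simple `<model>-<size>-<quant>.gguf` pattern shown above; the regular expression and function name are illustrative, and real repositories often add extra segments (for example `-Instruct-`) that this pattern will reject.

```python
import re

# Assumed naming pattern: <model>-<size>-<quant>.gguf
GGUF_NAME = re.compile(
    r"^(?P<model>.+?)-(?P<size>\d+(?:\.\d+)?[BM])-(?P<quant>I?Q\d\w*)\.gguf$",
    re.IGNORECASE,
)

def decode_gguf_filename(filename: str) -> dict:
    """Split a GGUF filename into model name, parameter size, and quantization code."""
    match = GGUF_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized GGUF filename: {filename}")
    return match.groupdict()

print(decode_gguf_filename("Qwen2.5-14B-Q4_K_M.gguf"))
```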


| Quant Code | Size reduction vs FP16 | Quality retained (approx.) | Use Case |
| --- | --- | --- | --- |
| Q8_0 | 50% | 99% | When you have VRAM to spare |
| Q6_K | 60% | 98% | High-quality, reasonable size |
| Q4_K_M | 70% | 96% | 🟢 Sweet spot: use this |
| Q3_K_M | 78% | 92% | When VRAM is tight |
| Q2_K | 85% | 85% | Emergency only: quality drops noticeably |
| IQ4_XS | 72% | 95% | Experimental import format |
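The percentages in the table translate directly into a rough download size. The sketch below assumes float16 weights take 2 bytes each and uses the table's reduction figures as-is; actual file sizes vary by model and by who produced the quant.

```python
# Size reductions relative to float16, taken from the table above (rough figures).
SIZE_REDUCTION = {
    "Q8_0": 0.50, "Q6_K": 0.60, "Q4_K_M": 0.70,
    "Q3_K_M": 0.78, "Q2_K": 0.85, "IQ4_XS": 0.72,
}

def estimated_size_gb(params_billions: float, quant: str) -> float:
    """Estimate GGUF file size in GB for a model with the given parameter count."""
    fp16_gb = params_billions * 2          # 2 bytes per weight, 1e9 weights per billion
    return fp16_gb * (1.0 - SIZE_REDUCTION[quant])

for quant in ("Q8_0", "Q4_K_M", "Q2_K"):
    print(f"14B at {quant}: ~{estimated_size_gb(14, quant):.1f} GB")
```

Leave headroom beyond this number: the context window (KV cache) and runtime buffers also need memory.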

Step 2: Download & Import a GGUF

Basic Import

# 1. Download Q4_K_M of Qwen 2.5-14B
wget https://huggingface.co/bartowski/Qwen2.5-14B-GGUF/resolve/main/Qwen2.5-14B-Q4_K_M.gguf

# 2. Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen2.5-14B-Q4_K_M.gguf
EOF

# 3. Import into Ollama
ollama create my-custom-model -f Modelfile

# 4. Run it
ollama run my-custom-model
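Multi-gigabyte downloads occasionally arrive corrupted, so it can be worth checking the file before importing it. A minimal sketch, assuming you have copied the expected SHA-256 value from the file's page on Hugging Face; the helper names are illustrative.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large GGUF files never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare the file's hash with the published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# Usage (the expected value is a placeholder you paste in yourself):
#   matches_checksum("Qwen2.5-14B-Q4_K_M.gguf", "<sha256 from the model page>")
```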


Smart Import (with Optimized Settings)

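cat > Modelfile << 'EOF'
FROM ./Qwen2.5-14B-Q4_K_M.gguf

# Performance tuning
PARAMETER num_ctx 32768
# num_gpu = number of layers to offload to the GPU; a large value offloads as many as fit
PARAMETER num_gpu 999
PARAMETER num_thread 8
# (numa is omitted: it only matters on multi-socket machines and not every Ollama release accepts it)

# Generation
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

# Chat template (CRITICAL: must match the model!)
# This is the ChatML layout used by Qwen models. Other families, such as
# DeepSeek-R1, use different special tokens, so take the template from the model card.
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

# System prompt
SYSTEM """You are a helpful AI assistant."""
EOF

ollama create my-qwen-custom -f Modelfile
ollama run my-qwen-custom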
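A 32768-token context is not free: the KV cache grows linearly with `num_ctx`. The estimate below uses the standard formula (keys plus values, per layer, per KV head) with float16 cache entries. The layer and head counts are assumptions chosen to resemble a 14B-class model; read the real values from your model's config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Memory for the key/value cache: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value

# Assumed architecture for illustration only.
ASSUMED = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32768):
    gib = kv_cache_bytes(num_ctx=ctx, **ASSUMED) / 2**30
    print(f"num_ctx={ctx}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache alone goes from under 1 GiB at 4096 tokens to about 6 GiB at 32768, on top of the model weights. If the model stops fitting in VRAM, lowering `num_ctx` is often the cheapest fix.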



Step 3: Modelfile Reference

A Modelfile is like a Dockerfile for LLMs. Every line is an instruction.

Parameters Reference

| Parameter | What It Does | Default | Recommended Range |
| --- | --- | --- | --- |
| temperature | Creativity level | 0.8 | 0.2 (code) – 1.0 (creative) |
| top_p | Nucleus sampling | 0.9 | 0.85 – 0.95 |
| top_k | Top-K sampling | 40 | 20 – 100 |
| num_ctx | Context window size | 2048 (4096 in recent releases) | 4096 – 65536 |
| num_gpu | GPU layers | auto | 999 (use all VRAM) |
| num_thread | CPU threads | auto | 4 – 16 |
| repeat_penalty | Penalize repetition | 1.1 | 1.0 – 1.2 |
| stop | Stop sequences | varies | `< |
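If you maintain several variants of a model, generating the Modelfile from a dictionary of these parameters keeps them consistent. This is a small sketch, not part of Ollama: `render_modelfile` is a hypothetical helper that only emits FROM, PARAMETER, and SYSTEM lines.

```python
def render_modelfile(base: str, parameters: dict, system: str = "") -> str:
    """Build Modelfile text from a base model, a parameter dict, and a system prompt."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        # A list value (e.g. several stop sequences) becomes one PARAMETER line per item.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            text = f'"{item}"' if isinstance(item, str) else str(item)
            lines.append(f"PARAMETER {name} {text}")
    if system:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"

print(render_modelfile(
    "qwen2.5:7b",
    {"temperature": 0.2, "num_ctx": 8192, "stop": ["<|im_end|>"]},
    system="You are a concise assistant.",
))
```

Write the result to a file named `Modelfile` and pass it to `ollama create` as in Step 2.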

SYSTEM vs TEMPLATE vs MESSAGE

# SYSTEM: Persistent system prompt (like OpenAI's system message)
SYSTEM """You are a helpful assistant."""

# TEMPLATE: How the system prompt and user message are laid out for the model
TEMPLATE """User: {{ .Prompt }}
Assistant: """

# MESSAGE: Seeds the conversation with example turns (optional, rarely needed).
# Note: Modelfile has no INSTRUCTION keyword. The documented instructions include
# FROM, PARAMETER, TEMPLATE, SYSTEM, ADAPTER, LICENSE, and MESSAGE.
MESSAGE user Is Toronto in Canada?
MESSAGE assistant Yes.


Three Production Configs

1. Coding Assistant

FROM qwen2.5:7b
PARAMETER temperature 0.2
PARAMETER top_p 0.85
PARAMETER num_ctx 65536
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert Python developer. Write clean, tested code."""


2. Creative Writer

FROM mistral
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER num_ctx 16384
SYSTEM """You are a novelist. Be vivid and descriptive."""


3. Customer Support

FROM llama4
PARAMETER temperature 0.5
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM """You are a helpful customer support agent.
Be polite, concise, and solution-oriented.
NEVER mention that you are an AI."""



Step 4: Advanced Techniques

4.1 Multi-GPU Setup

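FROM deepseek-r1:70b

# Offload as many layers as fit. When a model does not fit on one GPU,
# Ollama splits it across the visible GPUs automatically.
PARAMETER num_gpu 999
# Index of the primary GPU
PARAMETER main_gpu 0
# Note: tensor_split is a llama.cpp option, not an Ollama Modelfile parameter.
# To control which GPUs are used, set CUDA_VISIBLE_DEVICES before starting the Ollama server.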


4.2 LoRA Adapters (Experimental)

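Ollama supports LoRA adapters through the ADAPTER instruction. The adapter must have been trained on the same base model named in FROM, otherwise the output will be erratic: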

FROM base-model
ADAPTER ./my-finetune-lora.gguf
PARAMETER temperature 0.7


4.3 Custom Stop Tokens

DeepSeek-R1 and Qwen use different stop tokens:

# For Qwen
TEMPLATE """<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"

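# For DeepSeek (an alternative to the Qwen block, not an addition to it;
# simplified plain-text format, check the model card for the exact tokens)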
TEMPLATE """User: {{ .Prompt }}
Assistant: """
PARAMETER stop "User:"
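What a stop sequence does is easy to show outside the runtime: generation is cut at the first occurrence of any configured sequence, and the sequence itself is not returned. The function below is an illustrative stand-in for that behavior, not Ollama's implementation.

```python
def apply_stop_sequences(text: str, stops: list[str]) -> str:
    """Truncate text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

raw_output = "Paris.<|im_end|>\n<|im_start|>user\nAnd Germany?"
print(apply_stop_sequences(raw_output, ["<|im_end|>", "<|im_start|>"]))
```

Without the right stop tokens, the model keeps going and starts writing the next user turn itself, which is the runaway output people often mistake for a broken model.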


4.4 Emergency: VRAM Too Low

If you get "CUDA out of memory":

# Offload only 24 layers to the GPU; the remaining layers run on the CPU
PARAMETER num_gpu 24
# Use 8 CPU threads for the CPU-resident layers
PARAMETER num_thread 8
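To pick the layer count instead of guessing, you can divide the usable VRAM by the approximate size of one layer. This is a rough heuristic under stated assumptions: layers are treated as equally sized, and `reserve_gb` is a guessed allowance for the KV cache and runtime buffers. All numbers below are illustrative.

```python
import math

def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many layers can be offloaded to the GPU."""
    per_layer_gb = model_gb / n_layers              # assumes equally sized layers
    usable_gb = max(0.0, vram_gb - reserve_gb)      # keep room for cache and buffers
    return max(0, min(n_layers, math.floor(usable_gb / per_layer_gb)))

# Example: a 9 GB model with 48 layers on a 6 GB card
print(layers_that_fit(model_gb=9.0, n_layers=48, vram_gb=6.0))
```

Treat the result as a starting point for the layer count and lower it if the out-of-memory error persists.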



Step 5: GGUF from Ollama Models (Export)

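At the time of writing, the Ollama CLI has no documented export command, but for most text models the weights it pulls are stored as a GGUF blob that you can copy out. The FROM line printed by ollama show --modelfile gives the blob's path:

# Locate the GGUF blob behind a pulled model
ollama pull qwen2.5:7b
ollama show --modelfile qwen2.5:7b | grep '^FROM'

# Copy the blob path printed above to a .gguf file
cp /path/from/the/FROM/line ./my-export.gguf

# Now you can use it anywhere (llama.cpp, text-generation-webui, etc.)
./llama-cli -m ./my-export.gguf -p "Hello"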


This is useful for:

  • Moving models between machines without re-downloading
  • Using the same model with multiple inference engines
  • Sharing a specific quantization with teammates
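One way to locate the GGUF blob behind an Ollama model is to read the FROM line that `ollama show --modelfile` prints. The sketch below parses that output. The sample text and blob path are made up to resemble the real output, and the helper name is hypothetical.

```python
def blob_path_from_modelfile(modelfile_text: str) -> str:
    """Return the path on the first uncommented FROM line of a Modelfile."""
    for line in modelfile_text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("FROM "):
            return stripped[5:].strip()
    raise ValueError("no FROM line found")

# Hypothetical output; obtain the real text with:
#   subprocess.run(["ollama", "show", "--modelfile", "qwen2.5:7b"],
#                  capture_output=True, text=True, check=True).stdout
SAMPLE = """# Modelfile generated by "ollama show"
# FROM qwen2.5:7b

FROM /home/user/.ollama/models/blobs/sha256-0123abcd
TEMPLATE ...
"""

print(blob_path_from_modelfile(SAMPLE))
```

Copying the file at that path to a `.gguf` name gives you something other GGUF-compatible tools can load.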

Performance Cheat Sheet

By GPU

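| GPU | VRAM | Best GGUF Model | Expected Speed |
| --- | --- | --- | --- |
| RTX 3060 / 4060 | 12 GB | Qwen 2.5-14B (Q4_K_M) | 30-40 tok/s |
| RTX 4070 / 5070 | 12 GB | Qwen 2.5-14B (Q4_K_M) | 35-50 tok/s |
| RTX 4080 / 5080 | 16 GB | DeepSeek-R1-14B (Q4_K_M) | 30-45 tok/s |
| RTX 4090 / 5090 | 24 GB | DeepSeek-R1-32B (Q4_K_M) | 18-25 tok/s |
| Mac M2 Pro | 16 GB | Qwen 2.5-7B (Q4_K_M) | 15-25 tok/s |
| Mac M4 Max | 36 GB | Qwen 3.6-27B (Q4_K_M) | 20-30 tok/s |

Note: VRAM differs within some of these pairs. The RTX 4060 ships with 8 GB and the RTX 5090 with 32 GB, so read each row as applying to the listed VRAM size rather than to every card named.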

CPU-Only Performance

| Model | Quant | RAM | Speed |
| --- | --- | --- | --- |
| Qwen 2.5-1.5B | Q4_K_M | 4 GB | 8-15 tok/s |
| Qwen 2.5-7B | Q4_K_M | 16 GB | 1-4 tok/s |
| Qwen 2.5-7B | Q2_K | 8 GB | 2-6 tok/s |
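These figures are rough, so it is worth measuring your own setup. Ollama's `/api/generate` response reports `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds), and the ratio of the two is your tok/s. The response dictionary below is a made-up sample, not a real measurement.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response."""
    seconds = response["eval_duration"] / 1e9   # eval_duration is in nanoseconds
    return response["eval_count"] / seconds

# Made-up sample with only the two fields the calculation needs.
sample_response = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample_response):.1f} tok/s")
```

Running `ollama run <model> --verbose` prints a similar eval rate in the terminal if you prefer not to call the API.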

Common Pitfalls

| Problem | Cause | Fix |
| --- | --- | --- |
| "Model not found" after import | Modelfile path is wrong | Use absolute path: FROM /home/user/model.gguf |
| Gibberish output | Wrong chat template | The TEMPLATE line must match the model's expected format |
| Slow generation | Running on CPU | PARAMETER num_gpu 999 |
| CUDA out of memory | Quantization too large for VRAM | Try smaller quant (Q3_K_M instead of Q4_K_M) |
| Import errors | Corrupt GGUF download | Re-download and verify checksum |
| Temperature not working | Set in Modelfile but overridden in API | Per-request API options take precedence; set the same value in the request or omit it there |
| Chinese text output | Wrong template or default system prompt | Add `PARAMETER stop "< |
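The "temperature not working" row is easier to see with a request in hand. Options sent with an API call apply to that request and take precedence over the Modelfile values. The sketch builds a body for Ollama's `/api/generate` endpoint; `build_generate_request` is a hypothetical helper and `my-model` is a placeholder name.

```python
import json

def build_generate_request(model: str, prompt: str, **options) -> dict:
    """Build a /api/generate body; omit 'options' to fall back to Modelfile values."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options
    return payload

request_body = json.dumps(build_generate_request("my-model", "Hello", temperature=0.2))
print(request_body)

# To send it (assumes a local Ollama server on the default port 11434):
#   import urllib.request
#   req = urllib.request.Request("http://localhost:11434/api/generate",
#                                data=request_body.encode(),
#                                headers={"Content-Type": "application/json"})
#   print(json.load(urllib.request.urlopen(req))["response"])
```

If a client library silently sends its own default temperature, the Modelfile setting never takes effect, which is usually what is behind this pitfall.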

The tl;dr

  1. Download: wget <huggingface-url>/Model-Q4_K_M.gguf
  2. Create Modelfile: FROM ./Model-Q4_K_M.gguf + your settings
  3. Import: ollama create my-model -f Modelfile
  4. Run: ollama run my-model
  5. Profit: Free, private, local AI

Part of the Local LLM Guide — the definitive resource for running AI on your own hardware.