惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

N
News and Events Feed by Topic
Malwarebytes
Malwarebytes
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
C
Cybersecurity and Infrastructure Security Agency CISA
F
Future of Privacy Forum
C
Cisco Blogs
T
The Exploit Database - CXSecurity.com
A
Arctic Wolf
S
Securelist
K
Kaspersky official blog
S
Schneier on Security
T
ThreatConnect
T
Tenable Blog
Spread Privacy
Spread Privacy
T
True Tiger Recordings
AWS News Blog
AWS News Blog
F
Fox-IT International blog
量子位
T
Threatpost
V
Vulnerabilities – Threatpost
C
CERT Recently Published Vulnerability Notes
Cisco Talos Blog
Cisco Talos Blog
GbyAI
GbyAI
宝玉的分享
宝玉的分享
腾讯CDC
G
Google Developers Blog
aimingoo的专栏
aimingoo的专栏
Cyberwarzone
Cyberwarzone
有赞技术团队
有赞技术团队
S
SegmentFault 最新的问题
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
V
Visual Studio Blog
U
Unit 42
雷峰网
雷峰网
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
Simon Willison's Weblog
Simon Willison's Weblog
O
OpenAI News
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
The GitHub Blog
The GitHub Blog
The Register - Security
The Register - Security
MyScale Blog
MyScale Blog
小众软件
小众软件
A
About on SuperTechFans
Last Week in AI
Last Week in AI
Y
Y Combinator Blog
博客园 - 三生石上(FineUI控件)
美团技术团队
Google Online Security Blog
Google Online Security Blog
P
Proofpoint News Feed
MongoDB | Blog
MongoDB | Blog

DEV Community

Cx Dev Log — 2026-04-24 github's agent audit api is the boring feature that matters # From Teaching Code to Building Real-World Applications Vivado 2026.1 and Linux: why this decision matters beyond the headline Vivado 2026.1 y Linux: por qué la decisión importa más allá del titular ORA-00206 오류 원인과 해결 방법 완벽 가이드 Entidades finas e composição: o design que escolhi para a nova plataforma 10 Open Source Tools Every Developer Should Know 🔥 SSH Config File Mastery: Turning `~/.ssh/config` Into a Productivity Tool I tried to create a programming language... in python I Replaced 70MB Node.js Log Viewer with a 172KB Zig Binary I Turned npm outdated into a CI Gate — Here's How Don't fall for the Claude Mythos hype Vestige: A Gemma 4 Brain Tracker That Won't Blow Smoke Up Your Ass Gemminate: Transforming Static Textbooks into Interactive Learning Journeys with Gemma 4 Where Did All the Code Playgrounds Go? I built PROOFER - Privacy first Chrome extension that proofreads your texts using Gemma 4 I Automated My Entire Digital Product Business on a $13/Month GCP VM. Here's the Architecture. Beginner's Mind in Engineering and AI How I use AI agents to turn ideas into public demos I Built a Quotation Generator for Kenyan Street Welders Using Gemma 4's Vision The Math Behind Neural Networks — Explained Like Nobody Did for Me 🧨 Understanding TPC with IEEE802.11h What I’m Starting to Look for in Engineers An npm Downloads Comparison Chart in 300 Lines of Vanilla JS — Nice-Tick Math and API-Direct Fetch Vitreus: Local-First Spreadsheet Intelligence with Gemma 4 Transfer Fees, Metadata, and Soulbound Tokens: A Tour of Solana Token Extensions I got tired of re-explaining my codebase to ChatGPT — so I built a VS Code extension Revisiting My Phone AI After Gemma 4: The Upgrade I Didn't Know I Needed I built a privacy-first PDF merger in 7 hours — here's the stack and the lessons Google I/O 2026 made me ask an uncomfortable question: are we still coding, or are we managing builders? SSR with JavaScript: Escaping Node.js Clunkiness with AxonASP My CKA Exam-Day Experience: What Went Right, What Went Wrong, and Lessons Learned Gemma 4 Soft Tokens: The Rise and Fall of 16x16 Words ⚡👀 Two weeks ago, I built a private AI brain on my phone using Gemma 4. Yesterday, Google dropped a new variant that made everything I built feel like a beta test. 256M parameters. MoE architecture. Apache 2.0 license. I broke down what changed and why it mat I got tired of clicking through the Stripe dashboard, so I built a CLI Getting Data from Multiple Sources in Power BI: A Practical Guide to Modern Data Integration Google Is No Longer Just a Search Engine I built GemmaPod - A truly composable and portable AI agent solution powered by your local LLM Gemma 4 E4B caught three planted fabrications in 50 seconds — on a laptop, no cloud How to build an AI-powered content moderation pipeline for user comments Running Gemma 4 on a Modest Machine: Unsloth vs LM Studio vs llama.cpp vs Ollama AI Makes Building Cheap. Our Product Architectures Still Assume It’s Expensive. I built an in-browser Roku TV remote with ~80 lines of TypeScript. Here's how Roku's ECP API actually works The Direction of Blame babbled notes: a sound-to-music agent for people who could not make music before How I Built a Live SQL Workshop Where Students Can't Break Anything Rescuing a Stranded Protocol: Re-Skinning Legacy Code for the Trestle DeFi Flywheel SOLID Heuristics Reveal Incomplete Domain Knowledge — Nothing More AllasCode Intitute / FullAgenticStack: The Intent-Based Router Introducing LogicGrid — Multi-Agent AI Orchestration for .NET AI Prompt Injection, Drupal SQLi Exploitation, and Nmap for Hardening AI Agents & Python Workflows: Anthropic Skills, Jupyter Challenges, and Edge Deployment SQLite Optimization, PostgreSQL Async Queries, & DuckLake Dataframe Spec RTX 5080 Undervolt Benchmarks, CGO-Free CUDA API Binding, & AMD GPU Compatibility Fix Microsoft Burned Its 2026 AI Budget on Claude Code in Six Months. That's the Real Story. Why I Started Learning FastAPI in 2026 I Abandoned Ghost for Months — Then Came Back and Finally Finished It Building an Open MIT-Licensed Ephemeris Engine in C — JPL Moshier Ephemeris 4 Smart Ways to Manage Retries in Side Projects Securing Web APIs: A Practical Guide to Authentication & Authorization Methods Google I/O 2026: AI Built an OS in 12 Hours. I Spent Mine Sorting Screenshots. 🤦 Half a Day, Not a Week: One Nix Flake for Three Machines 🌱 Keep Feeding Your CI/CD — Or Watch It Die Gemma 4 vs GPT-4o vs Llama 3: What Actually Works Locally? Vessel Ops SSH in 2026: Why Every Developer Should Know It Cold Audit AI-Generated PRs Before You Merge Them (Swarm Orchestrator 10.3.0) App Store Optimization (ASO) I built a tool to visualize Django REST Framework architecture (URLs, Serializers, Models, and more) How I made my React site agent-ready in 100 lines AI Can Generate Interfaces on the Fly. But Users Still Need Orientation. AI-Assisted Content Workflow How We Learned That Most Resume Rejections Happen Before Humans See Your CV How I Prepared for CKA: Resources, Labs, and Strategy That Worked for Me Remix Mini PC: Moving the Whole Operating System Onto the eMMC Stop Flying Blind: We Built an LLM Evaluation Framework That Works Across 17+ Agent Frameworks The Misleading "User is not authorized to access connection" Error in AWS CodeBuild — and Why Your IAM Policy Looks Fine I Resurrected a Dead F1 Project and Accidentally Built a Race Intelligence OS Remix Mini PC: After a Year of Dead Ends, the eMMC Finally Talks Not All Games Are Equal: The Real Difference Between a Trap and a Tool How to add Peppol e-invoicing to your SaaS without making it your team's problem I Built a Hermes Agent to Tell Me Which Hackathons to Enter. It Told Me to Enter This One. The Five Hooks That Change How You Ship With Claude Code Powering Your Progress: Building Robust Solutions with Laravel I built a self-hosted CI/CD platform with persistent queue, encrypted secrets, and rollback UI — here's what I learned Antigravity 2.0 and the $1,000 OS: Why "Agent-First" Feels Like the Direction I've Been Building Toward Anyway I built an AI PR-triage agent in 30 lines of Markdown Core Web Vitals from 74 to 91: A Real Tax Practitioner Site Rebuild I Gave Gemma 4 150 Tools on Windows. Here's What Actually Happened. Beyond the Loop: Why Monolithic AI Agents Fail and How to Build a Microkernel Architecture The Hidden Tax of AI-Assisted Development (And How I Fixed It) I Ditched Cloud LLMs for Gemma 4 4B: A DevOps Engineer's 48-Hour Reality Check Building a Schema.org @graph That Validates on the First Try The "Lift and Shift" Trap: Why Your Integration Layer Needs More Than Just a Cloud Address All 7 OSI Layers Explained with Real-World Analogies Antigravity 2.0 in one day: the four shells and what each is good for Self-Hosting Google Fonts with size-adjust: Zero CLS Web Font Swap The Multi-Provider LLM Problem: Why “One API” Is Not Enough How I indexed 69,000 Claude Code skills (and what I learned doing it)
로컬 LLM 셋업 가이드 (v18)
matias yoon · 2026-05-25 · via DEV Community

matias yoon

Local LLM Setup Guide (v18)

1. Overview & Prerequisites

Running LLMs locally requires minimal hardware but careful resource management. This guide assumes:

  • Ubuntu 20.04/22.04 or Debian 11/12
  • 8GB+ RAM (16GB+ recommended)
  • NVIDIA GPU with CUDA support (RTX 3060+), or CPU-only setup
  • 20GB+ free disk space for models

For GPU-accelerated inference, install CUDA:

# Install NVIDIA drivers
sudo apt update
sudo apt install nvidia-driver-535

# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-4

Enter fullscreen mode Exit fullscreen mode

2. Framework Comparison

Framework GPU Support Ease of Use Performance Best For
llama.cpp Yes Medium Fast Quick prototyping
Ollama Yes Easy Fast Development/testing
vLLM Yes Medium Fastest Production inference
LocalAI Yes Easy Fast API-first workflows

Recommendation: Use llama.cpp with Ollama for development workflow.

3. Step-by-Step Installation

Install llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Enter fullscreen mode Exit fullscreen mode

Install Ollama (for easier model management):

curl -fsSL https://ollama.com/install.sh | sh

Enter fullscreen mode Exit fullscreen mode

Test installation:

ollama run llama3:8b

Enter fullscreen mode Exit fullscreen mode

Setup model directory:

mkdir -p ~/llm-models
cd ~/llm-models

Enter fullscreen mode Exit fullscreen mode

4. Model Selection Guide

Use Case: Code Generation

  • Model: codellama:7b or phi3:3.8b
  • RAM: 8GB minimum
  • Command: ollama run codellama:7b

Use Case: Chatbot

  • Model: llama3:8b or mistral:7b
  • RAM: 8GB minimum
  • Command: ollama run llama3:8b

Use Case: High Precision

  • Model: llama3:70b or mixtral:8x7b
  • RAM: 16GB minimum
  • Command: ollama run llama3:70b

5. Quantization Types Explained

Quantization reduces model size while maintaining performance:

  • Q4_K_M: 4-bit quantization, 4.5GB for 7B model
  • Q5_K_M: 5-bit quantization, 5.5GB for 7B model
  • Q8_0: 8-bit quantization, 8GB for 7B model
  • F16: Full precision, 16GB for 7B model

Example: Download and convert model:

# Download 7B model
ollama pull llama3:8b

# Convert to Q4_K_M (smallest size, good performance)
ollama run llama3:8b --quantize Q4_K_M

Enter fullscreen mode Exit fullscreen mode

6. API Setup and Integration

Create API server with llama.cpp:

# Start llama.cpp server
./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192

Enter fullscreen mode Exit fullscreen mode

Test API:

curl http://localhost:1234/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Write a Python function to reverse a string.",
        "temperature": 0.7,
        "max_tokens": 100
    }'

Enter fullscreen mode Exit fullscreen mode

Integrate with Python:

import requests

def llm_query(prompt):
    response = requests.post(
        'http://localhost:1234/completion',
        json={
            'prompt': prompt,
            'temperature': 0.7,
            'max_tokens': 200
        }
    )
    return response.json()['content']

# Usage
result = llm_query("Explain quantum computing in simple terms")

Enter fullscreen mode Exit fullscreen mode

7. Systemd Service for 24/7 Operation

Create service file:

sudo nano /etc/systemd/system/llm-server.service

Enter fullscreen mode Exit fullscreen mode

Content:

[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=your_username
WorkingDirectory=/home/your_username/llama.cpp
ExecStart=/home/your_username/llama.cpp/server \
    -m /home/your_username/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enter fullscreen mode Exit fullscreen mode

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable llm-server
sudo systemctl start llm-server

Enter fullscreen mode Exit fullscreen mode

8. Monitoring and Performance Tuning

Monitor GPU usage:

nvidia-smi -l 1  # Update every second

Enter fullscreen mode Exit fullscreen mode

Monitor memory usage:

watch -n 1 free -h

Enter fullscreen mode Exit fullscreen mode

Benchmark inference:

# Test 100 token generation
time ./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    --prompt "The future of AI is" \
    --max-tokens 100 \
    --threads 8

Enter fullscreen mode Exit fullscreen mode

Performance tuning parameters:

  • --ctx-size: 8192 for 8B models, 16384 for 70B models
  • --threads: CPU cores / 2 for optimal performance
  • --n-gpu-layers: Number of layers on GPU (default: 100 for 8B models)

9. Real Command Examples

Full workflow example:

# 1. Install dependencies
sudo apt update
sudo apt install git cmake build-essential

# 2. Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# 3. Download model
ollama pull llama3:8b

# 4. Start server
./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192 \
    --n-gpu-layers 100

# 5. Test API
curl http://localhost:1234/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello world", "max_tokens": 10}'

Enter fullscreen mode Exit fullscreen mode

Production-ready startup script:

#!/bin/bash
# ~/start-llm.sh

MODEL_PATH="$HOME/llm-models/llama3-8b-Q4_K_M.gguf"
PORT=1234

if [ ! -f "$MODEL_PATH" ]; then
    echo "Model not found at $MODEL_PATH"
    exit 1
fi

echo "Starting LLM server on port $PORT..."
./server \
    -m "$MODEL_PATH" \
    --host 0.0.0.0 \
    --port $PORT \
    --threads 8 \
    --ctx-size 8192 \
    --n-gpu-layers 100

Enter fullscreen mode Exit fullscreen mode

This setup provides a production-ready local LLM infrastructure with minimal hardware requirements and optimal performance. The combination of llama.cpp for low-level control and Ollama for easy model management gives developers the best of both worlds for local LLM development.


📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)