Local LLM Setup Guide (v23)
matias yoon · 2026-05-25 · via DEV Community


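1. Overview and Prerequisites

Running a large language model (LLM) locally is a cost-effective and fairly simple way to add AI features to your own tools. This guide covers a practical way to set up and tune a local LLM on a Linux system.

Prerequisites

  • Operating system: Ubuntu 20.04 or later, or Debian 11 or later
  • Hardware:
    • GPU: NVIDIA RTX 30xx or newer (at least 8GB VRAM)
    • CPU: at least 8 cores
    • RAM: at least 32GB (64GB or more recommended)
    • Storage: at least 100GB of free space

System Check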

# Check the GPU
nvidia-smi

# Check RAM
free -h

# Check the CPU
lscpu

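The manual checks above can also be scripted. The following is a minimal sketch, assuming a Linux machine with /proc/meminfo; the function names and the 90% RAM tolerance are illustrative choices, not part of the original guide. The tolerance exists because the kernel reports slightly less memory than is physically installed.

```python
import os
import shutil

# Minimums taken from the prerequisites list above.
MIN_CORES = 8
MIN_RAM_GB = 32
MIN_DISK_GB = 100
RAM_TOLERANCE = 0.9  # assumption: MemTotal is a bit below the installed RAM


def parse_mem_total_gb(meminfo_text):
    """Extract MemTotal from /proc/meminfo-style text and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def check_requirements(cores, ram_gb, disk_gb, has_nvidia_smi):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU: {cores} cores, need at least {MIN_CORES}")
    if ram_gb < MIN_RAM_GB * RAM_TOLERANCE:
        problems.append(f"RAM: {ram_gb:.1f} GB, need at least {MIN_RAM_GB} GB")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {disk_gb:.1f} GB free, need at least {MIN_DISK_GB} GB")
    if not has_nvidia_smi:
        problems.append("GPU: nvidia-smi not found on PATH")
    return problems


def collect_and_check(path="."):
    """Gather the values from the current machine (Linux only) and check them."""
    with open("/proc/meminfo") as f:
        ram_gb = parse_mem_total_gb(f.read())
    disk_gb = shutil.disk_usage(path).free / 1e9
    has_smi = shutil.which("nvidia-smi") is not None
    return check_requirements(os.cpu_count() or 0, ram_gb, disk_gb, has_smi)
```

Calling `collect_and_check()` on the target machine returns the list of unmet requirements. It only confirms that `nvidia-smi` is installed, so VRAM still has to be read from the `nvidia-smi` output itself.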

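2. Framework Comparison

| Framework | Pros | Cons | Recommended use case |
| --- | --- | --- | --- |
| llama.cpp | Quick to install, optimized C++ implementation, minimal dependencies | More manual setup (build from source, manage model files yourself) | Running a single model |
| Ollama | Easy install, simple API, image-style model packaging | Higher memory usage | Development and test environments |
| vLLM | Best throughput, high-volume token processing | Complicated installation | Production environments |
| LocalAI | Broad API compatibility, cloud integration | Limited technical support | API-driven applications |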

3. Recommended Setup - Installing llama.cpp

llama.cpp is the recommended choice for this guide: it is simple, fast, and well optimized.

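# Preparation before installing
sudo apt update
sudo apt install build-essential git -y

# Download llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Compile. These steps match older, Makefile-based checkouts. Recent releases
# build with CMake instead (cmake -B build && cmake --build build --config Release)
# and name the server binary llama-server rather than server.
make clean
make

# Python dependencies (only needed for the model conversion scripts)
pip install -r requirements.txt

# Download a model (example: LLaMA-2 7B)
# Note: the URL below is a placeholder. Replace it with the real path of a GGUF file,
# which has the form https://huggingface.co/<owner>/<repo>/resolve/main/<file>.gguf
mkdir -p models
wget https://huggingface.co/llamav2-7b/resolve/main/llama-2-7b.gguf -O models/llama-2-7b.gguf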

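A failed or mistyped download often leaves an HTML error page on disk under the model's file name. A quick sanity check is to read the file header: GGUF files begin with the four magic bytes `GGUF`, followed by a little-endian 32-bit format version. The helper below is a minimal sketch; the function name is hypothetical and not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # Bytes 4..8 hold the format version as an unsigned little-endian integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, `read_gguf_version("models/llama-2-7b.gguf")` should return a small integer for a valid model and raise `ValueError` for anything else. It only inspects the header, so it does not detect a truncated download.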

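4. Model Selection Guide

| Use case | Recommended model | Notes |
| --- | --- | --- |
| General text generation | LLaMA-2 7B (Q5_K_M) | Balanced performance and accuracy |
| Fast inference | Mistral 7B (Q4_K_M) | Fast inference speed |
| High precision | Phi-3 3.8B (Q4_K_M) | Precise answers |
| Code generation | CodeLLaMA 7B (Q4_K_M) | Programming-related tasks |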

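5. Quantization Types

# Quantization types
# Q4_K_M: 4-bit quantization, good balance of speed, size, and accuracy
# Q5_K_M: 5-bit quantization, better accuracy than Q4_K_M
# Q6_K: 6-bit quantization, close to the accuracy of Q8_0
# Q8_0: 8-bit quantization, highest accuracy of the four and the largest files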

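To get a feel for what each quantization level means on disk, the weight storage can be approximated as parameters × bits ÷ 8. The sketch below uses only the nominal bit widths listed above, which is a simplifying assumption: real GGUF files come out somewhat larger because they also store per-block scales and metadata.

```python
# Nominal bit widths from the list above. Treat the results as lower bounds:
# actual files also contain quantization scales and metadata.
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def estimate_size_gb(n_params, quant):
    """Approximate weight storage in GB (10^9 bytes) for a quantization type."""
    bits = NOMINAL_BITS[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    # Example: a 7-billion-parameter model at each quantization level.
    for quant in NOMINAL_BITS:
        print(f"{quant}: at least {estimate_size_gb(7e9, quant):.2f} GB")
```

The same number is a rough floor for the RAM or VRAM needed to hold the weights. The context buffer adds to it at run time.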

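Model Conversion Example

# Convert the Hugging Face checkpoint to a 16-bit GGUF file first.
# (--outtype does not accept K-quant types such as q5_k_m; those come from the quantize tool.)
python3 convert-hf-to-gguf.py models/llama-2-7b/ --outtype f16 --outfile models/llama-2-7b-f16.gguf

# Q5_K_M quantization
./quantize models/llama-2-7b-f16.gguf models/llama-2-7b-q5k.gguf Q5_K_M

# Q4_K_M quantization
./quantize models/llama-2-7b-f16.gguf models/llama-2-7b-q4k.gguf Q4_K_M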


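6. API Setup and Tool Integration

# Start the llama.cpp API server (runs in the foreground)
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 to keep it local-only
./server -m models/llama-2-7b-q5k.gguf -c 2048 --host 0.0.0.0 --port 8080

# Test the API (from a second terminal)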
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello, how are you?",
    "n_predict": 128,
    "temperature": 0.7
  }'


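External Tool Integration Example (Python)

import requests

def llama_completion(prompt, max_tokens=128, temperature=0.7):
    """Send a prompt to the local llama.cpp server and return the generated text."""
    response = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": prompt,
            "n_predict": max_tokens,
            "temperature": temperature
        },
        timeout=120,  # generation can be slow on CPU; avoid hanging forever
    )
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()['content']

# Usage example
result = llama_completion("How do I parse JSON in Python?")
print(result)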

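Loading a 7B model can take a while, and requests sent before loading finishes will fail. One way to handle this is to poll the server until it reports ready. The sketch below assumes the server build exposes a `/health` endpoint that answers HTTP 200 once the model is loaded (recent llama.cpp servers do; verify this for your checkout). The function names and retry defaults are illustrative.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(base_url, attempts=30, delay=1.0,
                     probe=probe_health, sleep=time.sleep):
    """Poll the server up to `attempts` times; return True once it is ready."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A script would typically call `wait_until_ready("http://localhost:8080")` once at startup and only then send completion requests. The `probe` and `sleep` parameters exist so the retry logic can be tested without a running server.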

7. Systemd Service Setup

Set up a systemd service so the server keeps running around the clock.

# Create the service file
sudo nano /etc/systemd/system/llama.service

# Service file contents
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=your_user
WorkingDirectory=/home/your_user/llama.cpp
ExecStart=/home/your_user/llama.cpp/server -m /home/your_user/llama.cpp/models/llama-2-7b-q5k.gguf -c 2048 --host 0.0.0.0 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable llama.service
sudo systemctl start llama.service


8. Monitoring and Performance Optimization

Performance Monitoring Script

#!/bin/bash
# Performance monitoring script (monitor.sh)
while true; do
    echo "=== Memory Usage ==="
    free -h
    echo "=== GPU Usage ==="
    nvidia-smi
    echo "=== CPU Load ==="
    top -bn1 | grep "Cpu(s)"
    sleep 30
done

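The full `nvidia-smi` table is awkward to log or alert on. As a sketch of a more machine-friendly variant, the snippet below uses `nvidia-smi`'s CSV query mode and parses one row per GPU. The function names and dictionary keys are illustrative, and the parser assumes every field is numeric (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import subprocess

# Query only the fields we need, as bare numbers (MiB and percent).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'used, total, util' CSV rows into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total, util = (int(field.strip()) for field in line.split(","))
        rows.append({"mem_used_mib": used, "mem_total_mib": total, "util_pct": util})
    return rows


def read_gpus():
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `read_gpus()` in a loop gives values that are easy to write to a log file or compare against a threshold, for example warning when `mem_used_mib` approaches `mem_total_mib`.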

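Optimization Options

# Fast inference (lower memory use)
./server -m models/llama-2-7b-q5k.gguf -c 512 -n 128

# Maximum performance (higher memory use)
./server -m models/llama-2-7b-q5k.gguf -c 2048 -n 2048 --threads 8

# GPU memory optimization (needs a GPU-enabled build; a plain make build is CPU-only)
./server -m models/llama-2-7b-q5k.gguf --gpu-layers 30 -c 1024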

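Choosing a value for `--gpu-layers` is usually trial and error. A rough starting point, sketched below, divides the model file size evenly across its layers and counts how many fit in free VRAM after holding some back for the context buffer. This is a hypothetical heuristic rather than anything llama.cpp computes: the even split and the 1.5 GB reserve are assumptions, and the layer count must be supplied by the user (LLaMA-2 7B has 32 transformer layers).

```python
import math


def estimate_gpu_layers(model_size_gb, n_layers, free_vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit on the GPU, assuming equally sized layers."""
    if model_size_gb <= 0 or n_layers <= 0:
        raise ValueError("model_size_gb and n_layers must be positive")
    per_layer_gb = model_size_gb / n_layers
    usable_gb = free_vram_gb - reserve_gb  # keep room for context and scratch buffers
    if usable_gb <= 0:
        return 0
    return min(n_layers, math.floor(usable_gb / per_layer_gb))
```

For instance, `estimate_gpu_layers(4.8, 32, 4.0)` suggests starting around 16 layers on a card with 4 GB free. Lower the value if the server runs out of memory, and raise it if `nvidia-smi` shows VRAM to spare.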

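9. Real-World Performance Benchmark

Inference Performance Test

# Start a test server on a separate port
./server -m models/llama-2-7b-q5k.gguf -c 2048 --port 8081

# Quick load test. /completion expects a POST with a JSON body, so give ab a payload file.
echo '{"prompt": "The capital of France is", "n_predict": 10}' > payload.json
ab -n 10 -c 5 -p payload.json -T application/json http://localhost:8081/completion

# Single request against the test server, printing the total time
curl -X POST http://localhost:8081/completion \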
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "n_predict": 10}' \
  -w "%{time_total}s\n"

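A single timed request says little about variance. The sketch below times repeated calls and reports mean, median, and a nearest-rank 95th percentile. It is deliberately independent of any HTTP library: `send_request` is any zero-argument callable, such as a lambda wrapping the `llama_completion` function from section 6. The function names are illustrative.

```python
import math
import statistics
import time


def time_calls(send_request, n=10, clock=time.perf_counter):
    """Call send_request n times and return the latency of each call in seconds."""
    latencies = []
    for _ in range(n):
        start = clock()
        send_request()
        latencies.append(clock() - start)
    return latencies


def summarize(latencies):
    """Return mean, median, nearest-rank p95, and max of the latencies."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }
```

Typical usage would be `summarize(time_calls(lambda: llama_completion("The capital of France is", max_tokens=10), n=20))`. Requests run sequentially here, so this measures per-request latency rather than throughput under concurrent load.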

Inference Timings (Example)

LLaMA-2 7B (Q5_K_M):
- Context length 512: 0.8 s
- Context length 1024: 1.2 s
- Context length 2048: 2.1 s

Mistral 7B (Q4_K_M):
- Context length 512: 0.5 s
- Context length 1024: 0.9 s
- Context length 2048: 1.6 s


10. Practical Use Cases


📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)