Local LLM Setup Guide (v27)
matias yoon · 2026-05-25 · via DEV Community


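1. Overview and Prerequisites

Running an LLM locally is increasingly popular because it keeps data private, cuts latency, and lowers cost. This guide walks through running LLMs efficiently on a Linux system, step by step.

Prerequisites:

  • OS: Ubuntu 20.04 or later (other Linux distributions also work)
  • CPU: at least 4 cores (8 or more recommended)
  • RAM: at least 16GB (32GB or more recommended)
  • GPU: NVIDIA GTX 10xx or newer (RTX 30xx or newer recommended)
  • Storage: SSD, 50GB or more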
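
The prerequisites above can be checked with a short script before installing anything. This is a minimal sketch, not part of the original guide: the function names are hypothetical, the thresholds are copied from the list, and the GPU check only looks for `nvidia-smi` on the PATH rather than identifying the card model.

```python
import os
import shutil

# Minimums taken from the prerequisites list above.
MIN_CORES = 4
MIN_RAM_GB = 16
MIN_DISK_GB = 50


def check_prereqs(cores, ram_gb, disk_gb, has_nvidia_smi):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU: {cores} cores, need at least {MIN_CORES}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.1f}GB, need at least {MIN_RAM_GB}GB")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {disk_gb:.1f}GB free, need at least {MIN_DISK_GB}GB")
    if not has_nvidia_smi:
        problems.append("GPU: nvidia-smi not found on PATH (NVIDIA driver missing?)")
    return problems


def read_total_ram_gb(meminfo_path="/proc/meminfo"):
    """Read total RAM from /proc/meminfo (Linux only); returns 0.0 if unavailable."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    # The value is reported in kB.
                    return int(line.split()[1]) / 1024 / 1024
    except OSError:
        pass
    return 0.0


if __name__ == "__main__":
    issues = check_prereqs(
        cores=os.cpu_count() or 0,
        ram_gb=read_total_ram_gb(),
        disk_gb=shutil.disk_usage(".").free / 1e9,
        has_nvidia_smi=shutil.which("nvidia-smi") is not None,
    )
    print("\n".join(issues) if issues else "All prerequisite checks passed.")
```

Two caveats: `os.cpu_count()` counts logical CPUs, not physical cores, and `MemTotal` is slightly below the installed amount, so a nominal 16GB machine may fail the RAM check unless the threshold is loosened a little.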

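2. Framework Comparison

| Framework | Strengths | Weaknesses | Recommended use |
|---|---|---|---|
| llama.cpp | Minimal dependencies, supports many quantization formats | Involved installation, little automation | Development and testing, small projects |
| Ollama | Easy installation, API-based | Limited performance, supports only certain models | Rapid prototyping |
| vLLM | Highest performance, large-scale inference | Needs advanced configuration, resource-intensive | Real-time inference, server environments |
| LocalAI | API compatibility, multiple backends | Complex configuration, high memory requirements | Enterprise environments, API integration |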

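3. Recommended Setup: llama.cpp + vLLM + Ollama

3.1 Installing llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Current llama.cpp builds with CMake; binaries land in build/bin/
# Add -DGGML_CUDA=ON to the first cmake command for NVIDIA GPU support
cmake -B build
cmake --build build --config Release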

3.2 Installing vLLM

pip install vllm

3.3 Installing Ollama

curl -fsSL https://ollama.com/install.sh | sh

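4. Model Selection Guide

| Model | Approx. size | Recommended use | Recommended quantization |
|---|---|---|---|
| Llama3-8B | ~4.9GB | General text generation | Q4_K_M |
| Llama3-70B | ~50GB | Advanced reasoning, large projects | Q5_K_M |
| Mistral-7B | ~4.4GB | Conversational AI | Q4_K_M |
| Phi-3 Mini | ~2.4GB | Fast inference | Q4_K_M |

Sizes are approximate GGUF file sizes at the listed quantization.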

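5. Quantization Types Explained

| Format | Description | Output quality | Memory use (approx., for an 8B model) |
|---|---|---|---|
| Q4_K_M | 4-bit k-quant, medium mix | High | ~4.9GB |
| Q5_K_M | 5-bit k-quant, medium mix | Very high | ~5.7GB |
| Q6_K | 6-bit k-quant | Near F16 | ~6.6GB |
| F16 | Half precision (unquantized) | Highest | ~16GB |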
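
The memory figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below makes that explicit. The bits-per-weight values are rough averages assumed for illustration (k-quants mix several bit widths, so real files vary by model), and the function names and the fixed overhead allowance are hypothetical.

```python
# Approximate bits per weight for each format. These are rough, assumed
# averages; real files vary by model architecture.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "F16": 16.0,
}


def estimate_size_gb(params_billion, quant):
    """Estimate weight size in GB (1 GB = 1e9 bytes) for a parameter count in billions."""
    bits = BITS_PER_WEIGHT[quant]
    # (params_billion * 1e9 weights) * bits / 8 bytes, divided by 1e9 bytes per GB.
    return params_billion * bits / 8


def fits_in_memory(params_billion, quant, available_gb, overhead_gb=1.5):
    """Check whether weights plus a fixed allowance for KV cache and buffers fit."""
    return estimate_size_gb(params_billion, quant) + overhead_gb <= available_gb


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"8B @ {quant}: ~{estimate_size_gb(8, quant):.1f} GB")
```

The overhead allowance is a placeholder: actual KV-cache size grows with context length, so treat `fits_in_memory` as a first filter rather than a guarantee.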

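6. API Setup and Integration with Existing Tools

6.1 Starting the llama.cpp API Server

# The HTTP server binary is llama-server, built by CMake into build/bin/
# (./main was the old CLI binary, not a server)
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
./build/bin/llama-server -m ./models/llama3-8b.Q4_K_M.gguf \
      -c 2048 \
      --host 0.0.0.0 \
      --port 8080 \
      --log-disable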

6.2 Ollama API Integration

# Download the model (pull fetches it; push would upload)
ollama pull llama3:8b

# Start the server (skip if the installer already runs Ollama as a service)
ollama serve

# API request
curl http://localhost:11434/api/generate \
    -d '{
        "model": "llama3:8b",
        "prompt": "Hello, how are you?",
        "stream": false
    }'
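
The same request can be made from Python with only the standard library. This is a sketch under the same assumptions as the curl call above (default port 11434, model `llama3:8b`, non-streaming); the helper names are hypothetical. With `"stream": false`, Ollama returns a single JSON object whose `response` field holds the generated text.

```python
import json
import urllib.request


def build_generate_request(prompt, model="llama3:8b", base_url="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120, **kwargs):
    """Send the request and return the generated text from the "response" field."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    try:
        print(generate("Hello, how are you?"))
    except OSError as exc:  # URLError and timeouts are OSError subclasses
        print(f"Ollama is not reachable: {exc}")
```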

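6.3 vLLM API Server

# vllm serve starts an OpenAI-compatible server.
# --quantization takes a method name such as awq or gptq and needs a matching
# pre-quantized checkpoint; "q4" is not a valid value, so the flag is omitted here.
vllm serve meta-llama/Llama-3.2-1B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1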
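
Once the server is up, a client can talk to it over HTTP. The sketch below assumes the server exposes vLLM's OpenAI-compatible API (the `/v1/completions` route that `vllm serve` provides) on port 8000; if you run a different entrypoint, the route and response shape may differ. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model="meta-llama/Llama-3.2-1B",
                             base_url="http://localhost:8000", max_tokens=64):
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of an OpenAI-style completions response."""
    return response_body["choices"][0]["text"]


if __name__ == "__main__":
    req = build_completion_request("The future of AI is")
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            print(extract_text(json.loads(resp.read())))
    except OSError as exc:  # connection refused, timeout, or HTTP error status
        print(f"vLLM server is not reachable: {exc}")
```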

7. Systemd Service Setup

sudo nano /etc/systemd/system/llama.service


[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/build/bin/llama-server -m /home/developer/models/llama3-8b.Q4_K_M.gguf -c 2048 --host 0.0.0.0 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target


sudo systemctl daemon-reload
sudo systemctl enable llama.service
sudo systemctl start llama.service
sudo systemctl status llama.service
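
`systemctl status` only shows that the process started, not that the model has finished loading. A small readiness probe can cover that gap. This sketch assumes the service runs llama-server on port 8080, which answers `GET /health` with HTTP 200 once it is ready; the function names and retry settings are hypothetical.

```python
import time
import urllib.request


def is_healthy(url="http://localhost:8080/health", timeout=5):
    """Return True if the server answers the health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error status
        return False


def wait_until_healthy(probe=is_healthy, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll the probe until it succeeds; return False after the last failed attempt."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_healthy(attempts=3, delay=1.0)
    print("server ready" if ready else "server not ready")
```

Passing the probe and the sleep function as parameters keeps the retry loop testable without a running server.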

8. Monitoring and Performance Optimization

8.1 Performance Monitoring Script

#!/bin/bash
# monitor.sh
while true; do
    echo "$(date): GPU Usage $(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)"
    echo "$(date): Memory Usage $(free -h | grep Mem | awk '{print $3}')"
    sleep 30
done
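
If the numbers need to feed a dashboard or an alert rather than a terminal, the same `nvidia-smi` query can be parsed in Python. This is a sketch with hypothetical function names; it extends the query above with `memory.used` and `memory.total`, and it assumes every field comes back as an integer (some GPUs report `[N/A]` for certain fields, which this parser does not handle).

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into dicts, one per GPU (memory values are in MiB)."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def read_gpu_stats():
    """Run nvidia-smi and return parsed stats, or an empty list if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for i, gpu in enumerate(read_gpu_stats()):
        print(f"GPU {i}: {gpu['util_pct']}% util, "
              f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB")
```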

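8.2 Optimization Commands

# Keep CUDA kernel launches asynchronous (0 is already the default; 1 is a debugging aid)
export CUDA_LAUNCH_BLOCKING=0

# Thread count for OpenMP-based libraries; match it to your physical core count
export OMP_NUM_THREADS=8

# Cache directory for the Hugging Face datasets library (model downloads follow HF_HOME)
export HF_DATASETS_CACHE=/tmp/hf_cache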

8.3 Benchmark Tests

# llama.cpp benchmark: llama-bench reports prompt and generation speed in tokens/s
./build/bin/llama-bench -m ./models/llama3-8b.Q4_K_M.gguf -p 512 -n 128

# vLLM: start the server with a capped context length, then time requests against it
vllm serve meta-llama/Llama-3.2-1B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 2048
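
To compare backends on equal terms, it helps to time the same workload several times and report tokens per second. The sketch below is backend-agnostic and entirely hypothetical in its naming: `generate` stands for any callable that performs one request and returns the number of tokens produced, and the `fake_generate` stand-in exists only so the script runs without a server.

```python
import statistics
import time


def measure_throughput(generate, runs=3, clock=time.perf_counter):
    """Call generate() several times and return tokens/s for each run.

    generate must return the number of tokens it produced.
    """
    rates = []
    for _ in range(runs):
        start = clock()
        n_tokens = generate()
        elapsed = clock() - start
        rates.append(n_tokens / elapsed)
    return rates


def summarize(rates):
    """Return mean and sample standard deviation (0.0 for a single run)."""
    spread = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return statistics.mean(rates), spread


if __name__ == "__main__":
    def fake_generate():
        # Stand-in for a real request; pretends to emit 128 tokens.
        time.sleep(0.05)
        return 128

    mean, spread = summarize(measure_throughput(fake_generate))
    print(f"{mean:.1f} tokens/s (+/- {spread:.1f})")
```

Discarding the first run is often worthwhile in practice, since it may include model warm-up that later runs do not.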

9. Practical Usage Examples

9.1 Custom API Client

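import requests


def call_local_llm(prompt, host="localhost", port=8080, timeout=120):
    """Send a prompt to llama-server's /completion endpoint and return the parsed JSON."""
    response = requests.post(
        f"http://{host}:{port}/completion",
        json={
            "prompt": prompt,
            "n_predict": 128,
            "temperature": 0.7,
            "stop": ["\n\n"]
        },
        timeout=timeout,  # avoid hanging forever if the server is down
    )
    response.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    return response.json()


# Usage example
result = call_local_llm("How do I create a DataFrame in Python?")
print(result['content'])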

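9.2 Interactive CLI

# Interactive mode (llama-cli replaces the old ./main binary; check --help, since flags vary by version)
./build/bin/llama-cli -m ./models/llama3-8b.Q4_K_M.gguf -c 2048 --interactive

# Or pass a one-off prompt
./build/bin/llama-cli -m ./models/llama3-8b.Q4_K_M.gguf -p "Hello, world!"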

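10. Conclusion and Tips

This guide is designed to help developers set up a local LLM environment quickly and efficiently. Key points:

  1. Model selection: choose a model and quantization level that fit the task.
  2. Resource optimization: vLLM performs well but needs a lot of memory.
  3. Automation: use a systemd service to keep the server running at all times.
  4. Monitoring: monitor performance continuously and adjust as needed.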

11. Quick-Start Commands

# 1. Basic installation
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build --config Release

# 2. Download a model
mkdir models && cd models
wget https://huggingface.co/QuantFactory/Llama3-8B-4bit/resolve/main/Llama3-8B-4bit.gguf

# 3. Server

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)
