Run NVIDIA NIM on Your Own GPU — Same API, Different Endpoint
Torkian · 2026-05-25 · via DEV Community

For Parts 1 through 3 we've been calling NIM through NVIDIA's hosted API Catalog at build.nvidia.com. That's the right starting point. It is also not the only place NIM runs.

NIM ships as a Docker container that exposes the same OpenAI-compatible HTTP API on a local port. Pull the image, run it on a box with an NVIDIA GPU, and the only thing that changes in the Python client is the base_url (plus a placeholder api_key, since the local container doesn't validate it). The ask() function from Part 1, the retriever from Part 2, and the guardrails from Part 3 all keep working against the new endpoint, unchanged.

This post walks through the swap and the reasons you might want it.

I'm B Torkian, NVIDIA Developer Champion at USC. Same series, same code, just moving where inference happens.


Why bother running NIM locally

The hosted API Catalog is the right default. Don't switch until at least one of these matters:

  • Data locality. The data you're sending the model has to stay on a machine you control. (Common at universities, hospitals, regulated industries.) USC has a research GPU cluster — for projects where the source documents can't leave that environment, the model has to come to the data, not the other way around.
  • Predictable latency. Network round-trip, queue time, and first-token latency add up. A locally hosted model gives you a tighter, more predictable budget.
  • A real understanding of what's in the box. The hosted API hides a lot of useful detail. Running the container yourself surfaces the model files, the inference server, the GPU memory layout, and what knobs you actually have.
  • Cost at scale. Past a certain volume, running the model on hardware you already own becomes cheaper than per-token billing.

None of those matter for a 30-minute workshop. All of them might matter for the project the workshop is teaching you to build.
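
The cost-at-scale point is simple arithmetic once you have your own numbers. A minimal sketch follows. The function name is hypothetical, and both figures in the example are made-up placeholders, not real NVIDIA or cloud prices:

```python
def breakeven_tokens_per_month(fixed_monthly_cost, hosted_price_per_million_tokens):
    """Monthly token volume above which self-hosting is cheaper.

    Both inputs are placeholders; plug in your own hardware cost
    and whatever per-token price you are actually quoted.
    """
    if hosted_price_per_million_tokens <= 0:
        raise ValueError("hosted price must be positive")
    return fixed_monthly_cost / hosted_price_per_million_tokens * 1_000_000


# Illustrative numbers only: $600/month to run a GPU node you already own,
# $0.30 per million tokens on a hosted plan.
tokens = breakeven_tokens_per_month(600.0, 0.30)
print(f"Break-even at about {tokens / 1e9:.1f} billion tokens per month")
```

Below that volume the hosted endpoint is the cheaper option. The sketch ignores engineering time, which usually pushes the real crossover higher.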


What you need

  • An NVIDIA GPU with enough VRAM for the model you want to run. For meta/llama-3.1-8b-instruct (the model we've been using), expect roughly 16 GB of VRAM. Heavier models want more.
  • Linux (native or WSL2). NIM containers rely on the NVIDIA Container Toolkit; it is what makes the --runtime=nvidia Docker flag work.
  • Docker with the NVIDIA Container Toolkit installed. Test with docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi — it should print your GPU.
  • An NGC API key. The key you already have from build.nvidia.com should work for pulling NIM images; if the pull is rejected, generate one at ngc.nvidia.com.

If you don't have a GPU box on hand, the rest of the workshop still teaches you something useful — the API shape is identical, so when you do get one, the Python client code does not change.
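
If you do have a GPU box, you can check it against the VRAM figure above before pulling anything. This is a rough sketch: it shells out to nvidia-smi, the helper names are hypothetical, and the 16,000 MiB threshold is only a stand-in for the "roughly 16 GB" estimate, not an official requirement:

```python
import shutil
import subprocess

ROUGH_REQUIRED_MIB = 16_000  # stand-in for the "roughly 16 GB" estimate


def parse_gpu_memory(csv_text):
    """Parse 'name, memory.total' CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


def main():
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; is the NVIDIA driver installed?")
        return
    try:
        gpus = query_gpus()
    except subprocess.CalledProcessError as err:
        print(f"nvidia-smi failed: {err}")
        return
    for name, mib in gpus:
        verdict = "ok" if mib >= ROUGH_REQUIRED_MIB else "may be too small"
        print(f"{name}: {mib} MiB ({verdict})")


if __name__ == "__main__":
    main()
```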


Step 1 — Log in to NVIDIA's container registry

export NGC_API_KEY="nvapi-...your-key..."
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin


The literal username $oauthtoken is correct — that's NGC's convention for API-key logins. Don't substitute anything for it.


Step 2 — Pull and run the NIM container

# Create the cache directory first so it is owned by you, not by root
mkdir -p "$HOME/.cache/nim"

docker run -it --rm \
  --name llama-3.1-8b-instruct \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -u "$(id -u)" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest


A few notes:

  • First run is slow. The image is large and the model weights download on first launch. The -v cache mount means subsequent runs are fast.
  • Use the exact image tag from the model's Deploy tab on build.nvidia.com. The example above uses :latest, but pinning a specific version is safer for reproducibility.
  • The container listens on port 8000. That's what -p 8000:8000 exposes to your host.

When the container finishes loading it will log something like Application startup complete. Uvicorn running on http://0.0.0.0:8000. That's your signal that the OpenAI-compatible endpoint is live.
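
If you launch the container from a script instead of watching the logs, you can poll for readiness. The sketch below is one way to do that with the standard library. The helper names, the ten-minute ceiling, and the five-second interval are all assumptions, and it probes /v1/models simply because that is the endpoint this post verifies with curl:

```python
import http.client
import time
import urllib.request


def endpoint_is_up(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or HTTP error: not ready yet
        return False


def wait_until_ready(url, max_wait=600.0, interval=5.0,
                     probe=endpoint_is_up, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `max_wait` seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example call (blocks until the container answers or the wait runs out):
# wait_until_ready("http://localhost:8000/v1/models")
```

The probe and sleep parameters exist only so the loop can be exercised without a running container.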


Step 3 — Verify the endpoint with curl

curl http://localhost:8000/v1/models


You should see a JSON response listing the loaded model. If curl hangs or returns connection-refused, the container hasn't finished loading yet — give it another minute and try again.
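
The same check can be done from Python, which is handy for asserting that the model name your code expects is actually loaded. This sketch assumes the response follows the OpenAI list-models shape (a "data" array of objects with an "id" field); the sample payload is illustrative, not captured from a real container:

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000/v1", timeout=5.0):
    """GET {base_url}/models and return the ids it lists."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


# Illustrative payload in the assumed shape; extra fields are ignored.
sample = {
    "object": "list",
    "data": [{"id": "meta/llama-3.1-8b-instruct", "object": "model"}],
}
print(model_ids(sample))
```

With the container running, fetch_model_ids() should return a list containing the same model name used in the client code.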


Step 4 — Point the Python client at localhost

This is the entire Python change.

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8000/v1',          # ← was 'https://integrate.api.nvidia.com/v1'
    api_key='not-needed-for-local-dev',           # local NIM doesn't validate the key
)

MODEL = 'meta/llama-3.1-8b-instruct'              # same model name as the hosted endpoint

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

print(ask(
    system_prompt='You are a concise USC campus assistant.',
    user_message='What does NVIDIA NIM stand for?',
))


Two lines changed — base_url and api_key. The ask() function is the same one we've been using since Part 1. The campus assistant, the embedding retriever, and the guardrail layers from Parts 2 and 3 all run against this client without any further changes.

The repo's part4_local_nim.py reads NIM_BASE_URL from your environment so the same script runs against the hosted endpoint by default and against local NIM when you set the env var. That makes it easy to A/B the two.
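
For readers who want the same switch in their own script, here is a minimal sketch of the pattern. It is not the repo's actual code: NIM_BASE_URL is the variable named above, while NVIDIA_API_KEY and the client_settings helper are assumed names for illustration:

```python
import os

HOSTED_BASE_URL = "https://integrate.api.nvidia.com/v1"


def client_settings(env=None):
    """Pick base_url and api_key for OpenAI(...) from environment variables.

    NIM_BASE_URL comes from the post; NVIDIA_API_KEY is an assumed name
    for the hosted key and may differ in the repo.
    """
    env = os.environ if env is None else env
    return {
        # Fall back to the hosted endpoint when no override is set
        "base_url": env.get("NIM_BASE_URL", HOSTED_BASE_URL),
        # Local NIM ignores the key, but the client needs a non-empty string
        "api_key": env.get("NVIDIA_API_KEY", "not-needed-for-local-dev"),
    }


print(client_settings({"NIM_BASE_URL": "http://localhost:8000/v1"}))
# With the openai package installed: client = OpenAI(**client_settings())
```

Keeping the choice in one small function means nothing else in the script needs to know which endpoint is in use.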


Step 5 — Same code, two endpoints (the test that matters)

# Hosted run (what we've done in Parts 1-3)
python3 part4_local_nim.py

# Local NIM run — point the same script at the container
NIM_BASE_URL=http://localhost:8000/v1 python3 part4_local_nim.py


Both should produce the same shape of output — the same ask() call, the same model name, just inference happening in a different place. That's the whole point of an OpenAI-compatible API surface — the application code stops caring where the model lives.
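
If predictable latency is your reason for going local, the two runs are worth timing as well as eyeballing. A rough sketch, using a hypothetical median_latency helper and a sleep as a stand-in workload so it runs without any endpoint:

```python
import statistics
import time


def median_latency(call, repeats=5):
    """Run `call()` `repeats` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; swap in a real request such as
# lambda: ask('You are concise.', 'Say hi.') with the client
# pointed at one endpoint, then repeat for the other.
print(f"{median_latency(lambda: time.sleep(0.01)):.3f} s")
```

The median is used rather than the mean so that one slow first request, such as a cold start, does not dominate the comparison.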


When to use which

| Situation | Use |
| --- | --- |
| Workshop, prototype, demo, course project | Hosted (integrate.api.nvidia.com) |
| Sensitive data that can't leave a controlled environment | Local NIM on cluster GPU |
| Latency-critical inner loop, large concurrent load | Local NIM on a sized-up node |
| First-time student, no GPU on hand | Hosted (don't even mention local until they ask) |
| Production with a known traffic profile | Either, depending on cost crossover |
There is no "winner" here. The hosted API and self-hosted NIM are the same product with different deployment footprints. The thing worth internalizing — and what this post is really about — is that your Python code does not have to care.


Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for the hosted version: part4_local_nim.ipynb
Local Python: part4_local_nim.py in the repo. Defaults to the hosted endpoint; set NIM_BASE_URL=http://localhost:8000/v1 to point at a local NIM container.

MIT licensed. I run this at USC against both endpoints — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.


The full series

  • Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
  • Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
  • Part 3: Add Guardrails So Your AI App Doesn't Lie
  • Part 4 (this post): Run NVIDIA NIM on Your Own GPU
  • Part 5: From Chatbot to Agent — tool calling with NIM. Give the model two tiny Python tools, watch it decide which to call. The retriever and the guardrails from earlier parts become tools the agent can reach for.

Follow this series on dev.to (the series widget at the top of each post lists every published part in order).