Tesla P40 in a Homelab: 24GB of Inference on a Budget
Guatu · 2026-05-26 · via DEV Community

The Tesla P40 is a seductive piece of hardware: 24GB of VRAM for a fraction of the cost of a modern RTX card. But after three weeks of fighting with it, I realized that the "budget" part of the equation doesn't include the cost of my sanity. I spent more time debugging QEMU assertion errors and PCI address shifts than I did actually running models.

If you're looking to put a P40 in a Proxmox node to run LLMs, you're likely trying to fit larger models like Qwen2.5:32B into VRAM without spending four figures on an A100 or a 3090. It's a viable path, but the standard way of doing things (GPU passthrough to a VM) is a recipe for instability with this specific card.

The Passthrough Trap

My first instinct was to follow the standard Proxmox pattern: isolate the GPU using vfio-pci and pass it through to a dedicated Ubuntu VM. I've done this before, and usually, it's the right move for isolation. I had my IOMMU groups sorted and the hostpci line configured in the VM config.

It worked for about four hours. Then the P40 decided it didn't want to exist anymore.

The Tesla P40 lacks Function Level Reset (FLR). In a virtualized environment, this means that if the VM crashes or the driver hangs, the GPU doesn't actually reset. The next time you try to boot the VM, you get a QEMU assertion error or a "Device is already in use" message. I found myself hard-rebooting the entire physical node just to get the GPU to respond again. I've written about GPU passthrough gotchas before, but the P40 is particularly aggressive about breaking the happy path.
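Before committing to passthrough, you can check what a card advertises. A minimal sketch, assuming `lspci` from pciutils is installed and run as root (otherwise the capability registers are hidden and the check reports a false negative); `has_flr` is a hypothetical helper name. It only looks for the `FLReset+` flag that `lspci -vv` prints in the device capabilities, so with several NVIDIA devices in the box it reports support if any one of them has it:

```shell
# has_flr: reads `lspci -vv` output on stdin and reports whether any
# device capability line advertises Function Level Reset (FLReset+).
has_flr() {
    if grep -q 'FLReset+'; then
        echo "FLR supported"
    else
        echo "FLR not advertised"
    fi
}

# Usage on the Proxmox host (10de is NVIDIA's PCI vendor ID):
#   lspci -vv -d 10de: | has_flr
```

If the card reports "FLR not advertised", expect the hang-until-reboot behavior described above whenever the guest driver wedges.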

I also hit the PCI address instability issue. After a few reboots and some BIOS tweaks, the card shifted addresses, and my VM config became a lie. I was essentially playing a game of whack-a-mole with my hardware topology.
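One way to catch a shifted address before starting the VM is to look the card up by vendor ID instead of trusting the address in the config. This is a sketch, not what I ran at the time: `nvidia_pci_addrs` is a hypothetical helper, and it assumes the P40 is the only device in the machine with NVIDIA's vendor ID 10de.

```shell
# nvidia_pci_addrs: reads `lspci -Dnn` output on stdin and prints the full
# PCI address (first field) of every device whose vendor ID is 10de.
nvidia_pci_addrs() {
    awk '/\[10de:[0-9a-f]+\]/ { print $1 }'
}

# Usage on the Proxmox host:
#   lspci -Dnn | nvidia_pci_addrs
#   qm config <VM_ID> | grep hostpci
# If the two addresses differ, the VM config is stale.
```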

The Solution: Host-Level Inference

I stopped trying to be "architecturally clean" and decided to run the GPU directly on the Proxmox host. I know, running production-ish workloads on the hypervisor is usually a sin, but the P40 is too unstable in a VM to justify the overhead.

Here is exactly how I moved from a broken passthrough setup to a stable host-level inference engine.

1. Cleaning the Slate

First, I stripped the GPU out of the VM and killed the VFIO isolation. If you've already pinned your GPU to vfio-pci, you need to undo that.

# Remove the PCI device from the VM config
sudo qm set <VM_ID> --delete hostpci0

# Blacklist vfio to stop it from grabbing the card at boot.
# The first tee overwrites /etc/modprobe.d/vfio.conf, which also drops any
# "options vfio-pci ids=..." line that pinned the card.
echo "blacklist vfio_pci" | sudo tee /etc/modprobe.d/vfio.conf
echo "blacklist vfio" | sudo tee -a /etc/modprobe.d/vfio.conf

# Update initramfs and reboot
sudo update-initramfs -u
sudo reboot
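After the reboot, it is worth confirming that vfio-pci has actually let go of the card before installing the host driver. A minimal sketch, assuming `lspci` from pciutils is available and that filtering on NVIDIA's vendor ID 10de is enough to single out the P40; `driver_in_use` is a hypothetical helper name:

```shell
# driver_in_use: reads `lspci -nnk` output on stdin and prints the value of
# each "Kernel driver in use:" line (prints nothing if no driver is bound).
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Usage on the host:
#   lspci -nnk -d 10de: | driver_in_use
# "vfio-pci" means the card is still isolated for passthrough.
```

Empty output at this stage is fine: it just means nothing has claimed the card yet.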


2. Host Driver Installation

I installed the NVIDIA 535 drivers directly on the Proxmox host. I chose 535 because it's stable with the P40's Pascal architecture.

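sudo apt update
# Package name as I installed it; it depends on which repository provides
# the driver, so check `apt search nvidia-driver` if this one is not found.
# Building the module also needs headers for the running Proxmox kernel.
sudo apt install nvidia-driver-535
# Verify the card is seen and the driver is loaded
sudo nvidia-smi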


3. Deploying Ollama as a Systemd Service

Instead of wrapping Ollama in a container on the host (which adds another layer of driver mapping pain), I deployed it as a systemd service. This ensures it starts on boot and has direct access to the GPU without runtime overhead.

I created a service file at /etc/systemd/system/ollama.service:

[Unit]
Description=Ollama
After=network.target

[Service]
User=ollama
Group=ollama
WorkingDirectory=/opt/ollama
ExecStart=/opt/ollama/ollama serve
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_KEEP_ALIVE=30s"
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target


I set OLLAMA_HOST=0.0.0.0 so my other nodes in the cluster could hit the API, and OLLAMA_KEEP_ALIVE=30s to ensure the model unloads from VRAM quickly when not in use, leaving room for other tasks.

The VRAM Reality Check

With 24GB of VRAM, the P40 is a beast for its age, but it's not infinite. When I tried running Qwen2.5:32B, I noticed a massive performance drop as soon as the context window grew.

The issue isn't the model weights; it's the KV cache, the per-token attention state that grows linearly with context length. If you allocate almost all 24GB to the model weights, there's no room left for this "memory" of the conversation. In my runs that showed up as the model hallucinating or simply timing out.
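To put rough numbers on that, the cache size is 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context tokens. The sketch below is back-of-the-envelope arithmetic, not a measurement. The shape values in the usage comment (64 layers, 8 KV heads, head dimension 128, 2-byte fp16 cache) are my assumption for Qwen2.5:32B taken from the model's published config; verify them for the exact build you pull. `kv_cache_bytes` is a hypothetical helper name.

```shell
# kv_cache_bytes LAYERS KV_HEADS HEAD_DIM BYTES_PER_ELEM CONTEXT_TOKENS
# Keys and values are both cached, hence the leading factor of 2.
kv_cache_bytes() {
    echo $(( 2 * $1 * $2 * $3 * $4 * $5 ))
}

# Usage, with the assumed Qwen2.5:32B shape and an fp16 cache:
#   kv_cache_bytes 64 8 128 2 8192     # 8k context
#   kv_cache_bytes 64 8 128 2 32768    # 32k context
```

Under those assumptions each token costs 256 KiB, so an 8k context needs 2 GiB and a 32k context needs 8 GiB. At 4 bits per weight, 32 billion parameters are already about 16 GB before any runtime overhead, so the 32k case cannot fit next to the weights on a 24GB card.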

To fix this, I had to use a more aggressive quantization (4-bit) and limit the context window. If you're running these models for AI agent orchestration, you need to be careful with the system prompts. A massive system prompt eats into your available VRAM before the first token is even generated.
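With Ollama, the context limit can be set per request through the `num_ctx` option of the generate endpoint. A minimal sketch, assuming Ollama's default port 11434 and a prompt that contains no characters needing JSON escaping (quotes, backslashes, newlines); `ollama_payload` is a hypothetical helper, and 4096 is an illustrative value, not the one I settled on:

```shell
# ollama_payload MODEL PROMPT NUM_CTX
# Prints a JSON body for POST /api/generate with a capped context window.
# No JSON escaping is done, so keep the prompt to plain text.
ollama_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' \
        "$1" "$2" "$3"
}

# Usage from another node in the cluster:
#   ollama_payload qwen2.5:32b "Summarize this log line" 4096 |
#       curl -s http://<PROXMOX_HOST>:11434/api/generate -d @-
```

Building the body in one place makes it harder for an agent framework to quietly request a larger window than the card can hold.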

Monitoring the Blind Spot

The biggest problem with running a GPU on the host is that you lose the visibility you get in a managed Kubernetes environment. nvidia-smi is great for a quick check, but it's useless for long-term stability monitoring.

I normally deploy nvidia_gpu_exporter as a DaemonSet on my Kubernetes cluster. Since this GPU now lives on the hypervisor rather than on a cluster node, I run the exporter as a standalone binary on the Proxmox node instead and have my Prometheus instance scrape it there.

If you're still using K8s for your GPU workloads, the standard NVIDIA device plugin isn't enough for real monitoring. You need the exporter to see things like temperature and power draw. For the P40, this is critical because it's a passive card. If your fans aren't dialed in, it will thermal throttle in seconds.
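For a spot check on the host between scrapes, `nvidia-smi` can emit the same numbers as CSV. The sketch below assumes the query fields `temperature.gpu`, `power.draw`, `memory.used` and `memory.total` in that order; `gpu_status` is a hypothetical helper, and the 80 in the usage line is the temperature at which I saw throughput fall off, not a vendor limit:

```shell
# gpu_status LIMIT_C
# Reads "temp, power, mem_used, mem_total" CSV lines on stdin (one per GPU)
# and prints a one-line summary, flagging cards at or above LIMIT_C.
gpu_status() {
    awk -F', *' -v limit="$1" '{
        state = ($1 + 0 >= limit + 0) ? "HOT" : "ok"
        printf "temp=%dC power=%sW vram=%d/%dMiB %s\n", $1, $2, $3, $4, state
    }'
}

# Usage on the Proxmox host:
#   nvidia-smi --query-gpu=temperature.gpu,power.draw,memory.used,memory.total \
#       --format=csv,noheader,nounits | gpu_status 80
```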

For those running the exporter in K8s, here is the manifest I use:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-gpu-exporter
  template:
    metadata:
      labels:
        app: nvidia-gpu-exporter
    spec:
      containers:
      - name: exporter
        # Confirm this image exists in your registry and pin a version tag instead of latest.
        image: nvidia/gpu-exporter:latest
        ports:
        - containerPort: 9835
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"


Why This Actually Works

The reason the host-level approach wins is simple: it eliminates the translation layer. When you pass a GPU through, you're relying on the IOMMU and the hypervisor to handle memory mapping and interrupts. The P40's lack of FLR means that any failure in that chain is permanent until a cold boot.

By running on the host, the NVIDIA driver has a direct line to the hardware. If the driver crashes, you can often reload the kernel module without rebooting the entire machine. It's a trade-off: you lose the "clean" separation of a VM, but you gain a system that actually stays online.
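The reload itself can be sketched as follows. It assumes the service is named `ollama` as in the unit above, that nothing else is holding the GPU (a persistence daemon or a monitoring exporter would block the unload), and that the usual four NVIDIA modules are the ones loaded. `run` and `reload_nvidia` are hypothetical helper names; set `DRY_RUN=1` to print the commands instead of executing them.

```shell
# run: executes its arguments, or only prints them when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reload_nvidia: stop the GPU consumer, unload and reload the driver stack,
# check the card, then bring the service back. Stops at the first failure.
reload_nvidia() {
    run systemctl stop ollama &&
    run modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia &&
    run modprobe nvidia &&
    run nvidia-smi &&
    run systemctl start ollama
}

# Usage (as root):
#   DRY_RUN=1; reload_nvidia    # preview
#   DRY_RUN=0; reload_nvidia    # do it
```

If the unload step fails because the module is in use, something still has the device open; that is the case where a reboot is still needed.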

Lessons Learned

If I had to do this again, I would have skipped the VM phase entirely. The documentation for Proxmox GPU passthrough is great for cards that support FLR, but it's misleading for older Tesla cards.

A few other things to watch out for:

  1. Cooling is not optional. The P40 is a passively cooled card designed for server chassis with high-static-pressure fans. In a homelab case, you need a 3D-printed shroud and a high-RPM fan bolted directly to the heatsink. If the card hits 80C, your tokens-per-second will plummet.
  2. Driver mismatches. I hit a wall where nvidia-smi failed after a Proxmox kernel update. This usually happens when the NVIDIA kernel module has not been rebuilt for the new kernel, or when the loaded module and the userspace libraries end up at different versions. Always check your dkms status after a dist-upgrade.
  3. VRAM is the only metric that matters. Don't get distracted by CUDA core counts. For inference, the 24GB VRAM is the only reason to buy this card. If you can afford a 3090, buy the 3090. The P40 is for those of us who want the most VRAM for the least amount of money and are willing to fight the OS to get it.
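The dkms check from the second point can be turned into a one-liner to run after every upgrade. A sketch under assumptions: the `dkms status` line format differs between dkms versions, so this only looks for a line that mentions nvidia, the given kernel release, and the word "installed"; `nvidia_built_for` is a hypothetical helper name.

```shell
# nvidia_built_for KERNEL_RELEASE
# Reads `dkms status` output on stdin and reports whether an nvidia module
# is marked installed for the given kernel. Returns non-zero if not.
nvidia_built_for() {
    kernel=$1
    if grep -i 'nvidia' | grep -F -- "$kernel" | grep -q 'installed'; then
        echo "nvidia module installed for $kernel"
    else
        echo "no nvidia module for $kernel"
        return 1
    fi
}

# Usage on the Proxmox host, after a dist-upgrade and reboot:
#   dkms status | nvidia_built_for "$(uname -r)"
```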

The P40 is a fantastic way to get into local LLMs, provided you're okay with treating your hypervisor as a workstation. It's not the "correct" way to build a cluster, but it's the way that actually works.