Tesla P40 in a Homelab: 24GB of Inference on a Budget

The Tesla P40 is a seductive piece of hardware: 24GB of VRAM for a fraction of the cost of a modern RTX card. But after three weeks of fighting with it, I realized that the "budget" part of the equation doesn't include the cost of my sanity. I spent more time debugging QEMU assertion errors and PCI address shifts than I did actually running models.

If you're looking to put a P40 in a Proxmox node to run LLMs, you're likely trying to fit larger models like Qwen2.5:32B into VRAM without spending four figures on an A100 or a 3090. It's a viable path, but the standard way of doing things (GPU passthrough to a VM) is a recipe for instability with this specific card.

The Passthrough Trap

My first instinct was to follow the standard Proxmox pattern: isolate the GPU using vfio-pci and pass it through to a dedicated Ubuntu VM. I've done this before, and usually, it's the right move for isolation. I had my IOMMU groups sorted and the hostpci line configured in the VM config.

It worked for about four hours. Then the P40 decided it didn't want to exist anymore.

The Tesla P40 lacks Function Level Reset (FLR). In a virtualized environment, this means that if the VM crashes or the driver hangs, the GPU doesn't actually reset. The next time you try to boot the VM, you get a QEMU assertion error or a "Device is already in use" message. I found myself hard-rebooting the entire physical node just to get the GPU to respond again. I've written about GPU passthrough gotchas before, but the P40 is particularly aggressive about breaking the happy path.
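
One way to check whether a card advertises FLR before committing to passthrough is to look at the PCIe Device Capabilities in `lspci -vv`, where the flag appears as `FLReset+` or `FLReset-`. The helper below is a minimal sketch: `pci_has_flr` is a name made up for this example, and it only inspects the text it is given.

```shell
# Reads `lspci -vv` output for ONE device on stdin and prints "yes" if the
# PCIe Device Capabilities advertise Function Level Reset, otherwise "no".
pci_has_flr() {
    if grep -q 'FLReset+'; then
        echo yes
    else
        echo no
    fi
}
```

Run it as root (unprivileged users do not see the capability list) against your own device, for example `lspci -vv -s 0000:03:00.0 | pci_has_flr`. The address is a placeholder.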

I also hit the PCI address instability issue. After a few reboots and some BIOS tweaks, the card shifted addresses, and my VM config became a lie. I was essentially playing a game of whack-a-mole with my hardware topology.
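
Because the bus address can move, it is safer to look the card up by its vendor:device ID than to trust a hardcoded address. The sketch below assumes the P40 reports itself as `10de:1b38` (10de is NVIDIA's vendor ID; confirm the device ID with `lspci -nn` on your own hardware). `pci_addr_by_id` is a hypothetical helper name.

```shell
# Reads `lspci -D -nn` output on stdin and prints the full PCI address
# (domain:bus:device.function) of every device whose [vendor:device] ID is $1.
pci_addr_by_id() {
    grep -i -F "[$1]" | awk '{ print $1 }'
}
```

Usage: `lspci -D -nn | pci_addr_by_id 10de:1b38`. Comparing the result with the hostpci line in the VM config shows immediately whether the config has gone stale.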

The Solution: Host-Level Inference

I stopped trying to be "architecturally clean" and decided to run the GPU directly on the Proxmox host. I know, running production-ish workloads on the hypervisor is usually a sin, but the P40 is too unstable in a VM to justify the overhead.

Here is exactly how I moved from a broken passthrough setup to a stable host-level inference engine.

1. Cleaning the Slate

First, I stripped the GPU out of the VM and killed the VFIO isolation. If you've already pinned your GPU to vfio-pci, you need to undo that.

# Remove the PCI device from the VM config
qm set <VM_ID> --hostpci0 ''

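# Blacklist vfio to stop it from grabbing the card at boot.
# Note: the first tee overwrites vfio.conf, which also drops any
# "options vfio-pci ids=..." line that pinned the card.
echo "blacklist vfio_pci" | sudo tee /etc/modprobe.d/vfio.conf
echo "blacklist vfio" | sudo tee -a /etc/modprobe.d/vfio.conf

# If you blacklisted the nvidia modules for passthrough, remove those
# entries from /etc/modprobe.d/ too, or the host driver will not load.

# Update initramfs and reboot
sudo update-initramfs -u
sudo reboot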
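
After the reboot it is worth confirming that vfio-pci really let go of the card. The sketch below pulls the bound driver name out of `lspci -k` output. `pci_bound_driver` is a hypothetical helper name.

```shell
# Reads `lspci -k` output for one device on stdin and prints the name of the
# kernel driver currently bound to it (no output means no driver is bound).
pci_bound_driver() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p'
}
```

Usage: `lspci -k -s 0000:03:00.0 | pci_bound_driver`, with your own address in place of the placeholder. At this point the output should be anything other than `vfio-pci`. Once the host driver is installed in the next step it should read `nvidia`.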

2. Host Driver Installation

I installed the NVIDIA 535 drivers directly on the Proxmox host. I chose 535 because it's stable with the P40's Pascal architecture.

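# DKMS needs the headers for the running Proxmox kernel to build the module
# (the pve-headers or proxmox-headers package, depending on your release).
sudo apt update
# The exact package name depends on which NVIDIA/non-free repository is enabled
sudo apt install nvidia-driver-535
# Verify the card is seen and the driver is loaded
sudo nvidia-smi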

3. Deploying Ollama as a Systemd Service

Instead of wrapping Ollama in a container on the host (which adds another layer of driver mapping pain), I deployed it as a systemd service. This ensures it starts on boot and has direct access to the GPU without runtime overhead.

I created a service file at /etc/systemd/system/ollama.service:

[Unit]
Description=Ollama
After=network.target

[Service]
User=ollama
Group=ollama
WorkingDirectory=/opt/ollama
ExecStart=/opt/ollama/ollama serve
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_KEEP_ALIVE=30s"
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

I set OLLAMA_HOST=0.0.0.0 so my other nodes in the cluster could hit the API, and OLLAMA_KEEP_ALIVE=30s to ensure the model unloads from VRAM quickly when not in use, leaving room for other tasks.
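
The unit file does nothing until systemd knows about it, and it expects an `ollama` user to exist. A minimal sketch of the activation step follows. It assumes the binary is already at /opt/ollama/ollama as in the unit above. The `run` and `activate_ollama` helpers are made up for this example, and `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Creates the service account if it is missing, then registers and starts the unit.
activate_ollama() {
    id ollama >/dev/null 2>&1 ||
        run useradd --system --home-dir /opt/ollama --shell /usr/sbin/nologin ollama
    run systemctl daemon-reload
    run systemctl enable --now ollama.service
}
```

Preview with `DRY_RUN=1 activate_ollama`, then run `activate_ollama` as root. From another node, `curl http://<proxmox-host>:11434/api/tags` should then return the list of installed models, 11434 being Ollama's default port.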

The VRAM Reality Check

With 24GB of VRAM, the P40 is a beast for its age, but it's not infinite. When I tried running Qwen2.5:32B, I noticed a massive performance drop as soon as the context window grew.

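The issue isn't the model weights alone; it's the KV cache, which grows linearly with the context length. If you allocate almost all 24GB to the model weights, there's no room left for the "memory" of the conversation. When weights and cache together exceed VRAM, layers spill over to system RAM and the CPU, which matches the slowdown I saw; in the worst case requests simply time out.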
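
The size of the cache can be estimated with simple arithmetic: two tensors (keys and values) per layer, each storing kv_heads × head_dim values per token. The sketch below assumes Qwen2.5-32B uses 64 layers, 8 KV heads, and a head dimension of 128, with a 16-bit cache. Those figures are my reading of the published model configuration, so check them against the model card. `kv_cache_mib` is a hypothetical helper name.

```shell
# KV cache size in MiB: 2 tensors (K and V) per layer, each holding
# kv_heads * head_dim values per token, at bytes_per_value bytes each.
# Args: layers kv_heads head_dim context_tokens bytes_per_value
kv_cache_mib() {
    echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1024 / 1024 ))
}
```

Under those assumptions, `kv_cache_mib 64 8 128 8192 2` prints 2048 (2 GiB), while the same call with 32768 tokens prints 8192 (8 GiB). That is a large bite out of 24GB when a 4-bit 32B model already occupies most of the card.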

To fix this, I had to use a more aggressive quantization (4-bit) and limit the context window. If you're running these models for AI agent orchestration, you need to be careful with the system prompts. A massive system prompt eats into your available VRAM before the first token is even generated.
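
With Ollama, the context window can be capped per request through the `num_ctx` option of the `/api/generate` endpoint. A minimal sketch of building such a request body follows. `ollama_payload` is a hypothetical helper, and the 8192 value in the usage line is only an illustration, not a tuned setting.

```shell
# Prints a JSON body for Ollama's POST /api/generate endpoint.
# Args: model prompt num_ctx
# Note: the prompt is inserted verbatim, so it must not contain double
# quotes, backslashes, or newlines (no JSON escaping is done here).
ollama_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' \
        "$1" "$2" "$3"
}
```

Usage: `curl -s http://localhost:11434/api/generate -d "$(ollama_payload qwen2.5:32b 'Say hi' 8192)"`. For real prompts, build the JSON with a proper tool such as jq rather than printf.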

Monitoring the Blind Spot

The biggest problem with running a GPU on the host is that you lose the visibility you get in a managed Kubernetes environment. nvidia-smi is great for a quick check, but it only shows the current moment, which makes it useless for long-term stability monitoring.

I already run nvidia_gpu_exporter as a DaemonSet on my Kubernetes cluster, but the P40 now lives on the Proxmox host, outside the cluster. So I run the exporter there as a standalone binary and have my Prometheus instance scrape it.

If you're still using K8s for your GPU workloads, the standard NVIDIA device plugin isn't enough for real monitoring. You need the exporter to see things like temperature and power draw. For the P40, this is critical because it's a passively cooled card with no fan of its own. If your airflow isn't dialed in, it will thermal throttle in seconds.
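
For a quick threshold check without the full Prometheus stack, the same numbers are available from `nvidia-smi --query-gpu`. The sketch below warns when the temperature reaches a limit. `gpu_temp_check` is a hypothetical helper, and the 80C limit in the usage line is the throttling point I mention in the lessons below, not an official figure.

```shell
# Reads "temperature, power, memory_used" CSV lines on stdin and prints a
# warning for every reading at or above the limit given in $1.
# Exits non-zero if any reading was too hot, so it can drive a cron alert.
gpu_temp_check() {
    awk -F', *' -v limit="$1" '
        $1 + 0 >= limit + 0 {
            printf "WARN temp=%sC power=%sW mem=%sMiB\n", $1, $2, $3
            hot = 1
        }
        END { exit hot ? 1 : 0 }
    '
}
```

Usage: `nvidia-smi --query-gpu=temperature.gpu,power.draw,memory.used --format=csv,noheader,nounits | gpu_temp_check 80`.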

For those running the exporter in K8s, here is the manifest I use:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-gpu-exporter
  template:
    metadata:
      labels:
        app: nvidia-gpu-exporter
    spec:
      containers:
      - name: exporter
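        # Verify this image name before deploying: the upstream nvidia_gpu_exporter
        # project publishes as utkuozdemir/nvidia_gpu_exporter. Pin a version tag
        # instead of :latest once you have confirmed it.
        image: nvidia/gpu-exporter:latest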
        ports:
        - containerPort: 9835
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"

Why This Actually Works

The reason the host-level approach wins is simple: it eliminates the translation layer. When you pass a GPU through, you're relying on the IOMMU and the hypervisor to handle memory mapping and interrupts. The P40's lack of FLR means that any failure in that chain is permanent until a cold boot.

By running on the host, the NVIDIA driver has a direct line to the hardware. If the driver crashes, you can often reload the kernel module without rebooting the entire machine. It's a trade-off: you lose the "clean" separation of a VM, but you gain a system that actually stays online.
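
A driver reload on the host can be sketched as follows. It assumes Ollama is the only process holding the GPU and that the standard module names (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia) apply. If something like nvidia-persistenced is running, it has to be stopped first. `run` and `reload_nvidia` are hypothetical helpers, and `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stops the GPU consumer, unloads and reloads the NVIDIA modules, restarts it.
reload_nvidia() {
    run systemctl stop ollama.service
    # Dependent modules are listed before the core nvidia module.
    run modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
    run modprobe nvidia
    run modprobe nvidia_uvm
    run systemctl start ollama.service
}
```

Preview with `DRY_RUN=1 reload_nvidia`, then run `reload_nvidia` as root. If the unload fails because the driver is truly wedged, a reboot is still the fallback.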

Lessons Learned

If I had to do this again, I would have skipped the VM phase entirely. The documentation for Proxmox GPU passthrough is great for cards that support FLR, but it's misleading for older Tesla cards.

A few other things to watch out for:

Cooling is not optional. The P40 is designed for server chassis with high-static-pressure fans. In a homelab case, you need a 3D-printed shroud and a high-RPM fan bolted directly to the heatsink. If the card hits 80C, your tokens-per-second will plummet.
Driver Mismatches. I hit a wall where nvidia-smi failed after a Proxmox kernel update. This usually happens when the new kernel boots without a matching NVIDIA module, either because DKMS could not rebuild it (often missing kernel headers) or because the loaded module and the userspace libraries ended up at different versions. Always check your dkms status after a dist-upgrade.
VRAM is the only metric that matters. Don't get distracted by CUDA core counts. For inference, the 24GB VRAM is the only reason to buy this card. If you can afford a 3090, buy the 3090. The P40 is for those of us who want the most VRAM for the least amount of money and are willing to fight the OS to get it.
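
The dkms check from the driver mismatch point above can be scripted. The sketch below reads `dkms status` output and succeeds only if an nvidia module is installed for a given kernel release. `nvidia_dkms_ok` is a hypothetical helper, and the exact line format of `dkms status` varies between dkms versions, so treat the matching as an approximation.

```shell
# Reads `dkms status` output on stdin; exits 0 only if a line for an nvidia
# module mentions kernel release $1 and is marked "installed".
nvidia_dkms_ok() {
    awk -v k="$1" '
        tolower($0) ~ /^nvidia/ && index($0, k) && /installed/ { ok = 1 }
        END { exit ok ? 0 : 1 }
    '
}
```

Usage after an upgrade, before rebooting into the new kernel: `dkms status | nvidia_dkms_ok "<new-kernel-release>" || echo "nvidia module not built for that kernel"`.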

The P40 is a fantastic way to get into local LLMs, provided you're okay with treating your hypervisor as a workstation. It's not the "correct" way to build a cluster, but it's the way that actually works.
