





















I set up an inference server so I can hit my own open-weight models from my laptop anywhere, with nothing exposed to the public internet. Sharing in case it's useful to others, and to hear how people are doing this differently.
The request path: Client (laptop on Tailscale) → Tailscale Aperture (AI gateway — auth + routes by model name) → llama-swap → vLLM → GPU
What I like about it: - Access runs over Tailscale, so it's end-to-end encrypted and gated by OAuth. No open ports and no reverse proxy to babysit.
- llama-swap loads models on demand: if the requested model isn't running, it starts a vLLM child process, and if a model sits idle for ~5 min, it kills it to free VRAM. Useful when juggling models on one box.
- vLLM handles inference (currently Qwen3.6 27B).
I can also just SSH in to work directly on the GPU — adding models, fine-tuning, and so on.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。