惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
人人都是产品经理
人人都是产品经理
Cisco Talos Blog
Cisco Talos Blog
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
V
V2EX
博客园 - 三生石上(FineUI控件)
Martin Fowler
Martin Fowler
WordPress大学
WordPress大学
D
Docker
S
SegmentFault 最新的问题
博客园 - 聂微东
美团技术团队
Apple Machine Learning Research
Apple Machine Learning Research
月光博客
月光博客
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Last Week in AI
Last Week in AI
M
MIT News - Artificial intelligence
F
Fortinet All Blogs
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
The GitHub Blog
The GitHub Blog
GbyAI
GbyAI
L
LangChain Blog
Vercel News
Vercel News
博客园 - 叶小钗
MongoDB | Blog
MongoDB | Blog
Stack Overflow Blog
Stack Overflow Blog
H
Help Net Security
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
The Cloudflare Blog
Engineering at Meta
Engineering at Meta
T
Threat Research - Cisco Blogs
T
Threatpost
Scott Helme
Scott Helme
T
Tailwind CSS Blog
Latest news
Latest news
Stack Overflow Blog
Stack Overflow Blog
Blog — PlanetScale
Blog — PlanetScale
The Register - Security
The Register - Security
罗磊的独立博客
P
Proofpoint News Feed
腾讯CDC
S
Schneier on Security
雷峰网
雷峰网
A
About on SuperTechFans
T
Tenable Blog
F
Full Disclosure
Cyberwarzone
Cyberwarzone
博客园_首页
有赞技术团队
有赞技术团队
K
Kaspersky official blog

DEV Community

How I built a dependency risk scanner with Coral in 7 days 2487. Remove Nodes From Linked List C_STD : A Leak-Free, Cross-Platform Standard Library for Modern C How to build your professional network as a developer — authentic strategies The Pope and the Dynamo Building ShouldWeAutomate: A Decision Intelligence Platform for Workflow Automation The Reputation Layer: Why Developers Quietly Run Corporate PR The Last Mile of Software Is a Sentence AppView 1.0.0 Released: Instrument and Secure Your LLM Deployments The Hermes Rescue: How an Open Agent Rebuilt My GitHub Projects from Scratch S2 — Heap Corruption Crashes: How to Diagnose and Fix Them I built a Chrome extension because I couldn't stop opening Twitter between Pomodoro sessions AI cheating in technical interviews is invisible to interviewers — here's how we detect it Lean4 Might Be the Missing Piece in AI: Why Theorem Provers Are Suddenly Everywhere The Zero-Drift API Series: Stop Trusting a Green Build You Can't Explain How I Deployed My First Project on AWS (And Didn't Break Everything) How I Built a Real-Time Quiz Platform with Next.js, WebSockets, and Learning Science When Your VPS Blocks Outbound SMTP: What Actually Helps Los agentes de código necesitan memoria durable, no solo contexto Cognitive Architectures of AGI: 7 Patterns That Transform LLMs from Oracles into Thinkers I Built a Chat App That Deletes Itself (Because I Was Bored at 2am) Uncovering the Power of Linux's History Command How to Add a Contact Form to Your Ghost Blog Accept Payments in Minutes with Afriex Checkout Sessions Hermes Agent Gets Smarter Every Day. So Does the Bill. How I get Next.js sites to load almost instantly — a practical checklist Treasure Hunt Engine: Why One Bad Prometheus Rule Sank the Whole Veltrix Event Test a DNS Leak in 2 Minutes: Complete Methodology + Per-OS Fixes (2026) Lessons from building a Chrome extension Rivet: A library i made in 2 days I Built a Speech-to-Text Tool Because Sometimes Typing Just Gets in the Way How I'm Building a Multi-Agent Crew for AI Coding Supervision (Cipher Update) Your AI Agent Needs a Manager, Not a Superhero I Built CausalLens — A Free, Open-Source Causal Impact Calculator for Time Series (5 Methods, Zero Setup) How to write good commit messages and pull requests — a team guide Cipher: The Jarvis with a Hermes Core How to build a second brain with Obsidian and Claude Code (step by step) Claude completed my MPI assignment. Then it couldn't run it. So I built the missing piece. This 100% How Our Document Ingestion Pipeline Turns Files into LLM-Ready Markdown Agentic AI Model Risk Management: Aligning with Regulatory Expectations CTV Fraud Has an IPv6 Business Problem The great AI enshittification The Veltrix Treasure Hunt Engine: Why Our First Rewrite Cost Us 3.2 Million Requests Per Second I Made My AI Models Argue, Then Let Hermes Be the Judge Road To KiwiEngine #4: The Racecar Driver Analogy Run Aider on Ollama, Bedrock, or Any LLM Provider — One Gateway, Every Model BAIXAR VÍDEO DO YOUTUBE Releasing HeliosProxy, The programmable Postgres data-plane Hello, DEV Community! 👋 Three Bitcoin Primitives That Don't Exist Anywhere Else (PoW Beacon, DLC Oracle, Fair-Launch Rune) Append-only doesn't mean what you'd hope Notes from the Mistral AI Now Summit Are Claude skills safe in 2026? What the Snyk ToxicSkills audit actually found How to not Lose $500M via API Bills: Run Private AI for 100 Engineers Under $1 Million The Unlikely Journey from Bricks to Bytes Three TODOs, three weeks, one weekend: finishing pq v0.14 Server-Side WebRTC Noise Reduction with Pion, FFmpeg, and RNN Models Autonomous AI Agents in Cryptocurrency Portfolio Management IDOR BugBounty Labs: 5 Realistic Challenges to Master Insecure Direct Object Reference IDOR Lab: The Bug Bounty Training Platform That Doesn't Hold Your Hand ZentriqGuard — Hermes Agent-Powered Zero-Trust Access Auditor Why Artistic QR Codes Silently Fail (And How I'm Trying to Fix It) How I Built and Monetized a Currency Exchange Rate API with FastAPI, Deployed it on Render, and Published it on RapidAPI. The 7 Best Reddit Scrapers in 2026 (Free & Paid, Tested) An AI runs my company. A solo dev vibe-coded $15K in a week — we made $[X]. A cold autopsy. I am new here Stop Pasting Your Code Into ChatGPT For Debugging—Run LLMs Locally Instead 5 Free JSON Tools Every Developer Should Bookmark Building reqlog: a Go CLI for tracing request flows across logs (files, Docker, SSH) Environment Variables in Node.js — What They Are, How dotenv Works, and Why Getting This Wrong Can Ruin You I Built a Zero-Dependency Discord.js Package That Creates Temporary Voice Channels Automatically Goodbye CSV Nightmares: Automating Magento Order Line Item Exports in Google Sheets Nexthena — A Local-First Whiteboard App Built on Excalidraw How we built an platform to solve the "finding a photographer" problem 5 Failure Modes I Found in My Financial RAG (And the One That Actually Mattered) From Logic to Numbers: A Beginner’s Guide to Programming Through Mathematical Thinking Oracle Fusion Report Scheduling with Skip Conditions AtCoder Beginner Contest 460 参加記録と解答例 (A D問題) Your AI Agent Just Crashed at Step 9 of 12. Here's How to Make That Not Matter. Grokking the System Design Interview: Why the Original Course Still Wins Outbox Pattern Solves Publishing. Inbox Pattern Solves Processing. Why autism hasn't disappeared — a hypothesis Por que eu parei de usar Cloudinary e construí minha própria API de imagens How to Test if Your Proxy is Leaking DNS: 2026 Setup Guide AWS VPC Networking — Public Subnet, Private Subnet ve 3-Tier Mimari MediaNote: a note-taking app inside VS code I built a sovereign self-healing AI development system from scratch using Hyperdimensional Computing — no LLMs, no cloud, no APIs WordPress vs. Next.js: benchmark real pe Core Web Vitals (și de ce plugin-urile de cache nu rezolvă problema) ai, deepseek, machinelearning I Gave My Dead Raspberry Pi to an AI Agent. It Fixed Everything Over SSH. How I Built a Google Shopping Scraper with Python & Playwright I Turned Hermes Agent into a Verifiable Agent Operating System The 5 Systematic Failure Modes of AI Research Reports (and How to Catch Them) Stop Saying 'Great!'—Build a Real AI Interview Coach with Claude Code Simple SQL Tool What is DevOps? A Plain English Guide for Beginners Why ChatGPT sucks at generating Types (and how I fixed it) Modelling a codebase as a requirements ontology in Neo4j, keeping AI coding agents oriented AI Is Doing the Work of Junior Developers — And Nobody Is Talking About What Happens in 7 Years
Local-first: a Model on Your Own Machine, Zero Cloud
Dale Nguyen · 2026-05-31 · via DEV Community

This is the concrete, runnable walkthrough for Post 1 of the Portway series. The goal: stand up a single model behind an OpenAI-compatible endpoint on hardware you already own, call it from the official OpenAI SDK, and internalize the stateless contract. Everything here runs locally for $0.

What this post covers

  • A demo.py script with two blocks:
    1. Round-trip — one chat call via the OpenAI SDK, printing the content and the usage object.
    2. Stateless proof — the same final question sent as a 1-turn message and as the last turn of a 5-turn fabricated history; both prompt_tokens values are printed alongside an explanation of the delta.

Engine choice on this machine

Apple Silicon Mac, 48 GB unified memory, Ollama already installed. The demo uses Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1 and the gpt-oss:20b model (~14 GB).

The wider Portway series uses llama.cpp on Mac (Ollama is called out as problematic for Qwen3.5 in Post 2). For Post 1 — one model, prove the contract — Ollama is fine and already on the box.

Model options by available RAM

The demo script works with any Ollama-served model — just substitute the model name in demo.py. The table below covers machines from 9 GB unified memory upward.

Model Pull command Approx size Min RAM Notes
llama3.2:3b ollama pull llama3.2:3b ~2 GB 8 GB Fastest; good for testing the contract
gemma3:4b ollama pull gemma3:4b ~3 GB 8 GB Google; solid instruction-following
mistral:7b ollama pull mistral:7b ~4.1 GB 8 GB Classic 7B baseline
llama3.1:8b ollama pull llama3.1:8b ~4.7 GB 9 GB Best quality under 10 GB
qwen2.5:7b ollama pull qwen2.5:7b ~4.4 GB 9 GB Strong at instruction + reasoning
gpt-oss:20b ollama pull gpt-oss:20b ~14 GB 24 GB Used in this post's sample output

On a 9 GB machine, replace gpt-oss:20b in demo.py with llama3.1:8b or qwen2.5:7b — the contract demonstration is identical.

Prerequisites

  • Ollama running locally (curl -s http://localhost:11434/api/tags should return JSON)
  • uv installed (uv --version)
  • The model pulled. This post uses gpt-oss:20b (requires ~24 GB RAM); see Model options by available RAM for lighter alternatives on 9 GB+ machines.
ollama pull llama3.2:3b

Enter fullscreen mode Exit fullscreen mode

Run it

From the repo root:

uv sync                                  # creates .venv at root, installs deps
uv run --project 1-local-first python 1-local-first/demo.py

Enter fullscreen mode Exit fullscreen mode

Sample output

A real run on this machine (M4-class Mac, 48 GB, gpt-oss:20b via Ollama). Numbers will differ with smaller models — prompt_tokens for the same input stays deterministic regardless of model:

============================================================
Block 1 — round-trip via OpenAI SDK against localhost
============================================================
content: Toronto, Vancouver, Montreal.
usage:   CompletionUsage(completion_tokens=43, prompt_tokens=72, total_tokens=115, ...)

============================================================
Block 2 — same final question, 1-turn vs 5-turn history
============================================================
1-turn response: The capital of Canada is **Ottawa**.
5-turn response: The capital of Canada is **Ottawa**, located in the province of Ontario.

1-turn prompt_tokens: 75
5-turn prompt_tokens: 139
delta:                64

Why the delta exists: the server holds NO conversation state between
requests. The 5-turn call's prompt_tokens is higher only because the
client re-sent the full history in the request body. Each call is
evaluated from scratch — history is the client's responsibility.

Enter fullscreen mode Exit fullscreen mode

completion_tokens and the response text will vary run-to-run (sampling is non-deterministic at default temperature). prompt_tokens for the same input is deterministic — 75 and 139 should reproduce.

Notice how the 5-turn response picks up the road-trip context ("located in the province of Ontario") while the 1-turn answer riffs on the bare "Driving." in its prompt — same model, different framing in the client-supplied messages.

The stateless contract, explained

This is the most important concept in the series. Every request to an LLM API — local or cloud — is evaluated from scratch. The server has no memory of previous turns. When you send a multi-turn conversation, you are the one re-sending the full history in the request body. The model sees it all at once.

The server's only "memory" between requests is the prefix cache (a compute optimisation that avoids re-evaluating tokens it has seen before), never conversation state. The cache is invisible to you — from the API contract's perspective, each call is stateless.

Understanding this is the foundation for everything that follows in the series:

  • Why conversation management belongs in the client, not the server
  • Why context windows matter for cost and latency
  • Why streaming usage requires an explicit opt-in (stream_options.include_usage)

Definition of done

  • OpenAI SDK round-trips against localhost — Block 1 prints a real content and a usage object.
  • Can explain why 5 turns vs 1 turn changes prompt_tokens while the server remembers nothing — Block 2 prints both numbers and the one-paragraph explanation.

Things worth noting now

Context size eats RAM/VRAM. Ollama's default context window is conservative for most models; raising it (e.g. ollama run llama3.2:3b/set parameter num_ctx 32768) costs unified memory. It was not changed for this post.

gpt-oss emits a reasoning channel (Harmony format). The engine applies the template; you still get a normal message.content. The reasoning channel will be segregated at the gateway in Post 3.

No streaming yet. Post 5 covers the streaming usage trap — you must opt in via stream_options.include_usage, otherwise usage is null in streamed responses.

What's next

Post 2 moves from a single model to running multiple models simultaneously and routing requests between them — the first step toward a real local gateway.

The full series and all demo code live in the Portway repository.