Running local models is good now

Lobsters

CIFSwitch: a non-universal Linux local root vulnerability RIPE NCC session fixation: poaching logins with an Atlas probe GNOME 2.20 but its Web Components Agentic Search for Context Engineering – Leonie Monigatti Garnix is shutting down [not OC] akashina.tngl.sh/jjc Concerning Emacs (and Jazz) Nitpicking the shell history scene in ‘Tron: Legacy’ What's cooking on SourceHut? Q2 2026 The tenth OpenPGP email summit Package managers that package package managers Clojure on Fennel part three: parsing WordPress at 23 Finding Miscompiles for Fun, Not Profit GitHub - creusot-rs/creusot: Creusot helps you prove your Rust code is correct. Announcing Rust 1.96.0 | Rust Blog A Love Letter to Neovim sqlite AGENTS.md Am I a Bad Friend? CSS vs. JavaScript • Josh W. Comeau Erlang Ecosystem Foundation - Supporting the BEAM community A brief note about slot access cost in Common Lisp Keyboard latency probe Rethinking the GNOME clipboard issues Back to the Building Blocks’ Building Blocks Tech Notes: Theseus: translating win32 to wasm Fast is better than slow Content-addressed Rust builds (or, what kache actually caches) Intent to Prototype: Embedding API Canada’s Bill C-22 and the security cost of collecting more data 5 PostgreSQL locking behaviors that trip people up okmij.org Stop advertising in your commits! | AksDev GitHub - mplsllc/macsurf: A modern web browser for Classic Mac OS 9 PowerPC. Real CSS3, ES5 JavaScript, native HTTPS — built with CodeWarrior on the Carbon API. Introducing DoomBench - Can Your Data Stack Run DOOM? What are some of your favourite developer tools? Building a Scalable Ingestion Pipeline with Temporal (Part 1) Converting shallow Git bundles into normal repositories Are you a member of any professional associations? What is a harmonic? An interactive comic about additive synthesis How Virtual Tables Work in the Itanium C++ ABI Using SwiftUI to Build a Mac-assed App in 2026 Rust (and Slint) on a jailbroken Kindle. ~jack/lambda-on-lambda - Serverless Haskell on AWS - sourcehut git Human proof for FOSS contributions Extremely simple internet radio controlled via IRC Announcing BABLR Splitting Konsole views from Helix to run tools | AksDev GitHub - yugr/rust-slides Serving files over HTTP three ways: synchronous, epoll, and io_uring update docs with information about building with build.py (#979) · astral-sh/python-build-standalone@c9c40c5 A Simple Makefile Tutorial On C extensions, portability, and alternative compilers Switching to Colemak | Pedro Alves Just How Bad Was The Intel IAPX432? Nix's Substituter List Is Not a Routing Table Accelerating copy_if using SIMD Lambda on Lambda: Serverless Haskell on AWS | Blog Announcing feed-repeat v1.0 Scaling Akvorado BMP RIB with sharding EYG news: A host of CLI improvements, new guides and new effects The social contract of writing JS Crossword C array types are weird; and related topics Flatpak will depend on systemd – OSnews Migrating from Go to Rust | corrode Rust Consulting A portentous reunion Vivado Licensing Options How my minimal, memory-safe Go rsync steers clear of vulnerabilities the entropy layer of a wavelet codec, on its own GitHub - nferhat/fht-compositor: A dynamic tiling Wayland compositor. Debian SE Linux and PinTheft Does bulk memmove speed up std::remove_if? (No.) 声明式部分更新 | Blog | Chrome for Developers Fully in-browser container builds Dianne Skoll's Web Site - Remind The Architecture of Open Source Applications (Volume 1)Berkeley DB Pardon MIE? - ironPeak Blog “Long-Term Support” doesn’t mean what you think Jira IS Turing-Complete May I recommend thinking of Emacs as your Fortress of Solitude hershey Floodgap Gopher-HTTP gateway gopher://thelambdalab.xyz/1cuneiforth/ HP QuickWeb, Singular And Pointless That one time I used Go panics for flow control A new suite of modern tools coming for editing and publishing RFCs From the Tabletop… The Digital Antiquarian Building a Host-Tuned GCC to Make GCC Compile Faster Are we self-sovereign PKI yet? Claw Patrol: an open-source security firewall for agents | Deno Revised^7 Report on Scheme, Large: Procedural Fascicle Draft is now public A Network Allow-List Won't Stop Exfiltration — André Graf From AFSK to Goertzel – µArt.cz Software For My New Home Server Introducing Neptune: Direct3D virtualization for QEMU AI Agent Bankrupted Their Operator While Trying to Scan DN42 - Lan Tian @ Blog mimalloc: A new, high-performance, scalable memory allocator for the modern era Making wl_shm fast The Soul of Maintaining a New Machine - Third Draft | Books in Progress What is Git made of?

jfb · 2026-06-16 · via Lobsters

I’ve been working with local models since they came out, and finally, they’re surprisingly good now.

I have a 2022 M2 Mac with 64 GB RAM and 1TB storage and I’ve used

Mistral 7B
Gemma 3
OpenAI OSS-20B
Qwen 3 MOE, as well as a number of other Qwen variants like Qwen 2.5 Coder

across a lot of different system setups like

raw llama.cpp with Open WebUI
llama-cpp-python
Ollama
llamafiles and
LM Studio

Where are local models now?

Early on, models were slow, hard to use, and just not that accurate for most programming tasks. The idea that local models were severely lagging behind was largely true until, for me, the release of GPT-OSS. I have no concrete scientific evidence of this - my own personal vibe metric of “is a model good enough” is, “do I have to double-check it against an API model”, and GPT-OSS was the first one where I started doing that a lot less often.

As a result, I’ve mostly been using local models as fast, personalized Google for development questions that don’t require recency.

But with the most recent releases from Google in the Gemma 4, family, I’ve finally been able to do agentic coding locally and have loops work at about ~75% the accuracy/speed of frontier models, which is incredible.

I’ve so far been using gemma-4-26b-a4b LM Studio implementation as my default local model. I’ve used the local setup so far to: Refactor a Python script that was a notebook into a repo of 5-6 modules, lint that module to use correct type hints for generics (most frontier models now do this automatically, but not always).

I’ve also used it to proofread some blog posts, write unit tests, and to bootstrap a repo that stands up a two-tower model for recommendations just to see what the agent would do with a blank slate. Here’s what it generated, which was pretty basic but still beyond the scope of anything I would have thought possible last year:

Note that the environment is restricted because I run all my agentic workflows in a Docker container with limited access to execution.

I’m also building an app that surfaces trending topics from Arxiv papers. Out of curiosity, I had Pi go through my past LM Studio session logs and figure out what I was using LM Studio for:

Unsurprisingly, since I’ve been working on Rijksearch,

None of these are groundbreaking tasks (again, a lot of personalized Google/docs lookups), and working on them does give my GPUs and RAM a workout and the K-V cache grows to 64 GB RAM.

But, the larger story for me is that these kinds of tasks, even as simple as they are, used to be impossible for local models as recently as 6 months ago.

Gemma-4-12b-qat just came out but I’ve already also really been impressed with its performance relative to its size. The model architecture itself is really interesting and proposes a bunch of interesting questions like, “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.

Running agentic models locally today

But don’t take my word for any of this, try it out for yourself! You’ll need a local model inference engine, an agentic harness, and the local model artifact if you want to try to run local agentic flows. You’ll need to set up the harness to point at your local inference endpoint, the downloaded model artifact served via the inference engine.

For my local setup, I’m currently using Pi as the agent harness and LM Studio as the inference server, although it would likely be faster if I just used llama.cpp directly - a potential direction for a future experiment.

This post was very easy to follow to set up agentic coding with Pi and LM Studio, although I did make a few tweaks to the post’s setup.

Model: The post recommends Gemma 26B A4B , but gemma-4-12b-qat is more recent and smaller and faster, without much sacrifice in accuracy.
Security: I run every Pi session in a Docker container and give it permissions only to bash so that it can’t run Python code or do web browsing, although I do plan to allow curl in a different image for some research work I’m doing.
Agent Harness Config: Since I run everything in Docker, I edited Pi’s models.json in order to get Pi to talk to the model.

"lmstudio": {
      "baseUrl": "http://host.docker.internal:1234/v1",
      "api": "openai-completions",
      "apiKey": "not-needed",
      "models": [
        {
          "id": "google/gemma-4-12b-qat",
          "input": [
            "text",
            "image"
          ]
        }
      ]
    }

Here’s my Docker Compose config:

services:
  pi:
    build:
      context: .
      dockerfile: Dockerfile
    image: pi-agent:0.74.0
    init: true
    stdin_open: true
    tty: true
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:-}
      OPENAI_API_KEY: ${OPENAI_API_KEY:-not-needed}
      GEMINI_API_KEY: ${GEMINI_API_KEY:-}
      OPENAI_API_BASE: ${OPENAI_API_BASE:-http://host.docker.internal:1234/v1} # note that you'll need to specify a base if you also use OpenAI to access OpenAI's actual completions endpoint
      WHATEVER_API_KEY: ${WHATEVER_API_KEY:-}
    volumes:
      - ${HOME}/.pi/agent/models.json:/config/models.json
      - ${WORKSPACE:-.}:/workspace
      - pi-config:/config
      - pi-sessions:/sessions
    working_dir: /workspace

volumes:
  pi-config:
  pi-sessions:

and here’s the bash script that runs pi .

#!/usr/bin/env bash

# Pi — Start the containerized Pi agent.

# Directory containing this script and the compose files.
SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Workspace to mount into the container. 
WORKSPACE_DIR="${WORKSPACE:-$(pwd)}"
case "$WORKSPACE_DIR" in
  /*) ;; 
  *)  WORKSPACE_DIR="$(cd -- "$WORKSPACE_DIR" && pwd)" ;; 
esac
export WORKSPACE="$WORKSPACE_DIR"

sandbox="${PI_SANDBOX:-0}"
pi_args=()

while (($#)); do
  case "$1" in

    --sandbox)    sandbox=1 ;;
    --no-sandbox) sandbox=0 ;;
    *)            pi_args+=("$1") ;;

  esac
  shift
done

compose_files=( -f "$SCRIPT_DIR/docker-compose.yml" )
if [[ "$sandbox" == "1" ]]; then
  # an even more secure sandbox
  compose_files+=( -f "$SCRIPT_DIR/docker-compose.sandbox.yml" )
fi

# Derive a container name from the workspace directory's basename.
# Sanitize to characters Docker accepts: [a-zA-Z0-9][a-zA-Z0-9_.-]*
repo_slug="$(basename -- "$WORKSPACE_DIR" | tr -c 'a-zA-Z0-9_.-' '-' | sed 's/^-*//')"
[[ -z "$repo_slug" ]] && repo_slug="workspace"
container_name="pi-${repo_slug}-$$"

api_key_args=(
  -e OPENAI_API_KEY
  -e DEEPSEEK_API_KEY
  -e ANTHROPIC_API_KEY
  -e GEMINI_API_KEY
)

cmd=(
  docker compose
  --project-directory "$SCRIPT_DIR"
  "${compose_files[@]}"
  run --rm
  --name "$container_name"
  "${api_key_args[@]}"
  pi
)

if ((${#pi_args[@]})); then
  cmd+=("${pi_args[@]}")
fi

exec "${cmd[@]}"

I build the Docker container and make changes to the files in its own repo. Then, I run Pi in the repo I’m working in, which spins up Docker so that Pi can’t wipe files or directories by acting on my physical hard drive. This also enables Pi running in the container to see my custom model json config by shipping it into the container. All of this has been working fairly well for my experiments.

There are still issues with local models: inference can be slow, context windows are small and limited to your own hardware, and the ecosystem, although it’s made a ton easier by tooling like LM Studio and HuggingFace’s Use This Model button. Early releases suffer from prompt template mismatches. But, these are usually patched extremely quickly. Needless to say, I’m not sure this is ready for production software development quite yet.

The benefits, though, are numerous and the ecosystem critical to invest in, particularly now. One of the very cool parts of local models is you can introspect almost everything, like watching the token inference process live,

and watching tokens in/out.

You can do things like change the local context window and watch performance improve or degrade, and really dig into how your tokens are processed on the GPU. You can change the system prompt, the quantizations. You can pit models against each other. You can also change and introspect the harness side.

The possibilities are endless, and the tools only keep getting better.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Lobsters

Where are local models now?

Running agentic models locally today