oMLX — LLM inference, optimized for your Mac

Hacker News - Newest: "LLM"

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-⁠Play If an LLM is too expensive it won't be next year "This paper is LLM reviewed" > "this paper is peer-reviewed" StepStone: LLM-Based GPU Kernel Driver Fuzzing via User-Space Libraries [pdf] GitHub - AssimilatedHuman/LLM-Inquisitor: Evaluating AI behaviour under real‑world work conditions to surface issues before they become problems. LLM INQUISITOR identifies failures (drift, instability etc) by observing AI during normal tasks — a tool the industry desperately needs to stem the 85% failure rate. Includes Quick Start, Practitioner’s Guide and Methodology. Creating another MCP server, but this one is for research LLM Wiki v2 — extending Karpathy's LLM Wiki pattern with lessons from building agentmemory A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents Sator Arepo - a Hugging Face Space by akolpakov Customizing an LLM for Enterprise Software Engineering Most AI agent papers stack one LLM with a vector store, we flipped it Evaluating job search ranking with LLM judged NDCG GitHub - quadracollision/llmisp: JSON AST > Clojure Parity Contracts for Polyglot LLM Commerce: A Case Study GitHub - ndom91/llama-dash: The operations layer for your local LLM stack Agentically optimizing LLM prompt cache TTLs for fun and profit Ask HN: What's your go-to LLM for coding? How do you reduce LLM spam in PR reviews? Ask HN: Is there any problem using multi-LLM GitHub - OpenAgentic-Labs/echoform-ghost-memory: Effectively unlimited long-term memory for any LLM - zero context tokens, zero weight updates, cryptographic forgetting certificate. PSA — Posture Sequence Analysis Why More Context Can Make an LLM Worse GitHub - robertoranon/tokoro: A toolbox for building event publish & discovery web sites, apps, feeds, and more GitHub - sermakarevich/chunker: Agentic approach to chunking a document A new EDIT tool for LLM agents LLMCap — Hard Dollar Caps on LLM API Calls MLSys @ WukLab - Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips What political censorship looks like inside an LLM's weights — a mechanistic-interpretability study of Qwen 3.5 Managing metadata is essential in LLM world Fixing LLM Writing with Distribution Fine Tuning twitter.com Show HN: An LLM that's better at writing The local shape of LLM stable regions GitHub - msunda17/impactarbiter-cli The Infrastructure Behind Making Local LLM Agents Useful PostgreSQL ext makes LLM available as an index for similarity searches,inference GitHub - Tetrahedroned/Agent-Braille: Deterministic 8-bit machine-to-machine protocol for AI agent state. ~92% fewer state-tracking tokens on real Claude Code sessions, a proven single-bit-error-safe command code, fully reproducible. Tell HN: Writing an LLM critique/takedown? – Do not use an LLM to write it 🌱 an LLM models our worst behavior Prompt eval cues predicted refusal shifts across 32k LLM rollouts Ask HN: Is Java the ideal language for LLM-assisted coding? AI Foundry – Flat-Fee Unlimited LLM Inference on Blackwell GPUs in NZ LLM tracing with MLflow AI Gateway LLM Performance by Programming Language The LLM Looked Smart. The Metrics Disagreed – tiago.rio.br The Four Horsemen of the LLM Apocalypse GitHub - piqoni/piqo-extension: A good interface is invisible Intro to TLA+ for the LLM Era: Prompt Your Way to Victory Give every tool LLM wiki and bypass Claude Code SSH Throttle The Ultimate LLM Fine-Tuning Guide Ask HN: What LLM models are you using and why? Five Agents, One Browser: Werewolf on Quack + DuckDB LLM models are not ready for orchestrating many agents ClickBook — Offline AI eReader - Apps on Google Play DeepSeek-V4-Flash means LLM steering is interesting again Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention Recent Developments in LLM Architectures: KV Sharing, MHC, Compressed Attention We Built SynapseKit: The Truth About Production LLM Frameworks GitHub - albedan/ai-ml-gpu-bench: A suite to benchmark CPU/GPU Python performance in training ML models and running local LLMs GitHub - chopratejas/headroom: Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server. if you are redlining the LLM, you aren't headlining Most Meaningful Dates on the Web and for an LLM I tested 8 LLM models on Linux without using the GPU RelaxAI – UK sovereign LLM inference at 80% cheaper than OpenAI/Claude GitHub - Andyyyy64/whichllm: Find the local LLM that actually runs — and performs best — on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly. GitHub - krellixlabs/llm-reasoning-research: Curated, annotated research on reasoning gaps in large language models — temporal reasoning, causal reasoning, and beyond. Agentic evals or LLM as a judge? considering cost, time and quality Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces Add an LLM policy for `rust-lang/rust` by jyn514 · Pull Request #1040 · rust-lang/rust-forge GitHub - nimeshnayaju/markdown-parser: A streaming-capable markdown parser, written in TypeScript Dragos Documents First LLM-Assisted Strike on Water Infrastructure in Mexico Alchemize: PyMC's model to replace Stan/PyMC, etc. with an LLM BlitzGraph - The AI-native backend. Pokémon SVG Bench LLM Witch Hunts are getting F'in Irritating bliki: Interrogatory LLM Ctx-opt: TypeScript middleware to trim LLM chats to a token budget Show HN: Local-first Kubernetes YAML visualizer (no server, no LLM) Why Ruby Is the Better Language for LLM-Powered Development Paper page - Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training Show HN: Asciidia – LLM-Powered Game State media control shapes LLM behaviour by influencing training data Small Model Forensics How LLM Inference Works Multi-LLM AI trading agent harness GitHub - crawshaw/yeah: yeah: LLM-powered yes/no CLI tool Predicting Rare LLM Failures with 30× Fewer Rollouts — LessWrong Mechanism Design for Quality-Preserving LLM Advertising I tried to put an on-device LLM in an iOS Share Extension. It didn't fit Show HN: Gox – Strict static analyzer for Go designed for LLM-written code GitHub - torrix-ai/install Show HN: MCPSafe – Free security scanner for MCP servers using 5-LLM consensus Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference Atlas Inference Engine Hi-Vis: one-shot jailbreak disguised as LLM "software patch" reaching 100% ASR Loading/running every LLM with 4M ctx in 3 clicks Free AI Leak Checker — Is Your Prompt Leaking Data? GLiGuard: 16x Faster Safety Moderation with a Small Language Model - Pioneer AI by Fastino Labs Are LLM Useful for Solo Founders

2026-04-10 · via Hacker News - Newest: "LLM"

Local AI,
no more waiting on your Mac.

macOS-native MLX server with smart caching. Claude Code, OpenClaw, and Cursor respond in 5 seconds, not 90.

Apache 2.0 · Apple Silicon · macOS 15+ ·

oMLX Dashboard — Dark Mode

oMLX Dashboard — Light Mode

tok/s prompt processing

throughput with batching

<5s

TTFT from 2nd turn

∞

SSD KV cache (no eviction)

Qwen3.5-122B-A10B-4bit · M3 Ultra 512GB

Why oMLX

Built for the way
agents actually work.

Coding agents invalidate the KV cache dozens of times per session. oMLX persists every cache block to SSD — so when the agent circles back to a previous prefix, it's restored from disk in milliseconds, not recomputed from scratch.

01 — CORE

Paged SSD KV caching

Cache blocks are persisted to disk in safetensors format. Two-tier architecture: hot blocks stay in RAM, cold blocks go to SSD with LRU policy. Previously seen prefixes are restored across requests and server restarts — never recomputed.

02 — THROUGHPUT

Continuous batching

Handles concurrent requests through mlx-lm's BatchGenerator. Up to 4.14× generation speedup at 8× concurrency. No more queuing behind a single request.

03 — APP

Native macOS menu bar app

Start, stop, and monitor the server from your menu bar. Web dashboard for model management, chat, and real-time metrics. Signed, notarized, with in-app auto-update. Not Electron.

04 — MODELS

Multi-model serving

LLM, VLM, embedding, and reranker models loaded simultaneously. LRU eviction when memory runs low. Browse and download models directly from the admin dashboard.

05 — API

OpenAI + Anthropic drop-in

Compatible with Claude Code, OpenClaw, Cursor, and any OpenAI-compatible client. Native /v1/messages Anthropic endpoint. Web dashboard generates the exact config command for each tool.

06 — TOOLS

Tool calling + MCP

Supports all major tool calling formats: JSON, Qwen, Gemma, GLM, MiniMax. MCP tool integration and tool result trimming for oversized outputs. Configurable per model.

Performance

Real numbers,
real hardware.

All benchmarks on M3 Ultra 512GB. Single request and continuous batching across four popular models.

Context	Prompt TPS	Token TPS	Peak Mem
1k	768 tok/s	56.6 tok/s	65.5 GB
8k	941 tok/s	54.0 tok/s	69 GB
16k	886 tok/s	48.3 tok/s	71 GB
32k	765 tok/s	42.4 tok/s	73 GB

Continuous batching

pp1024 / tg128 · no cache reuse

Batch	Token TPS	Speedup
1×	56.6 tok/s	1.00×
2×	92.1 tok/s	1.63×
4×	135.1 tok/s	2.39×
8×	190.2 tok/s	3.36×

Context	Prompt TPS	Token TPS	Peak Mem
1k	1,462 tok/s	58.7 tok/s	80 GB
8k	2,009 tok/s	54.9 tok/s	83 GB
16k	1,896 tok/s	52.3 tok/s	83 GB
32k	1,624 tok/s	45.1 tok/s	85 GB

Continuous batching

pp1024 / tg128 · no cache reuse

Batch	Token TPS	Speedup
1×	58.7 tok/s	1.00×
2×	100.5 tok/s	1.71×
4×	164.0 tok/s	2.79×
8×	243.3 tok/s	4.14×

Context	Prompt TPS	Token TPS	Peak Mem
1k	588 tok/s	34.0 tok/s	227 GB
4k	704 tok/s	30.3 tok/s	228 GB
8k	663 tok/s	26.3 tok/s	229 GB
32k	426 tok/s	14.9 tok/s	235 GB

Continuous batching

pp1024 / tg128 · no cache reuse

Batch	Token TPS	Speedup
1×	34.0 tok/s	1.00×
2×	49.7 tok/s	1.46×
4×	109.8 tok/s	3.23×
8×	126.3 tok/s	3.71×

Context	Prompt TPS	Token TPS	Peak Mem
1k	187 tok/s	16.7 tok/s	392 GB
4k	180 tok/s	13.7 tok/s	394 GB
16k	117 tok/s	12.0 tok/s	403 GB
32k	78 tok/s	10.7 tok/s	415 GB

Continuous batching

pp1024 / tg128 · no cache reuse

Batch	Token TPS	Speedup
1×	16.7 tok/s	1.00×
2×	23.7 tok/s	1.42×
4×	47.0 tok/s	2.81×
8×	60.3 tok/s	3.61×

"The Qwen3.5 models running on oMLX is so fast that it makes running local AI on Mac worthwhile. It is so much faster than LMStudio and the tool calling is so much more reliable."

— GitHub comment, issue #62

FAQ

Common questions.

Ollama and LM Studio cache the KV state in memory, but when the context shifts mid-session — which happens constantly with coding agents — the entire cache gets invalidated and recomputed from scratch. oMLX persists every KV cache block to SSD, so previously cached portions are always recoverable. TTFT drops from 30–90 seconds to under 5 seconds on long contexts.

Apple Silicon (M1 or later) with macOS 15+. 16GB RAM is the minimum, but 64GB+ is recommended for comfortable use with larger models. The sweet spot for daily coding work is an M-series Pro/Max with 64GB+.

Yes. oMLX provides both OpenAI-compatible (/v1/chat/completions) and Anthropic-compatible (/v1/messages) API endpoints. It works as a drop-in backend for all three. The web dashboard has a one-click config generator — select your model, copy the command, paste into terminal.

No. oMLX reuses your existing LM Studio model directory — just point it at your models folder. You can also browse and download models directly from the built-in HuggingFace downloader in the admin dashboard.

Any MLX-format model from HuggingFace. This includes Qwen, LLaMA, Mistral, Gemma, DeepSeek, MiniMax, GLM, and more. Reasoning models (DeepSeek, MiniMax, Qwen) get automatic <think> tag handling. Vision-Language Models are supported since v0.2.0 with the same paged SSD caching.

Get started

Up and running
in two minutes.

Download the DMG or install from source. Reuses your existing LM Studio model directory — no re-download needed.

macOS App Recommended

Drag to Applications. The welcome screen walks you through model directory, server start, and first model download. Signed and notarized.

Download DMG

From source

Requires Python 3.10+ and Apple Silicon. Connects to any OpenAI-compatible client on localhost:8000.

git clone https://github.com/jundot/omlx
cd omlx && pip install -e .

omlx serve --model-dir ~/models

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hacker News - Newest: "LLM"

Local AI,no more waiting on your Mac.

Built for the wayagents actually work.

Real numbers,real hardware.

Common questions.

Up and runningin two minutes.

Local AI,
no more waiting on your Mac.

Built for the way
agents actually work.

Real numbers,
real hardware.

Up and running
in two minutes.