Distributing LLM inference in DwarfStar
surprisetalk · 2026-05-25 · via Hacker News - Newest: "LLM"
antirez 45 minutes ago. 1390 views.
High-end NVIDIA cards, and the servers and power needed to run them, cost a lot of money, especially if you plan to reach enough VRAM to run massive models. The alternative, so far, has been Apple hardware, or the DGX Spark which, even if severely limited by its memory bandwidth, still runs LLM prompt processing (prefill) fast enough. The Mac Studio provided up to 512GB of unified memory, with modest memory bandwidth (but much better than the Spark) and compute, at a price that was, all things considered and given the current situation, relatively fair.

For instance, with DwarfStar the Mac Studio M3 Ultra 512GB can run DeepSeek v4 PRO at 150 t/s prefill and ~10-13 t/s decoding: not great, but usable for certain use cases. Even 2-bit quantized, DeepSeek v4 PRO holds up very well, like Flash at the same quantization (today I made PRO write a C compiler, I'll publish the video soon). I would not consider it a trivial thing to run a frontier model at home with ~12k of total spending.

One could expect this to get better and better, but the outlook appears cloudy. There is almost zero hope that NVIDIA setups will get less expensive, and even a small company can't easily afford to purchase and operate a small data center for local inference. At the same time, the RAM shortage makes it not exactly likely that we will see a Mac Studio with an M5 Ultra, with maybe 1.2 TB/s of memory bandwidth and more compute (the M5 Max is already faster, compute-wise, and has Neural Accelerators inside each GPU core that help with certain models).

So the current situation for local inference is that the best machine is probably a laptop. The M5 Max 128GB can run DeepSeek v4 Flash and Mimo V2.5, 2-bit quantized, at very decent prefill and decoding speeds. We are talking about ~500 t/s prefill and ~35-40 t/s decoding, with a very acceptable performance drop as the context size increases. At a cost of 6-7k depending on the configuration, this is currently one of the best deals.

If this is the situation, then for local inference projects in general, and for DwarfStar in particular, looking at distributed inference starts to be interesting. What can we do if we have two, three, or four MacBook M5 Max systems? Or two M3 Ultras with 512GB of RAM?

Traditionally there are two main ways to run distributed inference. One is to double the available memory by loading 50% of the transformer layers on computer A and the remaining 50% on computer B, then running the inference sequentially. In this case only the activations have to be sent around, which is conceptually very simple, and with some micro-batching magic it is possible not just to double the memory but, in theory, to substantially increase the prompt processing speed, which is not bad at all. Decoding does not get faster: to generate a single token you have to wait for the first layers on machine A, then the remaining layers on machine B, and so forth. At least less heat is produced on each machine, so a sustained load becomes possible. This means, for example, that the lucky ones who have two Mac Studio 512GB machines could run the full-size DeepSeek v4 PRO (even though the 2-bit quants already run very, very well) and, with micro-batching, even enjoy a faster prefill.
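To make the prefill versus decoding difference concrete, here is a minimal timing sketch of the layer split. It is an idealized model, not a measurement: it assumes every stage takes the same time per micro-batch and that moving activations between machines costs nothing. The function names are made up for this example.

```python
def pipeline_time(num_stages: int, num_microbatches: int, stage_time: float) -> float:
    """Idealized wall-clock time with one machine per stage.

    Once the pipeline is full, all stages work on different micro-batches
    at the same time, so the total is the fill time plus one step per
    remaining micro-batch.
    """
    return (num_stages + num_microbatches - 1) * stage_time


def sequential_time(num_stages: int, num_microbatches: int, stage_time: float) -> float:
    """Baseline: the same work with no overlap between stages."""
    return num_stages * num_microbatches * stage_time


if __name__ == "__main__":
    # Prefill: the prompt can be cut into several micro-batches.
    print(sequential_time(2, 8, 1.0) / pipeline_time(2, 8, 1.0))
    # Decoding: one token at a time means a single micro-batch, so no gain.
    print(sequential_time(2, 1, 1.0) / pipeline_time(2, 1, 1.0))
```

With two machines the prefill speedup in this model approaches 2x as the number of micro-batches grows, while the decoding ratio stays at exactly 1.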

Another approach is to use Apple RDMA to parallelize the execution across the two machines: a vertical split, basically. For instance, one could load the same 2-bit quants on machines A and B, so that both fit and each side has *all* the routed experts. Then, for each layer, we could do the coordination needed to execute half the experts on machine A and half on machine B (note that both machines have all the experts, so whatever the router says, we can send 50% of the computation to the other machine, and the activations are tiny). This is more viable for PRO, which has much larger routed experts, so the communication penalty is less noticeable. Whether this can be made to work well remains to be seen.
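A toy sketch of why this split is legal: the MoE output is a weighted sum over the selected experts, so the sum can be cut in two and each half computed on a different machine. Everything here is hypothetical (the experts are plain scaling functions, and the "remote" half runs locally); it only shows that the split result matches the single-machine result.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_expert(scale: float) -> Expert:
    """Toy stand-in for a routed expert: it just scales the activations."""
    return lambda x: [scale * v for v in x]


def run_experts(experts: Sequence[Expert], selected: Sequence[int],
                weights: Sequence[float], x: Vector) -> Vector:
    """Weighted sum of the outputs of the selected experts."""
    out = [0.0] * len(x)
    for idx, w in zip(selected, weights):
        y = experts[idx](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out


def run_experts_split(experts: Sequence[Expert], selected: Sequence[int],
                      weights: Sequence[float], x: Vector) -> Vector:
    """Same layer, with the selected experts divided between two machines.

    Both machines hold every expert, so any half of the router's choice
    can be handed to the other side. Only x goes out and one partial
    sum comes back.
    """
    half = len(selected) // 2
    part_a = run_experts(experts, selected[:half], weights[:half], x)  # machine A
    part_b = run_experts(experts, selected[half:], weights[half:], x)  # machine B (remote in practice)
    return [a + b for a, b in zip(part_a, part_b)]
```

In a real setup the cost to watch is the round trip per layer, which is why larger experts (more compute per byte sent) make the idea more attractive.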

There is also tensor parallelism, you are thinking, right? But I bet this is not viable at all with the communication speed we have between two Apple computers, two DGX Sparks, and so forth (go read up on the speed of NVLink). The magic of the two approaches above is that you have to send very little data.
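A rough way to see the gap is to count how often the machines must talk per generated token. The sketch below assumes a Megatron-style tensor-parallel layout, which needs two all-reduces per transformer layer in the forward pass, and a naive two-machine exchange where each side sends one hidden-size vector per all-reduce. The layer count and hidden size you pass in are up to you; nothing here is a measured figure.

```python
from dataclasses import dataclass


@dataclass
class Traffic:
    messages: int    # synchronization points per generated token
    bytes_sent: int  # activation bytes each machine sends per token


def pipeline_traffic(num_machines: int, hidden_size: int,
                     bytes_per_value: int = 2) -> Traffic:
    """Layer split: one activation handoff at each machine boundary."""
    handoffs = num_machines - 1
    return Traffic(handoffs, handoffs * hidden_size * bytes_per_value)


def tensor_parallel_traffic(num_layers: int, hidden_size: int,
                            bytes_per_value: int = 2) -> Traffic:
    """Tensor parallelism: assumed two all-reduces per layer."""
    all_reduces = 2 * num_layers
    return Traffic(all_reduces, all_reduces * hidden_size * bytes_per_value)
```

The byte counts stay small in both cases during decoding. What grows with tensor parallelism is the number of synchronization points, each paying the full link latency, and that is where a slow interconnect hurts.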

Ok, so far I bet you are thinking: this is the same shit everybody knows about running LLMs in a parallel fashion, and indeed this is true. But this post was conceived to reach this exact point. What if we could, instead, parallelize two Macs or DGXs in a completely different way? Open weights models are now in a golden age: we have plenty, and many are very powerful. In the 128GB 2-bit quants class there are many interesting ones: Minimax M2.7, Mimo V2.5, DeepSeek v4 Flash, and a few more. At the same time, it was recently noted that LLM ensembling (https://arxiv.org/abs/2502.18036) is an understudied possibility that allows two models to run in a completely shared-nothing way on two different machines, only combining the logits or selecting the best continuation at the end. There are different ways to do that, and it works even if the two models have different vocabularies: you can pick the continuation where the perplexity is lower (that is, pick the model which is more sure: it's like a two-expert MoE where the routing is implicit), and it is even possible to combine the logits (with some complexities given by the different vocabularies) and sample from there. More recent papers suggest that mixing the two techniques is the best approach. Anyway: these techniques seem to really work, and models appear to do better together than alone. It's as if their knowledge improves because each one brings its own point of view on what to say next.
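The two combination strategies can be sketched in a few lines. This is a minimal illustration, not what any specific paper does: `pick_most_confident` takes each model's continuation together with the log-probabilities that model assigned to its own tokens and keeps the one with the lowest perplexity, and `mix_distributions` averages next-token probabilities under the simplifying assumption that both models share one vocabulary (the different-vocabulary case needs an alignment step that is not shown). All names are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_most_confident(candidates: Sequence[Tuple[str, Sequence[float]]]) -> str:
    """Keep the continuation whose own model was the most sure about it."""
    return min(candidates, key=lambda c: perplexity(c[1]))[0]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_distributions(logits_a: Sequence[float],
                      logits_b: Sequence[float]) -> List[float]:
    """Average the two next-token distributions (shared vocabulary assumed)."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return [(x + y) / 2 for x, y in zip(pa, pb)]
```

One caveat on the selection variant: with different tokenizers the two continuations have different token counts, so per-token perplexities are not strictly comparable and some normalization (per character or per byte, for instance) may be needed.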

Maybe this is the most logical third approach to try, besides the first two. I really hope to find the time to play more with all of this in the coming months.
