惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

酷 壳 – CoolShell
酷 壳 – CoolShell
H
Hacker News: Front Page
P
Palo Alto Networks Blog
T
ThreatConnect
Apple Machine Learning Research
Apple Machine Learning Research
博客园_首页
T
True Tiger Recordings
P
Privacy & Cybersecurity Law Blog
B
Blog
IT之家
IT之家
Last Week in AI
Last Week in AI
F
Full Disclosure
Hacker News: Ask HN
Hacker News: Ask HN
C
Comments on: Blog
Microsoft Azure Blog
Microsoft Azure Blog
C
Cybersecurity and Infrastructure Security Agency CISA
Microsoft Security Blog
Microsoft Security Blog
博客园 - 【当耐特】
N
News and Events Feed by Topic
NISL@THU
NISL@THU
腾讯CDC
雷峰网
雷峰网
Security Latest
Security Latest
李成银的技术随笔
M
Microsoft Research Blog - Microsoft Research
L
LangChain Blog
L
Lohrmann on Cybersecurity
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
C
Check Point Blog
Y
Y Combinator Blog
Recent Announcements
Recent Announcements
博客园 - Franky
N
News | PayPal Newsroom
V
V2EX
A
About on SuperTechFans
The Register - Security
The Register - Security
月光博客
月光博客
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Google Online Security Blog
Google Online Security Blog
MyScale Blog
MyScale Blog
Cisco Talos Blog
Cisco Talos Blog
Vercel News
Vercel News
WordPress大学
WordPress大学
C
Cyber Attacks, Cyber Crime and Cyber Security
The Hacker News
The Hacker News
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
爱范儿
爱范儿
A
Arctic Wolf
L
LINUX DO - 最新话题
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More

Hacker News - Newest: "LLM"

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-⁠Play If an LLM is too expensive it won't be next year "This paper is LLM reviewed" > "this paper is peer-reviewed" StepStone: LLM-Based GPU Kernel Driver Fuzzing via User-Space Libraries [pdf] GitHub - AssimilatedHuman/LLM-Inquisitor: Evaluating AI behaviour under real‑world work conditions to surface issues before they become problems. LLM INQUISITOR identifies failures (drift, instability etc) by observing AI during normal tasks — a tool the industry desperately needs to stem the 85% failure rate. Includes Quick Start, Practitioner’s Guide and Methodology. Creating another MCP server, but this one is for research LLM Wiki v2 — extending Karpathy's LLM Wiki pattern with lessons from building agentmemory A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents Sator Arepo - a Hugging Face Space by akolpakov Customizing an LLM for Enterprise Software Engineering Most AI agent papers stack one LLM with a vector store, we flipped it Evaluating job search ranking with LLM judged NDCG GitHub - quadracollision/llmisp: JSON AST > Clojure Parity Contracts for Polyglot LLM Commerce: A Case Study GitHub - ndom91/llama-dash: The operations layer for your local LLM stack Agentically optimizing LLM prompt cache TTLs for fun and profit Ask HN: What's your go-to LLM for coding? How do you reduce LLM spam in PR reviews? Ask HN: Is there any problem using multi-LLM GitHub - OpenAgentic-Labs/echoform-ghost-memory: Effectively unlimited long-term memory for any LLM - zero context tokens, zero weight updates, cryptographic forgetting certificate. PSA — Posture Sequence Analysis Why More Context Can Make an LLM Worse GitHub - robertoranon/tokoro: A toolbox for building event publish & discovery web sites, apps, feeds, and more GitHub - sermakarevich/chunker: Agentic approach to chunking a document A new EDIT tool for LLM agents LLMCap — Hard Dollar Caps on LLM API Calls MLSys @ WukLab - Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips What political censorship looks like inside an LLM's weights — a mechanistic-interpretability study of Qwen 3.5 Managing metadata is essential in LLM world Fixing LLM Writing with Distribution Fine Tuning twitter.com Show HN: An LLM that's better at writing The local shape of LLM stable regions GitHub - msunda17/impactarbiter-cli The Infrastructure Behind Making Local LLM Agents Useful PostgreSQL ext makes LLM available as an index for similarity searches,inference GitHub - Tetrahedroned/Agent-Braille: Deterministic 8-bit machine-to-machine protocol for AI agent state. ~92% fewer state-tracking tokens on real Claude Code sessions, a proven single-bit-error-safe command code, fully reproducible. Tell HN: Writing an LLM critique/takedown? – Do not use an LLM to write it 🌱 an LLM models our worst behavior Prompt eval cues predicted refusal shifts across 32k LLM rollouts Ask HN: Is Java the ideal language for LLM-assisted coding? AI Foundry – Flat-Fee Unlimited LLM Inference on Blackwell GPUs in NZ LLM tracing with MLflow AI Gateway LLM Performance by Programming Language The LLM Looked Smart. The Metrics Disagreed – tiago.rio.br The Four Horsemen of the LLM Apocalypse GitHub - piqoni/piqo-extension: A good interface is invisible Intro to TLA+ for the LLM Era: Prompt Your Way to Victory Give every tool LLM wiki and bypass Claude Code SSH Throttle The Ultimate LLM Fine-Tuning Guide Ask HN: What LLM models are you using and why? Five Agents, One Browser: Werewolf on Quack + DuckDB LLM models are not ready for orchestrating many agents ClickBook — Offline AI eReader - Apps on Google Play DeepSeek-V4-Flash means LLM steering is interesting again Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention Recent Developments in LLM Architectures: KV Sharing, MHC, Compressed Attention We Built SynapseKit: The Truth About Production LLM Frameworks GitHub - albedan/ai-ml-gpu-bench: A suite to benchmark CPU/GPU Python performance in training ML models and running local LLMs GitHub - chopratejas/headroom: Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server. if you are redlining the LLM, you aren't headlining Most Meaningful Dates on the Web and for an LLM I tested 8 LLM models on Linux without using the GPU RelaxAI – UK sovereign LLM inference at 80% cheaper than OpenAI/Claude GitHub - Andyyyy64/whichllm: Find the local LLM that actually runs — and performs best — on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly. GitHub - krellixlabs/llm-reasoning-research: Curated, annotated research on reasoning gaps in large language models — temporal reasoning, causal reasoning, and beyond. Agentic evals or LLM as a judge? considering cost, time and quality Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces Add an LLM policy for `rust-lang/rust` by jyn514 · Pull Request #1040 · rust-lang/rust-forge GitHub - nimeshnayaju/markdown-parser: A streaming-capable markdown parser, written in TypeScript Dragos Documents First LLM-Assisted Strike on Water Infrastructure in Mexico Alchemize: PyMC's model to replace Stan/PyMC, etc. with an LLM BlitzGraph - The AI-native backend. Pokémon SVG Bench LLM Witch Hunts are getting F'in Irritating bliki: Interrogatory LLM Ctx-opt: TypeScript middleware to trim LLM chats to a token budget Show HN: Local-first Kubernetes YAML visualizer (no server, no LLM) Why Ruby Is the Better Language for LLM-Powered Development Paper page - Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training Show HN: Asciidia – LLM-Powered Game State media control shapes LLM behaviour by influencing training data Small Model Forensics How LLM Inference Works Multi-LLM AI trading agent harness GitHub - crawshaw/yeah: yeah: LLM-powered yes/no CLI tool Predicting Rare LLM Failures with 30× Fewer Rollouts — LessWrong Mechanism Design for Quality-Preserving LLM Advertising I tried to put an on-device LLM in an iOS Share Extension. It didn't fit Show HN: Gox – Strict static analyzer for Go designed for LLM-written code GitHub - torrix-ai/install Show HN: MCPSafe – Free security scanner for MCP servers using 5-LLM consensus Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference Atlas Inference Engine Hi-Vis: one-shot jailbreak disguised as LLM "software patch" reaching 100% ASR Loading/running every LLM with 4M ctx in 3 clicks Free AI Leak Checker — Is Your Prompt Leaking Data? GLiGuard: 16x Faster Safety Moderation with a Small Language Model - Pioneer AI by Fastino Labs Are LLM Useful for Solo Founders
little_helper_tui/letter.md at main · sleepyeldrazi/little_helper_tui
2026-04-10 · via Hacker News - Newest: "LLM"

The only non-LLM-generated file in this repo

Hi. This little project is the culmination of the last month or so of gathering research, prototyping, and testing every strategy I can find or think of regarding extracting value out of LLMs and specifically small/local ones. It is a harness that shares a lot of one of my favourite harness's core ideas, pi, while trying to make it more specialized towards working well with small/local models and in a codebase tailored to being maintained by agents.

Note: It's all my opinions and experiences. I can be wrong. I often am. I might change my mind on half of the things here in a day. This is a snapshot of where I'm at based on what I've seen so far. All projects I mentioned I adore; they are all made by amazing people, and I hope they keep working on cool stuff so that I can keep gushing over them.

How this started

A little over a month ago, a friend asked me to "help him out with his vibe-coded project." As a "write all the code yourself" type at that point, I mostly dismissed it, in part due to me not yet having jumped in the agentic coding sphere and not knowing where to even begin with. After deciding (against my better judgment if sleep is considered) to try Claude Code and Codex and getting blown away by their one-shot capabilities, I became (skeptically) very excited. And burned my rate limits very quickly. This immediately pushed me toward "how can I optimize this, make it not finish in 10 seconds and ideally not having to rely on internet for it?".

My Research and take on the current landscape

This project's focus is a harness that, according to my research, prototypes and testing, should work well with smaller/local models.

The final phase of my research gathering was checking out what people think of the harnesses I've enjoyed using the most in the last couple of weeks, collecting their feedback, trying to pin down local/small model-specific feedback, etc. This was done by a locally running Qwen3.5 35B. Surprisingly capable little fella.

I kind of bin current research into two buckets: the "We need to give as many tools to the models" and "The models understand enough; give them bash and watch them do wonders." Personally, in my testing, neither is wrong per se, but it really feels like the truth is in using both. In the next section, I have examples of both, asking a "bare" chat model a highly technical question and a small model with a framework around it doubling (according to a benchmark) its coding prowess. If only both can be utilized..

The harnesses I've used the most are opencode, pi, forgecode and now hermes. Pi is very heavily in the "just give it bash" camp; the other three lean toward the "have a tool/skill for every occasion." From what I gathered, which I kind of expected, smaller models (14B-35B) tend to be more coherent in pi, especially when locally hosted, and you can't afford 12k of just sys prompt + tools/skills. By comparison, pi is around 2k. Little helper is around 1k but by the time you are reading this, it's probably bigger. What I also noticed is that all 4 have tool-calling problems according to the feedback, and not only on the smaller models.

Little Helper

The spec and all that can be read in the README.md, I want to talk about the behind the scenes here.

This project is built on top of research regarding prompting, orchestration tooling, and context management I've collected while testing different hypotheses in the last month or so. The research was in part found by LLMs and in part by me, summarized by LLMs.

It is built by (the not-so small) model GLM-5.1 by Z.AI. I have my notes, but a cool and useful model.

You might notice its written in C# and not the usual typescript/python/rust trifecta that seems to have taken hold of all those tools. There is a cool benchmark, namely AutoCodeBenchmark that pinned a bunch of models against eachoter on ~200 similar tasks in a bunch of languages, and C#, somewhat surprisingly for me, was one of the best "LLM writes this correctly" languages. I decided to use that info for a project I don't plan on touching the code for too much.

For the harness, I used hermes in this particular project. I like it so far. The main issue I have with it is that it has a lot of bloat for my usual needs.

Personal workflow with LLMs

Couple of things that I have found tend to work well:

  • Spend a bunch of time back and forth with the model just writing out docs and specs. Then make them as concise as reasonable, and read through them again. You will need to restart the session. You need a short, concise, and exhaustive way to onboard new sessions/models. Many of us have had the experience of having to onboard ourselves in documentation-less codebases. We now have access to world-class documentation writers. No more excuses.

  • Just start a new session at ~100k. It's not worth it beyond that, regarding the model. Compaction isn't perfect, and with a good enough onboarding doc, you don't need it.

  • Try to keep one session to one task (and fixing its errors). See above as to why. Essentially the implement -> verify -> repair loop on a per-task basis.

  • Always tell models, "ask me questions if something is unclear, don't assume." Closest to a silver bullet I've seen so far

  • Models usually remember this, but ask for unit tests. Also, after each phase, make them audit and fix issues. Then start a new session and audit again.

My Hot takes and ramblings

LLMs are stupid. Like, really stupid. And I'm not saying that just because I use small models. I've used Codex 5.3, GPT-5.4, Sonnet 4.6, Opus 4.6, Kimi K2.5, GLM-5, GLM-5.1, Minimax M2.7 among many small local ones. They all can't code to save their life. At the same time, I am more than convinced that manual code writing is essentially dead.

When you treat an LLM like an intern that has all the knowledge they could possibly need to write perfect code, but none of the wisdom to make good decisions about what, where, and how to actually use and implement, those things shine. But you need to remember that they are just an autocomplete on steroids, they get really heavily steered from context, sys prompts and user prompts, they should only be told specifically what, where, and how in this instance. You need to know that.

"So juniors are screwed." I personally don't quite agree, but the shift is very real in what a junior is and needs to be. Just knowing syntax and how to center a div, which I admittedly have given up on years ago, isn't enough anymore. You need to learn architecture and algorithms now, as a bare minimum. You need to understand the frontend and the backend, regardless in which part you will work. Good architecture means the LLMs also are more performant and useful in your codebase. "Spaghetti" means no one knows what's happening. We don't have an excuse for spaghetti anymore.

"If you just vibecode, you get rusty and forget things." True. Very true. I also learned an amazing deal. I hate the term, but I will use it, as it is essentially what I've been doing if you go by the "you don't touch any of the code" definition. While I didn't edit any lines of code in this or most of my projects in the last month, I kept looking at the code that is being written the whole time. Most of it I didn't understand. A good amount I did. I couldn't read rust or go to save my life 2 weeks ago. Now I am starting to identify problematic patterns. I only used tmux with a single window, a couple of tabs inside, and a couple of panes in each tab. Now I know about sockets, how good capturing tmux and sending keys is etc. You do gain a lot of "bad habbits" from vibecodding, but if you pay attention, you passively learn stuff you might not have known or forgot about. Not everyone is a terminal wizard, but now you can almost passively become one just from following a model's train of thought. That is worth something.

This might sound like I'm contradicting myself, but try to see the nuance. I think modern models are more than capable enough. Period. We really need to focus on making them efficient and locally runnable and provide them with frameworks and tools better suited for them. This is also why I am currently a fan of Asian lab models a bit more: they feel like a tool made to help you, not a machine trying to act human and replace you.

An example of why I think modern models are capable enough is something from this post about vulnerability research being cooked, specifically when they mention, "Is the Linux KVM hypervisor connected to the hrtimer subsystem, workqueue, or perf_event? The model knows. " I asked this qwen3.5 35B. It knew. I also asked Gemma 4 E2B and it didn't, so at least I am still smarter than a 2B model.

Before I continue, let me be clear: I don't think current benchmarking can be taken as a fact. I do think it is useful, however, to get a rough idea on model performance. There are many problems with current benchmarks (not explicitly stating the harness, which according to terminal-bench's result can swing scores a lot), us not having a deterministic way of evaluating models, etc.

With that in mind, a good example of why we should focus on tooling is ATLAS, a project that doubled the benchmarked score of a 14B model on LiveCodeBench, reaching frontier level for that specific benchmark. That's insane.

Next projects (probably)

Orchestration. I have been tinkering with the idea for a while now, there is good research on it. I want to make it work well with small models, combining my two previous findings, namely models being good generalists out of the box and really good specialist when they are inside the correct framework for the job. I think this is genuinely the "next frontier" that will shift everything. Many harnesses do this already, some do it well. I don't think it's "figured out" quite yet, though.

Local-first phone assistant model. I have a prototype on my github, I like it, it has major problems and needs a full rewrite. Information is moving too fast for a normal person nowadays, our phone's vram stays asleep most of the time, there might be something useful there.

Finetuning. I really want to figure out how (in terms of what data actually works for me) to finetune models to align with my tools. This (along with orchestration) is my next target for research. I know distil is the go-to and works well, but I want to see if people have tried doing tool-specific finetuning (composer 2.0?) and what their takeaways are, and if this pushes small models enough to make them actually usable, even if needing a specific finetune for your specific harness.

Get a good experience finally locally on a normal laptop. That is my main drive currently, I really want to have a useful assistant on my 24gb macbook that doesn't hallucinate code and doesn't get its context filled after reading one file. Turboquant seems to want to help me, and Gemma 4 26B as well as Qwen3.5 35B almost fit. ButIneed a harness/toolset that is optimized for those constraints (like little_helper). Its just a little too simple still, but Gemma 4 E4B is almost usable for this, it's small, fast, and good at tools. It does need a little more time in the oven, though.