Show HN: Needle distilled Gemini tool calling into 26M parameters — technical read, zero hype
Juan Torchia · 2026-05-17 · via DEV Community

I was in the middle of reviewing my Ollama pipeline when the HN post appeared: Needle, a 26M parameter model distilled from Gemini specifically for tool calling. My first reaction was skeptical. 26M sounds like a toy. Then I read more carefully and understood that the interesting point isn't the size — it's the problem they're actually attacking.

Here's my technical read. No euphoria, no easy dismissal.


The real problem behind Needle and Gemini tool calling distillation

My thesis is this: the bottleneck in systems with external tools isn't the LLM's general reasoning — it's the parsability of the output. If the model produces malformed JSON, calls functions with wrong arguments, or hallucinates tool names that don't exist, the whole system breaks — doesn't matter how "intelligent" the model is at other tasks.

I ran into this directly while building agent loops with Claude Code. The most fragile part was never the reasoning; it was the reliability of the data contract. It reminded me of when I resisted TypeScript for years thinking types were bureaucracy. Then I understood that most avoidable failures start as poorly expressed data contracts. Tool calling is exactly the same: a model can be brilliant in prose and terrible at respecting a strict JSON schema under latency pressure.

Needle attacks that specific point: it takes Gemini's tool calling behavior — which is consistent and well-structured — and distills it into a small, specialized model. The hypothesis is that for this specific task, 26M parameters trained on the right behavior can outperform giant generalist models that were never fine-tuned to respect function schemas with precision.

Is it true? In their own benchmarks, according to the project repo, yes. In my own production workloads, I don't know yet, and that difference matters.


What knowledge distillation is and why it matters here

Knowledge distillation is a technique where a large model — the teacher — generates outputs that are then used to train a smaller model — the student. The student doesn't learn from raw data: it learns to imitate the teacher's behavior on the distributions that matter most.

# Simplified concept of the distillation pipeline for tool calling:
# 1. Teacher (Gemini) generates thousands of correct tool calling examples
# 2. Student (Needle, 26M) trains on those examples
# 3. The student learns the teacher's output distribution, not hand-written rules
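
The three steps above can be sketched as a filter over teacher outputs. This is a minimal sketch, not Needle's actual pipeline, which this post does not describe: `TeacherSample`, `TrainingExample`, and `buildDistillationSet` are hypothetical names, and the rule "keep only parseable calls to known tools" is my assumption about what "correct" means in step 1.

```typescript
// Hypothetical shapes for this sketch; not taken from the Needle repo.
interface TeacherSample {
  prompt: string;        // the user request shown to the teacher
  teacherOutput: string; // raw text the teacher produced
}

interface TrainingExample {
  input: string;
  target: string;
}

// Keep only teacher outputs that parse as JSON and name a known tool.
// The student is then trained on (input, target) pairs, not on hand-written rules.
function buildDistillationSet(
  samples: TeacherSample[],
  knownTools: Set<string>
): TrainingExample[] {
  const examples: TrainingExample[] = [];
  for (const sample of samples) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(sample.teacherOutput);
    } catch {
      continue; // malformed teacher output never reaches the student
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const name = (parsed as { name?: unknown }).name;
    if (typeof name !== "string" || !knownTools.has(name)) continue;
    // Re-serialize so every target uses the same canonical formatting.
    examples.push({ input: sample.prompt, target: JSON.stringify(parsed) });
  }
  return examples;
}
```

The point of the filter is that whatever survives it is what the student learns to imitate, so any mistake that slips through becomes part of the student's behavior.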

For tool calling, this makes particular sense. You don't need the model to know world history. You need it to take a tool definition like this one:

// Tool definition — the model has to respect this 100%
const tools = [
  {
    name: "search_product",
    description: "Searches for a product by ID in the catalog",
    parameters: {
      type: "object",
      properties: {
        product_id: { type: "string" },
        include_stock: { type: "boolean" }
      },
      required: ["product_id"]
    }
  }
]

Produce exactly:

{
  "name": "search_product",
  "arguments": {
    "product_id": "SKU-4821",
    "include_stock": true
  }
}

Not some creative variation with renamed keys, wrong types, or invented fields. Small generalist models fail at this constantly. If Needle solves it reliably, the use case exists.
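
To make "renamed keys, wrong types, or invented fields" concrete, here is a minimal sketch of a strict check for a model's raw output against the tool definition above. The names `ToolDefinition` and `validateToolCall` are hypothetical, and the sketch only handles flat schemas with `string`, `boolean`, and `number` properties; a real system would more likely use a full JSON Schema validator.

```typescript
// Hypothetical types for this sketch, mirroring the tool definition above.
type ParamType = "string" | "boolean" | "number";

interface ToolDefinition {
  name: string;
  parameters: {
    properties: Record<string, { type: ParamType }>;
    required: string[];
  };
}

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Returns every contract violation found; an empty array means the call is acceptable.
function validateToolCall(raw: string, tools: ToolDefinition[]): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["output is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["output is not a JSON object"];
  }
  const call = parsed as { name?: unknown; arguments?: unknown };
  const tool = tools.find((t) => t.name === call.name);
  if (!tool) return [`unknown tool: ${String(call.name)}`];

  const args = call.arguments;
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return ["arguments must be a JSON object"];
  }

  const errors: string[] = [];
  for (const key of tool.parameters.required) {
    if (!hasOwn(args, key)) errors.push(`missing required argument: ${key}`);
  }
  for (const [key, value] of Object.entries(args)) {
    if (!hasOwn(tool.parameters.properties, key)) {
      errors.push(`invented argument: ${key}`);
    } else if (typeof value !== tool.parameters.properties[key].type) {
      errors.push(`wrong type for ${key}`);
    }
  }
  return errors;
}
```

A renamed key shows up as two violations at once: the required argument is missing and an unexpected one appeared. That is usually the fastest way to spot a model that "almost" follows the schema.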


How to test it in Ollama: a reproducible checklist

If you want to validate whether a model like Needle has a place in your stack, the criterion shouldn't be someone else's benchmark. It should be your own set of tools under your system's real conditions.

# Step 1: Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh

# Step 2: When the model is available in the Ollama registry, pull directly
# (check availability at https://ollama.com/search)
ollama pull needle  # tentative name — verify the official registry

# Step 3: Prepare your own tool calling test suite
# Don't use the model README's examples; use YOUR real tools

// tool-calling-test.ts
// Validation criteria I'd use to evaluate any small model

interface TestResult {
  case: string;
  expected: object;
  received: string;
  validJson: boolean;
  schemaRespected: boolean;
  latencyMs: number;
}

async function evaluateToolCallingModel(
  model: string,
  cases: Array<{ prompt: string; expectedSchema: object }>
): Promise<TestResult[]> {
  const results: TestResult[] = [];

  for (const testCase of cases) {
    const start = Date.now();

    // Call the model via Ollama API
    const response = await fetch("http://localhost:11434/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: model,
        messages: [{ role: "user", content: testCase.prompt }],
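        // Ollama expects each tool wrapped as { type: "function", function: {...} };
        // expectedSchema is assumed to be a tool definition (name, description, parameters)
        tools: [{ type: "function", function: testCase.expectedSchema }],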
        stream: false,
      }),
    });

    const data = await response.json();
    const latency = Date.now() - start;

    // Validate if the JSON is parseable and if it respects the schema
    let validJson = false;
    let schemaRespected = false;
    let received = "";

    try {
      // The tool_call should be in message.tool_calls[0]
      const toolCall = data.message?.tool_calls?.[0];
      received = JSON.stringify(toolCall ?? data.message?.content ?? "");
      validJson = !!toolCall;
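      // Basic schema validation: every key listed in parameters.required must be present.
      // (Object.keys on the tool definition would return name/description/parameters instead.)
      if (toolCall?.function?.arguments) {
        const args = toolCall.function.arguments;
        const schema = testCase.expectedSchema as {
          parameters?: { required?: string[] };
        };
        const requiredKeys = schema.parameters?.required ?? [];
        schemaRespected = requiredKeys.every((k) => k in args);
      }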
    } catch {
      received = "parse error";
    }

    results.push({
      case: testCase.prompt.slice(0, 50),
      expected: testCase.expectedSchema,
      received,
      validJson,
      schemaRespected,
      latencyMs: latency,
    });
  }

  return results;
}

My minimum acceptance criteria for any tool calling model in a real system:

Metric | Minimum acceptable | Why
Valid JSON | 99%+ | A parse error in production breaks the entire flow
Schema respected | 95%+ | Wrong arguments are silently dangerous
p95 latency | < 500ms local | If it's slower than an external API, you've lost the point
Tool name hallucination | 0% | An invented name is a non-recoverable error
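
The criteria above can be turned into a single pass/fail check over a batch of results. This is a sketch under assumptions: `CaseResult` is a hypothetical record that adds a `toolNameKnown` flag to what the harness collects, and the p95 uses the nearest-rank method, which is one of several common percentile definitions.

```typescript
// Hypothetical per-case record: the harness fields plus a tool-name check.
interface CaseResult {
  validJson: boolean;
  schemaRespected: boolean;
  toolNameKnown: boolean; // false when the model invented a tool name
  latencyMs: number;
}

interface Summary {
  validJsonRate: number;
  schemaRate: number;
  hallucinationRate: number;
  p95LatencyMs: number;
  passes: boolean;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

function summarize(results: CaseResult[]): Summary {
  if (results.length === 0) throw new Error("no results to summarize");
  const rate = (pred: (r: CaseResult) => boolean): number =>
    results.filter(pred).length / results.length;

  const validJsonRate = rate((r) => r.validJson);
  const schemaRate = rate((r) => r.schemaRespected);
  const hallucinationRate = rate((r) => !r.toolNameKnown);
  const p95LatencyMs = percentile(results.map((r) => r.latencyMs), 95);

  // Thresholds copied from the table above.
  const passes =
    validJsonRate >= 0.99 &&
    schemaRate >= 0.95 &&
    hallucinationRate === 0 &&
    p95LatencyMs < 500;

  return { validJsonRate, schemaRate, hallucinationRate, p95LatencyMs, passes };
}
```

With fewer than about a hundred cases, a 99% threshold means a single failure sinks the run, so the size of the test suite matters as much as the thresholds.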

The limits that the hype doesn't mention

There are three limitations that don't show up in the headlines and that I consider essential before betting on a distilled model in a real system.

First, the teacher's distribution defines the ceiling. If Gemini has biases in how it generates tool calls — certain argument patterns, certain naming conventions — the student inherits them unfiltered. This matters if your API has conventions that drift from Gemini's style.

Second, generalization to unseen schemas is an open question. A distilled model can be excellent on the patterns it learned and brittle against complex schemas with anyOf, nested $refs, or conditional validations. You have to test it explicitly against your own schemas — don't assume the general benchmark applies.
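
One way to test this explicitly is to tag each schema in your suite with the features it uses and then compare pass rates per bucket. The sketch below is a rough heuristic, not a JSON Schema parser: `inspectSchema` and `SchemaFeatures` are hypothetical names, and a property that happens to be called `anyOf` or `if` would be misread as a keyword.

```typescript
interface SchemaFeatures {
  usesAnyOf: boolean;       // anyOf or oneOf appears anywhere
  usesRef: boolean;         // $ref appears anywhere
  usesConditional: boolean; // an "if" keyword appears anywhere
  maxPropertyDepth: number; // how many "properties" levels are nested
}

// Walks the schema as plain JSON and records which constructs show up.
function inspectSchema(schema: unknown): SchemaFeatures {
  const features: SchemaFeatures = {
    usesAnyOf: false,
    usesRef: false,
    usesConditional: false,
    maxPropertyDepth: 0,
  };

  const visit = (node: unknown, depth: number): void => {
    if (Array.isArray(node)) {
      node.forEach((child) => visit(child, depth));
      return;
    }
    if (typeof node !== "object" || node === null) return;

    for (const [key, value] of Object.entries(node)) {
      if (key === "anyOf" || key === "oneOf") features.usesAnyOf = true;
      if (key === "$ref") features.usesRef = true;
      if (key === "if") features.usesConditional = true;
      const nextDepth = key === "properties" ? depth + 1 : depth;
      features.maxPropertyDepth = Math.max(features.maxPropertyDepth, nextDepth);
      visit(value, nextDepth);
    }
  };

  visit(schema, 0);
  return features;
}
```

If the model passes every flat schema and fails most of the ones flagged with `usesAnyOf` or a `maxPropertyDepth` above 1, you have located the limit instead of guessing at it.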

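Third, 26M parameters likely means limited capacity to make use of long prompts. Parameter count does not fix the context window by itself, but a model this small has little room to keep track of many tool definitions at once. In systems where the prompt includes many tools simultaneously, which is common in backends with dozens of endpoints exposed as tools, degradation can be significant. That's a hypothesis to validate, not assume.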
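
To validate that hypothesis, you can run the same prompt against tool lists of increasing size and see where accuracy drops. A minimal sketch, with hypothetical helper names (`buildToolSets`, `firstSizeBelow`) and the assumption that you measure accuracy per size yourself with your own harness:

```typescript
interface Tool {
  name: string;
  description: string;
}

// Build tool lists of increasing size that all contain the target tool.
// Sizes larger than distractors.length + 1 are capped by the available distractors.
function buildToolSets(target: Tool, distractors: Tool[], sizes: number[]): Tool[][] {
  return sizes.map((size) => {
    const extra = distractors.slice(0, Math.max(0, size - 1));
    // Put the target in the middle so it is not favored by list position.
    const middle = Math.floor(extra.length / 2);
    return [...extra.slice(0, middle), target, ...extra.slice(middle)];
  });
}

// Given measured accuracy per tool count, return the smallest count that
// falls below the threshold, or null if none does.
function firstSizeBelow(
  accuracyBySize: Map<number, number>,
  threshold: number
): number | null {
  const sizes = Array.from(accuracyBySize.keys()).sort((a, b) => a - b);
  for (const size of sizes) {
    if ((accuracyBySize.get(size) ?? 0) < threshold) return size;
  }
  return null;
}
```

The number that comes out of `firstSizeBelow` is a practical ceiling for how many tools you can expose to the model in one request in your own system.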

None of this invalidates the project. It puts it in context. The same discipline I applied when reviewing pnpm workspaces cache issues in CI applies here: understand the limit first, then decide if it fits.


Where Needle makes sense and where it doesn't

Scenarios where it makes sense to try Needle:

  • Local agent pipelines where network latency to external APIs is the bottleneck
  • Edge devices or resource-constrained environments where a 26M model fits in memory comfortably
  • Systems with a bounded and stable set of tools — not dozens of shifting schemas
  • As a local fallback when external APIs are unavailable

Scenarios where it probably doesn't cut it:

  • Systems where reasoning between tool calling steps is complex — deciding when to call which tool, not just how to call it
  • APIs with deeply nested or polymorphic schemas
  • Flows where long conversational context matters: a 26M-parameter model's limited capacity for long context is going to hurt
  • Environments that need auditable safety guarantees — a privately distilled model is a considerably more opaque box

The tension I described in the Spring Boot Actuator in production post applies here in a different form: the comfort of "it works in the demo" can hide risks that only show up under load or with unexpected inputs.


What this signals for the small model ecosystem

The uncomfortable thing about Needle isn't the model itself. It's what it confirms: functional specialization is going to pressure the hegemony of large general models on structured tasks.

Tool calling, intent classification, entity extraction with fixed schemas — these are tasks where a well-trained distilled model can beat GPT-4 or Claude on cost and latency without sacrificing reliability. That changes the architecture calculation.

In my current stack with Claude Code for complex reasoning and Ollama for local tasks, there's a gap exactly where Needle would aim: the tool router that decides which function to call and with what arguments, without needing the overhead of a 70B model for that. I'm not saying I'll adopt it tomorrow. I'm saying the category makes sense and the experiment deserves follow-through.

Same as when I evaluated Jakarta EE vs Spring Boot tradeoffs or compared package managers in real monorepos, the honest answer isn't "adopt it now" or "ignore it" — it's "test it against your own criteria before committing."


FAQ: Needle, distillation, and tool calling in small models

What exactly is model distillation in the LLM context?
It's a process where a large model (teacher) generates a dataset of correct behavior — in this case, well-formed tool calling examples — which is used to train a small model (student). The student learns to imitate the teacher's output distribution on the specific tasks it was distilled for, without needing the teacher's full architecture.

Is 26M parameters enough for reliable tool calling?
Depends on the scope. For a bounded set of tools with simple schemas, probably yes. For systems with dozens of complex tools, long contexts, or multi-step reasoning, it's an open hypothesis. The project's own benchmark is optimistic; validation against your own schemas is mandatory before betting on it.

How do I test it locally without risking a production system?
With Ollama, if the model is available in the registry, it's as simple as ollama pull [name] and then evaluating with your own script against the schemas you already use. The validation checklist in this post is a starting point. Always against your real tools — never against the README examples.

What's the practical difference between Needle and using function calling from OpenAI or Anthropic?
Latency, cost, and privacy. A local model has no network RTT, no per-token cost, and doesn't send your tool schemas to an external API. The tradeoff is that reliability depends entirely on the local model's training quality, without the backing of a provider with an SLA.

Is it worth it for an individual stack or only for companies with infrastructure?
A 26M model runs on a MacBook with 8GB of RAM without drama. This isn't enterprise infrastructure. If you're already using Ollama for other tasks — like I am — adding a specialized model is operationally trivial. The real cost is evaluation time, not hardware.

What happens if the model hallucinates a tool name that doesn't exist in my system?
That's the worst case and you have to design for it as an expected failure. The routing layer that consumes the model's output has to validate that the tool call name corresponds to a registered tool before executing anything. If it doesn't exist, the error has to be explicit and not silent. This is basic defensive design, independent of which model you use.
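
A minimal sketch of that routing layer follows. The registry contents, `UnknownToolError`, `resolveTool`, and `dispatch` are all hypothetical names for illustration; the only idea taken from the answer above is that the name is checked against registered tools before anything runs.

```typescript
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

class UnknownToolError extends Error {
  readonly toolName: string;
  constructor(toolName: string) {
    super(`model requested unregistered tool: ${toolName}`);
    this.name = "UnknownToolError";
    this.toolName = toolName;
  }
}

// Hypothetical registry: the only names the router is allowed to execute.
const registry = new Map<string, ToolHandler>([
  ["search_product", async (args) => ({ productId: args["product_id"], found: true })],
]);

// Resolve first, execute second: an unknown name fails before any side effect.
function resolveTool(name: string): ToolHandler {
  const handler = registry.get(name);
  if (!handler) throw new UnknownToolError(name);
  return handler;
}

async function dispatch(call: {
  name: string;
  arguments: Record<string, unknown>;
}): Promise<unknown> {
  const handler = resolveTool(call.name);
  return handler(call.arguments);
}
```

Because the error carries the offending name, the caller can log it, count it toward the hallucination metric, or re-prompt the model, instead of failing silently.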


Conclusion: test it with your eyes open

I'm not going to say Needle is the future or that it's noise. My position is more specific: functional distillation of large model behavior into small specialized models is a legitimate direction, and tool calling is a use case where it makes genuine technical sense.

What I don't buy is enthusiasm without friction. A 26M model has real limits around context, generalization, and reliability on unseen schemas. Those limits don't appear in the HN post and they will appear in production.

My concrete recommendation: if you have an agent pipeline with a stable set of tools and latency is a problem, build a test harness with your own schemas, run it against the acceptance criteria in this post, and measure. If it clears 99% valid JSON and 95% schema respected on your own cases, you have something useful. If not, you know exactly why.

That's more useful than any benchmark someone else wrote.

Are you using local models for tool calling? Tell me at juanchi.dev what stack you built and where you hit the limits.


This article was originally published on juanchi.dev