惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Stack Overflow Blog
Stack Overflow Blog
酷 壳 – CoolShell
酷 壳 – CoolShell
P
Proofpoint News Feed
Apple Machine Learning Research
Apple Machine Learning Research
T
Tailwind CSS Blog
罗磊的独立博客
F
Future of Privacy Forum
The Register - Security
The Register - Security
MyScale Blog
MyScale Blog
P
Privacy & Cybersecurity Law Blog
V
Visual Studio Blog
T
Tenable Blog
F
Fortinet All Blogs
D
Docker
V
Vulnerabilities – Threatpost
Cyberwarzone
Cyberwarzone
A
Arctic Wolf
T
Threat Research - Cisco Blogs
I
Intezer
T
Tor Project blog
大猫的无限游戏
大猫的无限游戏
MongoDB | Blog
MongoDB | Blog
博客园 - 司徒正美
AWS News Blog
AWS News Blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
G
GRAHAM CLULEY
T
Threatpost
美团技术团队
K
Kaspersky official blog
F
Fox-IT International blog
Hugging Face - Blog
Hugging Face - Blog
Vercel News
Vercel News
P
Palo Alto Networks Blog
Google DeepMind News
Google DeepMind News
T
The Blog of Author Tim Ferriss
S
Schneier on Security
腾讯CDC
Cisco Talos Blog
Cisco Talos Blog
C
Check Point Blog
博客园 - 叶小钗
I
InfoQ
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
Blog — PlanetScale
Blog — PlanetScale
F
Full Disclosure
T
True Tiger Recordings
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
E
Exploit-DB.com RSS Feed
L
LINUX DO - 热门话题
J
Java Code Geeks
C
CERT Recently Published Vulnerability Notes

DEV Community

Looking for 4 people to build something weird with me The False Positive Tax: a 1:1 TP:FP analysis of eslint-plugin-security What's new in Data Preprocessor 1.5.x — R codegen, Robust Scaler, and a deadlock post-mortem How I self-hosted my Flask app on an old laptop for almost free I built a free DSA interview prep site because I was tired of the existing options I built an AI agent that migrates Next.js Pages Router to App Router Prisma Query Logging and PostgreSQL: Where the ORM Ends and the Database Begins Prisma query logging y PostgreSQL: dónde termina el ORM y empieza la base From Browser to Server : The Journey of an HTTP Request (Demystifying the Web’s Infrastructure) Santa Augmentcode Intent Ep.6 I Benchmarked 17 ESLint Security Plugins. Only One Found Every Vulnerability. How to Build a High-Performance Image Optimization Pipeline in 5 Minutes 50 Linux Commands Every DevOps Engineer Must Know Less Toil, More Flow - Automating the Path from Request to Implementation The Code Review Checklist I Actually Use How I run a small blog on Astro 5 + Content Collections Git: Best Practices for Professionals How IBM Bob Became My Everyday Coding Companion Solana Passkey Wallet: Replacing Seed Phrases with SIMD-0075 I built a small browser puzzle game about arrows I wrapped Claude Code in a zsh function. Here's every decision I almost got wrong. Mobile Game Optimization: A Unity Developer's Checklist Git: Best Practices for Beginners Three days I lost chasing a ghost that was already dead on disk Why Too Many Parts Hurt ClickHouse Performance Guardrails for Agent Output: Pluggable Validation Before and After LLM Calls Gemma Forge: Local AI Without the Setup Wall From Half‑dead Prototype to Local‑Only AI Medical Assistant: Rewiring MedClinic with GitHub Copilot Runninig a forkbomb in Jenkins What’s Actually Happening When You Use Git Preventing Recursive Tool Loops in LangChain Agents Building a Rock-Paper-Scissors CLI with TypeScript — Union Types, Conditionals, and Jest Your AI Coding Agent Wastes 80% of Its Context. Fixed That with Graph Theory. Why Flutter Has Become the Go-To Framework for Fintech App Development We built a scripting language just for AI agents. Here's why. Stop building AI inboxes. Build decision layers instead. Meme Monday Why I Built @editora/ui-react? Are AI tools the next level of abstraction in software development? Identity on Solana: Your Wallet Is Your Account One API Call Changed Everything The Internet Career Nobody Talks About Enough: What Is DevRel? Solar Panel Wiring Diagram: Series vs Parallel Hello everyone! Glad to join the dev.to community I Built an AI Agent That Tailors My Resume - Here's How Agents Actually Work I Built a WhatsApp OTP + AI Chatbot Platform for African Businesses MTP Explained — And Why It Matters for Android on Mac Most Beginners Learn Full-Stack Development Backwards GitHub Glow-Up: Open Source, READMEs, Badges, Streaks, Git and gh CLI System Design Cheat Sheet: Concepts Every Developer Should Know Are Junior Developer Roles Actually Dying? A Fresher's Honest Take Using DigitalOcean Droplets as Ephemeral Sandboxes for AI Agents I built a VSCode extension that visualises your code navigation as a call tree — made for legacy codebase pain Vite predev/prebuild: chaining scripts without losing your mind A website to save you from messy browser tabs Dear Web2 Developer... Solana is here calling Postgres JSONB indexes: GIN vs BTREE on the same column The $5 AI That Remembers Everything What are your goals for the week? #180 Zettelkasten for Developers: A Practical Method That Works OpenClaw vs Hermes Agent: Stars, Downloads & Usage 2026 `act` vs. `waitFor` Global Teams Don’t Struggle With Time Zones. They Struggle With Context Python as a JavaScript Dev $5.4 Billion in Damage. 8.5 Million Machines Down. Three YAML Controls Would Have Prevented It. Here's the Structural Analysis. 🚫 Stop Using PN532 V1 for Your NFC Projects (Real Debugging Experience) Probabilistic Graph Neural Inference for smart agriculture microgrid orchestration for extreme data sparsity scenarios Inference Is Becoming the New Steady-State Cost Center Why AI-Generated Code Is Always Good Enough — And Never Great I built a dark admin dashboard template in HTML — no React, no npm, just pure HTML What is the Difference Between Lattice-Based and Hash-Based Signatures? Next.js App Router caching: revalidate, dynamic, and no-store without the folklore Next.js App Router caching: revalidate, dynamic y no-store sin folklore I built Stashly — a full-stack content manager with a rich text editor published: false tags: react, node, mongodb, typescript Why I Started Building React Projects Instead of Just Watching Tutorials ? Every Tool Eventually Becomes Tuesday Nobody Warns You That Real Software Engineering Feels Chaotic Tích hợp VNPay, Stripe trong Odoo 19 BeautifulSoup and Requests for Web Scraping With Python: When Simple Still Works I Was Stuck Debugging React — Then Developer Tools Changed It Buck Converter Ripple: Sizing the Inductor and Capacitor With Confidence AWS Just Made Its MCP Server Generally Available. Here's What It Actually Gives AI Agents. RAMPART Tests Your AI Agents in Dev. What Catches Malicious Tool Calls in Production? Vibe Team Software Engineering: What a Real AI Human Dev Team Workflow Actually Looks Like An npm Package for AI Agent Orchestration Just Shipped With Its Front Door Unlocked. Here's What the CVE Actually Reveals. Microsoft Foundry Just Added CI/CD for AI Agents. Here's What That Actually Changes. The Best Career Insurance Is a Tech Event You Don't Want to Attend Your GitHub Profile Already Tells Recruiters More Than Your Resume. Most Devs Just Don't Surface It. How to Add Execution Budgets to OpenAI Agents SDK Binary Tree Interview Problems: 6 Traversal Patterns, 15 Problems We trained a personal voice DoRA on Qwen3-8B for $1.50 — beat stock model 100% in blind A/B Stop Leaking API Keys: Why I Built a Local-First Vault for Developers 🔐 RAG Explained: How Retrieval-Augmented Generation Actually Works I Built a Fast Async JioSaavn API Wrapper in Python 🎧 chown & chgrp Deploying Your First App on Kubernetes: A Beginner's Guide (Minikube & Kind) Logs in code It's called a PR "review" for a reason DePIN GPU Market: The Failed Job Receipt Developers Should Demand Why Your AI Agent Monitoring is Wrong (And How to Fix It)
Building a Local-Only RAG System with Ollama and TypeScript
Pavel Espiti · 2026-05-25 · via DEV Community

Pavel Espitia

Building a Local-Only RAG System with Ollama and TypeScript

Most RAG tutorials send your private documents to OpenAI. Here's how to keep them on your laptop.

This post walks through a complete Retrieval-Augmented Generation pipeline that runs entirely on your machine. No API keys, no third-party calls, no monthly bill. Two hundred lines of TypeScript and a single binary.

What you'll build

A command-line tool that:

  1. Indexes a folder of .md or .txt files into a local vector store.
  2. Answers questions about those files using a local LLM.
  3. Cites which documents the answer came from.

By the end, you'll be able to point it at your engineering wiki, your personal notes, or your codebase, and ask questions in natural language without anything leaving your machine.

The stack

  • Ollama — runs the LLM and the embedding model.
  • @xenova/transformers — fallback embedding library if you don't want a second Ollama model.
  • sqlite-vec — SQLite extension that adds vector similarity search. Tiny, fast, no separate database server.
  • TypeScript + Node 22 — gluing it together.

Why SQLite over Chroma or Qdrant? For collections under a million chunks, SQLite is faster, simpler to deploy, and doesn't need a daemon. Your "vector database" is one file.

Setup

ollama pull nomic-embed-text       # the embedding model
ollama pull qwen2.5:7b             # the answer model

Enter fullscreen mode Exit fullscreen mode

pnpm add better-sqlite3 sqlite-vec

Enter fullscreen mode Exit fullscreen mode

Step 1: chunk and embed documents

import fs from "node:fs";
import path from "node:path";

function chunk(text: string, size = 800, overlap = 100): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let buffer = "";
  for (const s of sentences) {
    if ((buffer + " " + s).length > size && buffer) {
      chunks.push(buffer.trim());
      buffer = buffer.slice(-overlap) + " " + s;
    } else {
      buffer = buffer ? buffer + " " + s : s;
    }
  }
  if (buffer) chunks.push(buffer.trim());
  return chunks;
}

async function embed(text: string): Promise<number[]> {
  const r = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const json = await r.json();
  return json.embedding;
}

Enter fullscreen mode Exit fullscreen mode

nomic-embed-text returns 768-dimensional vectors. Fast enough that you can re-index a thousand-document corpus in a few minutes.

Step 2: store in SQLite

import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";

const db = new Database("rag.db");
sqliteVec.load(db);

db.exec(`
  CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    content TEXT NOT NULL
  );
  CREATE VIRTUAL TABLE IF NOT EXISTS vec_chunks USING vec0(
    id INTEGER PRIMARY KEY,
    embedding FLOAT[768]
  );
`);

async function indexFile(filePath: string) {
  const text = fs.readFileSync(filePath, "utf8");
  const pieces = chunk(text);
  for (const piece of pieces) {
    const insertChunk = db.prepare(
      "INSERT INTO chunks (source, content) VALUES (?, ?)"
    );
    const result = insertChunk.run(filePath, piece);
    const vec = await embed(piece);
    db.prepare(
      "INSERT INTO vec_chunks (id, embedding) VALUES (?, ?)"
    ).run(result.lastInsertRowid, JSON.stringify(vec));
  }
}

Enter fullscreen mode Exit fullscreen mode

Step 3: search

async function search(query: string, k = 4) {
  const queryVec = await embed(query);
  const rows = db.prepare(`
    SELECT chunks.source, chunks.content, vec_chunks.distance
    FROM vec_chunks
    JOIN chunks ON chunks.id = vec_chunks.id
    WHERE vec_chunks.embedding MATCH ?
    ORDER BY distance
    LIMIT ?
  `).all(JSON.stringify(queryVec), k) as Array<{
    source: string;
    content: string;
    distance: number;
  }>;
  return rows;
}

Enter fullscreen mode Exit fullscreen mode

MATCH triggers sqlite-vec's cosine similarity. Sub-millisecond on small corpora.

Step 4: ask the LLM

async function ask(question: string) {
  const matches = await search(question, 4);

  const context = matches
    .map((m, i) => `[${i + 1}] ${m.source}\n${m.content}`)
    .join("\n\n---\n\n");

  const prompt = `Answer the question using only the context provided.
If the answer is not in the context, say so.
Cite sources by their number in square brackets.

CONTEXT:
${context}

QUESTION: ${question}

ANSWER:`;

  const r = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    body: JSON.stringify({
      model: "qwen2.5:7b",
      messages: [{ role: "user", content: prompt }],
      stream: false,
    }),
  });
  const json = await r.json();
  return {
    answer: json.choices[0].message.content,
    sources: matches.map((m) => m.source),
  };
}

Enter fullscreen mode Exit fullscreen mode

Putting it together

// Index a folder
const files = fs.readdirSync("./notes").map((f) => path.join("./notes", f));
for (const f of files) await indexFile(f);

// Ask
const result = await ask("What did we decide about the auth refactor?");
console.log(result.answer);
console.log("Sources:", result.sources);

Enter fullscreen mode Exit fullscreen mode

Total runtime, indexing 500 markdown files: about three minutes on an M2 MacBook. Per-question latency: under two seconds.

Where this matters

If your team's documentation has grown past the point where anyone reads it cover to cover (about a hundred pages), local RAG turns that wiki back into something useful. Same applies to:

  • Codebases — answer "where is the rate limiter implemented?"
  • Customer support archives — answer "what's our refund policy?"
  • Research notes — answer "what did I write about X six months ago?"
  • Legal documents — answer "what does our MSA say about indemnification?"

Last bullet matters: every legal-tech startup right now is building a cloud version of this. Yours runs on your laptop.

Tuning that actually pays off

  • Chunk size 800-1200 chars is the sweet spot. Smaller chunks lose context. Larger ones dilute relevance.
  • Overlap 10-15 percent of chunk size catches sentences split mid-thought.
  • Re-rank top-k with a cross-encoder if precision matters more than speed. Adds 100ms but often jumps relevance from 70 to 90 percent.
  • Cache embeddings keyed by content hash so re-indexing is incremental.

What's next

The previous post in this series covered function calling. Combining function calling with RAG gives you a local agent that can read your documents and take actions: "draft an email to legal summarising what our MSA says about data residency" — read MSA chunks, compose draft, call the email tool.

That's a real assistant. And nothing leaves your machine.

Next post: streaming Ollama responses through Server-Sent Events in Next.js, the production pattern for live UIs.