Building KernelMind Part 2: Hybrid Retrieval, Reranking, and Actually Retrieving Useful Code
Ishaan Mavin · 2026-05-18 · via DEV Community

By the end of the first phase of KernelMind, the repository had stopped behaving like disconnected text. Functions now had identities and relationships attached to them. The graph architecture was finally stable enough to represent execution flow across the repository.

The next challenge was obvious:

How do I retrieve the right parts of this graph efficiently?

That was where retrieval engineering began.

Initially, I shifted the retrieval pipeline to operate directly on chunks retrieved from FAISS instead of querying raw documents from MongoDB. The idea was fairly simple:

  • use embeddings to retrieve likely entry points
  • then use the graph to reconstruct surrounding execution context

That combination became the foundation of KernelMind’s retrieval pipeline.

The First Retrieval Pipeline

The naive version of retrieval looked roughly like this:

all-MiniLM-L6-v2 + FAISS

I intentionally started lightweight because I wanted fast local experimentation while debugging retrieval behavior. At this stage, I was not trying to build the perfect retriever. I just wanted something fast enough to:

  • retrieve semantically relevant chunks
  • test graph expansion
  • debug execution flow reconstruction
  • and iterate quickly without destroying my laptop
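
To make the "retrieve likely entry points" step concrete, here is a minimal pure-Python sketch of what a flat vector index does: normalize the vectors, take inner products, keep the top k. It is not KernelMind's code. The chunk ids and the 3-dimensional vectors are made up for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and a real setup would let FAISS do the search.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk ids whose vectors are most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for chunk_id, vec in chunk_vecs.items():
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, chunk_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical chunk ids and toy embeddings, for illustration only.
chunk_vecs = {
    "auth/login.py::login": [0.9, 0.1, 0.0],
    "auth/tokens.py::create_token": [0.8, 0.3, 0.1],
    "utils/dates.py::parse_date": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question

print(top_k(query_vec, chunk_vecs))
```

The two auth chunks come back ahead of the date helper because their vectors point in nearly the same direction as the query, which is all "semantic similarity" means at this layer.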

And honestly, embeddings worked reasonably well at first.

Questions like:

How does authentication work?

usually surfaced relevant code. But implementation-heavy queries struggled badly.

For example:

query: cookies

might retrieve semantically similar request-handling logic instead of the actual cookie implementation.

That was the first moment I realized something important:

semantic similarity alone is not enough for repositories.

Because repositories rely heavily on exact operational language, like:

* imports
* function names
* config values
* error strings
* middleware identifiers

These are exactly the things embeddings sometimes blur together semantically.

BM25 vs Embeddings

This was where BM25 entered the system. After reading more about BM25, my rough mental model became:

embeddings understand meaning, BM25 understands exact language.

BM25 is a lexical retrieval algorithm that ranks documents using exact token overlap, token rarity, and frequency instead of semantic similarity.

That turned out to be extremely useful for repositories.

For example:

create_user()
update_user()
delete_user()

all belong to the same semantic neighborhood. But operationally, they are completely different. Embeddings handled such conceptual understanding well.

BM25 handled lexical precision much better.
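
A small sketch shows why lexical scoring separates these siblings. This is a from-scratch Okapi BM25 with a common non-negative IDF variant, not the library or settings KernelMind uses. The tokenizer, the `k1` and `b` defaults, and the three toy chunks are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep identifiers such as delete_user as single tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight than terms found everywhere.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy chunks, invented for illustration.
docs = [
    "def create_user(db, payload): return db.insert(payload)",
    "def update_user(db, user_id, payload): return db.update(user_id, payload)",
    "def delete_user(db, user_id): return db.remove(user_id)",
]
scores = bm25_scores("delete_user", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Only the chunk that literally contains `delete_user` gets a non-zero score, while an embedding model would place all three close together.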

Neither alone was enough, so KernelMind evolved into hybrid retrieval. Instead of replacing embeddings entirely, I started combining both retrieval signals using Reciprocal Rank Fusion (a fancy term for merging two ranked result lists into one).

Reciprocal Rank Fusion (RRF) helped combine both retrieval systems by rewarding chunks that consistently appeared near the top across both FAISS and BM25 results. That gave KernelMind a much more stable retrieval signal than relying on either retriever independently.

The retrieval pipeline slowly evolved into:

Embedding Retrieval + BM25 Retrieval + Reciprocal Rank Fusion

This improved retrieval quality almost immediately. The embedding retriever surfaced semantically relevant chunks. BM25 reinforced exact implementation-level details.

And the fusion layer combined both into a much stronger retrieval baseline.
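
For readers who have not met RRF before, here is a minimal sketch of the fusion step. Each ranked list contributes `1 / (k + rank)` to a chunk's score, so chunks near the top of both lists win. The value `k=60` is a conventional default rather than a number taken from KernelMind, and the chunk names are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk ids: each list adds 1 / (k + rank) to a chunk's score."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=lambda cid: fused[cid], reverse=True)

# Hypothetical result lists for the query "cookies".
faiss_hits = ["handle_request", "set_cookie", "parse_headers"]
bm25_hits = ["set_cookie", "cookie_jar", "handle_request"]

print(reciprocal_rank_fusion([faiss_hits, bm25_hits]))
```

Here `set_cookie` ends up first because it is ranked second by the embedding list and first by BM25, which beats `handle_request` (first and third). RRF only looks at ranks, so the two retrievers never need comparable score scales.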

Graph Expansion Over Retrieved Chunks

Once hybrid retrieval stabilized, I started layering the graph architecture over the retrieved results themselves. This was one of the biggest shifts in the system.

Initially, retrieval still operated mostly on isolated chunks returned from FAISS and BM25.

But repositories rarely store logic in one place.

Authentication systems, for example, are spread across routes, middleware, services, validators, token handlers, configuration, and dependency layers.

Retrieving one isolated chunk was often not enough to reconstruct execution flow.

So instead of treating retrieval results as final answers, I started treating them as entry points into the graph.

The pipeline became:

Retrieve relevant chunks
↓
Expand neighboring execution context
↓
Rank expanded graph nodes

This improved workflow reconstruction dramatically.

Questions like:

How does login create the access token?

no longer returned disconnected helper functions. The graph expansion layer started surfacing:

* login routes
* auth middleware
* token creation
* validation flows
* session handling

as connected execution context. This was the first time I started seeing actual repository-aware chunks being exposed in the pipeline.
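
The "entry points into the graph" idea can be sketched as a bounded breadth-first walk from the retrieved seeds. This is an illustration under assumptions, not KernelMind's implementation: the call graph, the function names, the `max_hops` limit, and ranking by hop distance are all hypothetical choices made for the example.

```python
from collections import deque

def expand_seeds(graph, seeds, max_hops=1):
    """Breadth-first expansion: map each reachable node to its hop distance from a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not walk further than the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

# Hypothetical call graph: each function maps to the functions it calls.
call_graph = {
    "login_route": ["authenticate", "create_access_token"],
    "authenticate": ["get_user_by_email", "verify_password"],
    "create_access_token": ["jwt_encode"],
    "parse_date": [],
}

expanded = expand_seeds(call_graph, ["login_route"], max_hops=1)
# Rank by hop distance so the retrieved seeds stay ahead of their neighbors.
ranked = sorted(expanded, key=lambda node: expanded[node])
print(ranked)
```

Raising `max_hops` pulls in more of the workflow but also more noise, which is the trade-off the reranking stage later has to clean up.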

Integrating the Cross Encoder

Even with hybrid retrieval and the graph architecture (from the first post), the pipeline still produced noisy candidates. Sibling-operation pollution became a recurring issue:

create_user()
update_user()
delete_user()
read_user()

would cluster together semantically even when only one of these actually answered the question. That was where cross-encoder reranking entered the system. I started using:

cross-encoder/ms-marco-MiniLM-L-6-v2

Initially, I didn't really know how a cross-encoder worked or whether it would be useful. So I read up on it. The short version: BM25 scores a chunk by its literal lexical overlap with the query (great for exact matches), whereas a cross-encoder would take both:

(query + chunk)

together and directly predict relevance from the joint input, rather than comparing two independently computed representations. That distinction mattered a lot. The reranker became really good at cleaning up semantically adjacent but incorrect retrievals, especially after graph expansion widened the context.

Questions like:

How does login create the access token?

started consistently surfacing the right chunks instead of unrelated utility code nearby in semantic space.

The reranker essentially became a way to restore precision after graph expansion.
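
The shape of the rerank step can be sketched without loading a model. In the snippet below, `score_pairs` is the slot where a cross-encoder would go; with the sentence-transformers library that could be `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict`, which accepts a list of (query, passage) pairs. The `toy_scorer` used here is only a word-overlap stand-in so the example runs offline, and the candidate chunks are invented.

```python
import re

def rerank(query, chunks, score_pairs, top_n=2):
    """Score each (query, chunk) pair jointly and keep the top_n highest-scoring chunks."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_n]]

def toy_scorer(pairs):
    """Stand-in scorer (shared words). NOT a cross-encoder; it only keeps the sketch runnable."""
    scores = []
    for query, chunk in pairs:
        q_words = set(re.findall(r"[a-z]+", query.lower()))
        c_words = set(re.findall(r"[a-z]+", chunk.lower()))
        scores.append(len(q_words & c_words))
    return scores

# Hypothetical candidates left over after graph expansion.
candidates = [
    "def delete_user(user_id): remove the user row",
    "def create_access_token(subject): build the jwt access token after login",
    "def update_user(user_id, data): change user fields",
]

top = rerank("how does login create the access token", candidates, toy_scorer, top_n=1)
print(top)
```

Because the scorer sees the query and the chunk together, the sibling `*_user` functions drop out even though they sit in the same neighborhood as the token code.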

Choosing The Generation Model

Once retrieval quality became stable enough, I finally started experimenting more seriously with answer generation. I had all these chunks, and all the metadata that came with them, but for a human to make sense of it, it had to be in a proper readable format. This is where LLMs came in.

I tested several local and hosted models during development:

  • GPT-4o-mini
  • GPT-5-nano
  • Qwen2.5-Coder
  • and Sarvam’s absurdly generous free 105B model, which occasionally spoke enough sweet architectural encouragement into my ears for me to add another retrieval layer at 2 AM.

Eventually, Sarvam's 105B-parameter model became the primary generation model because it gave me very good quality results FOR FREE and did not try to fry my GPU like the local models.

How the Architecture Changed

Originally, KernelMind looked something like this:

Embeddings → Retrieval → Answer

Eventually, it evolved into:

Query
↓
BM25 Retrieval + Embedding Retrieval
↓
Reciprocal Rank Fusion (RRF)
↓
Query-Aware Seed Reranking
↓
Graph Expansion + Graph-Aware Ranking
↓
Cross-Encoder Reranking
↓
Context Building
↓
Answer Generation

But none of this architecture was pre-planned. Almost every layer was built because I observed a failure in the previous one:

  • embeddings missed identifiers
  • retrieval lost workflow context
  • graph expansion introduced noise
  • re-ranking restored precision
  • orchestration improved grounding

After a little bit of tuning and prompt engineering, the final answers started looking like this:

Q. How is login handled in the fastapi library?
A. The login flow begins in `login_access_token()` inside `backend/app/api/routes/login.py`.

When a POST request is sent to the login endpoint, FastAPI injects the submitted credentials through `OAuth2PasswordRequestForm`. The route then calls `crud.authenticate()` to validate the username and password against the database.

If authentication fails or the user is inactive, the API raises an HTTP 400 error. If authentication succeeds, the system generates a JWT access token using `security.create_access_token()`. The token includes the user ID and an expiration time configured through `ACCESS_TOKEN_EXPIRE_MINUTES`.

Finally, the endpoint returns a `Token` response containing the generated access token.

The retrieved workflow also shows that authenticated endpoints like `test_token()` depend on the validity of this token through FastAPI dependency injection, linking token generation directly to downstream protected routes.

My project evolved incrementally through debugging and experimentation rather than some giant architectural master plan. And once answer generation stabilized, a much harder question appeared:

How do I actually KNOW whether the system is improving?

Because retrieval systems are easy to overestimate when you only test them manually. That eventually led into the next phase of the project:

  • evaluation
  • RAGAS benchmarking
  • retrieval ablations

and figuring out whether the architecture changes were genuinely improving the system or just looking impressive during demos.