From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Torkian · 2026-05-23 · via DEV Community

In Part 1, we built a USC campus assistant by pasting a five-line knowledge base directly into the prompt. That works when "the data" fits in your head. It stops being cute the moment the campus handbook, club docs, and workshop notes all want a seat at the same prompt window.

The fix is retrieval — store the chunks once, and at query time pull only the few that look relevant. That's what RAG (Retrieval-Augmented Generation) actually means once you strip away the marketing.

This post takes the assistant from Part 1 and bolts on a real retriever, using NVIDIA's hosted embedding model. No vector database, no LangChain, no abstraction layer. A Python list and NumPy are enough to understand what's actually happening. Once you've seen the moving parts, swapping in pgvector or Pinecone later is a fifteen-minute job.

I'm B Torkian, NVIDIA Developer Champion at USC. Same workshop series, same campus, one more capability added.


What you're adding

User question → embed query → compare to stored chunks → pick top-k → send only those to the LLM → answer


The model call itself barely changes. The work is in steps 2–4 of that flow: embed the question as a vector, compare it against the stored chunk vectors, and return the closest chunks.


Why the manual approach from Part 1 breaks

In Part 1, the entire knowledge base sat inside the prompt:

campus_info = """
The USC AI Club meets every Thursday at 5 PM...
The USC GPU computing lab is open Monday to Friday...
...
"""


Five lines is fine. But every model has a context window, and every token costs money and latency. You don't want to paste the entire USC student handbook into every question — most of it is irrelevant to "when does the AI Club meet?"

Retrieval is the answer to "which 3 paragraphs out of 3000 are actually about this question?" You compute that before calling the LLM, then send only the winners.


What an embedding actually is

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts that mean similar things land near each other in vector space. Two texts that mean different things land far apart.
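
To make "near" and "far" concrete, here is a minimal sketch with hand-made 3-D vectors. The numbers and labels are invented for illustration; a real embedding model returns far more dimensions, and the values are learned rather than chosen.

```python
import numpy as np

# Hand-made 3-D "embeddings" for illustration only (not real model output).
toy_vectors = {
    'club meeting time': np.array([0.9, 0.1, 0.0]),
    'when the club meets': np.array([0.8, 0.2, 0.1]),
    'soldering safety': np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similar = cosine(toy_vectors['club meeting time'], toy_vectors['when the club meets'])
different = cosine(toy_vectors['club meeting time'], toy_vectors['soldering safety'])

print(f'similar pair:   {similar:.3f}')
print(f'different pair: {different:.3f}')
```

The two club-related vectors point in almost the same direction and score close to 1, while the soldering vector points elsewhere and scores close to 0.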

NVIDIA's nv-embedqa-e5-v5 is an embedding model tuned specifically for question-answer retrieval. It has a quirk worth knowing about up front — it treats queries and passages differently. You tell it which one you're embedding via an input_type parameter. Getting this wrong is the most common beginner mistake — it still runs, but retrieval quality drops noticeably.

  • input_type='passage' → use for the documents you store
  • input_type='query' → use for the user's question at search time

That's it. Same model, two modes.
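
One way to make the two modes hard to mix up is to wrap them in two small functions, so call sites never pass the mode as a bare string. This is a sketch, not part of the workshop code: the helper names embedding_request, passage_request, and query_request are hypothetical, and only the model name and the two input_type values come from this post.

```python
# Hypothetical helpers; only the model name and the two input_type
# values ('passage' and 'query') are taken from the post.
VALID_INPUT_TYPES = ('passage', 'query')

def embedding_request(texts, input_type, model='nvidia/nv-embedqa-e5-v5'):
    """Build keyword arguments for client.embeddings.create()."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(
            f'input_type must be one of {VALID_INPUT_TYPES}, got {input_type!r}'
        )
    return {
        'model': model,
        'input': list(texts),
        # Provider-specific field, forwarded as-is by the OpenAI client.
        'extra_body': {'input_type': input_type},
    }

def passage_request(texts):
    """Arguments for embedding documents you intend to store."""
    return embedding_request(texts, 'passage')

def query_request(question):
    """Arguments for embedding a single user question at search time."""
    return embedding_request([question], 'query')

print(query_request('Where does the AI Club meet?')['extra_body'])
```

With an OpenAI-compatible client in scope, the result can be unpacked straight into the call, for example client.embeddings.create(**query_request(question)).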


Step 1: Set up the client and ask() from Part 1

If you're continuing from Part 1, you already have these defined and can skip this cell. If you're starting fresh, paste this in first — everything later builds on it.

%pip install -q openai numpy

import os, getpass
from openai import OpenAI

if not os.getenv('NVIDIA_API_KEY'):
    os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key (starts with nvapi-): ')

client = OpenAI(
    base_url='https://integrate.api.nvidia.com/v1',
    api_key=os.environ['NVIDIA_API_KEY'],
)

MODEL = 'meta/llama-3.1-8b-instruct'

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content


client calls NVIDIA's API Catalog. ask() is the same chat-completion shape from Part 1. The retriever we're about to build slots in next to these, not instead of them.


Step 2: Build a small knowledge base and embed it as passages

import numpy as np

EMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'

knowledge_base = [
    {'title': 'USC AI Club meeting',
     'text': 'The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.'},
    {'title': 'USC GPU lab hours',
     'text': 'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.'},
    {'title': 'NVIDIA Developer Program',
     'text': 'USC students can join the NVIDIA Developer Program for free.'},
    {'title': 'Next USC workshop',
     'text': 'The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).'},
    {'title': 'USC AI/ML office hours',
     'text': 'Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.'},
    {'title': 'USC robotics lab',
     'text': 'The USC robotics lab requires safety training before students can use the soldering station.'},
    {'title': 'USC tutoring',
     'text': 'Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM.'},
]

def embed_texts(texts, input_type='passage'):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={'input_type': input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

# Embed every chunk once, as a passage. Store the vector alongside the text.
embeddings = embed_texts([item['text'] for item in knowledge_base], input_type='passage')
for item, embedding in zip(knowledge_base, embeddings):
    item['embedding'] = embedding

print(f'Embedded {len(knowledge_base)} chunks. Vector dim:', embeddings[0].shape[0])


Two things to notice:

  • The OpenAI Python client doesn't have a native field for NVIDIA's input_type, so we pass it through extra_body. That's the right way to send provider-specific arguments without forking the client.
  • We're storing the embeddings in plain Python dicts. For seven chunks this is fine. For seven thousand, you'd reach for a vector database (and the only thing that changes is where the vectors live; the cosine math is identical).
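
Before you get anywhere near seven thousand chunks, you will probably want to send texts to the embedding endpoint in batches rather than in one request. The sketch below assumes that is necessary; the batch size of 32 is an arbitrary placeholder, not a documented NVIDIA limit, and embed_in_batches is a hypothetical helper. It takes the embedding function as an argument, so it can be tried without network access.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=32):
    """Call embed_fn on each batch and return all vectors in input order.

    embed_fn is any callable that maps a list of strings to a list of
    vectors. batch_size=32 is a placeholder, not a documented API limit.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    """Stand-in embedder for a dry run: one number per text (its length)."""
    return [[float(len(text))] for text in batch]

demo_vectors = embed_in_batches(['a', 'bb', 'ccc', 'dddd', 'eeeee'], fake_embed, batch_size=2)
print(demo_vectors)
```

In the notebook you would pass the real function instead of the stand-in, for example embed_in_batches(texts, embed_texts), which keeps the default input_type='passage'.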

Step 3: Retrieve the top-k chunks for a question

def cosine_similarity(a, b):
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        return 0.0
    return float(np.dot(a, b) / denominator)

def retrieve_context(question, k=3):
    question_embedding = embed_texts([question], input_type='query')[0]

    scored = []
    for item in knowledge_base:
        score = cosine_similarity(question_embedding, item['embedding'])
        scored.append((score, item))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_items = [item for score, item in scored[:k]]

    return '\n'.join(f"- {item['text']}" for item in top_items)


Three things are happening here:

  1. The question is embedded as a query, not a passage. This is the part beginners trip over. Same model, different mode.
  2. Cosine similarity scores how close the question vector is to each stored chunk vector. Scores range from -1 to 1: values near 1.0 mean very similar, and values near 0 mean unrelated.
  3. Top-k picks the highest-scoring chunks. Three is a reasonable default for a tiny knowledge base; tune it for yours.

There is no magic in the top-k step. A vector database would do the same comparison but use indexing tricks to do it fast at scale.
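
A step between the Python loop and a real vector database is to stack the chunk vectors into one NumPy matrix and score them all with a single matrix-vector product. This is a sketch of that idea, not the workshop's code: top_k_cosine is a hypothetical name, and the tiny 2-D vectors are made up so the result is easy to check by hand.

```python
import numpy as np

def top_k_cosine(query_vector, chunk_matrix, k=3):
    """Return (indices, scores) of the k rows most similar to query_vector."""
    query = np.asarray(query_vector, dtype=np.float32)
    matrix = np.asarray(chunk_matrix, dtype=np.float32)

    # One denominator per row: |row| * |query|.
    denominators = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Zero vectors get a score of 0, mirroring the zero check in cosine_similarity.
    safe = np.where(denominators == 0, 1.0, denominators)
    scores = (matrix @ query) / safe

    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest score first
    return order.tolist(), scores[order].tolist()

# Made-up 2-D vectors: row 0 matches the query exactly, row 1 is orthogonal,
# row 2 sits at 45 degrees.
chunk_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = top_k_cosine([1.0, 0.0], chunk_matrix, k=2)
print(indices, scores)
```

With the post's data, the matrix would be np.stack([item['embedding'] for item in knowledge_base]), and the returned indices map back into knowledge_base.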


Step 4: Plug retrieval into the same ask() from Part 1

def ask_with_retrieval(question):
    context = retrieve_context(question)

    system_prompt = f"""You are a USC campus assistant. Answer ONLY using the
context below. If the answer is not in the context, say
"I don't have that information — check with the USC AI Club."

CONTEXT:
{context}
"""

    return ask(system_prompt, question)

for question in [
    'Where does the USC AI Club meet?',
    'When can I get Python tutoring at USC?',
    'What is the wifi password?',
]:
    print(f'Q: {question}')
    print(f'Context:\n{retrieve_context(question)}')
    print(f'A: {ask_with_retrieval(question)}\n')


Run it. Three things to read carefully:

  • The first question retrieves the AI Club chunk and answers from it. Good.
  • The second retrieves the tutoring chunk and answers from it. Notice that "Python tutoring" doesn't appear verbatim in the stored text — the chunk says "introductory Python" — but the embedding model knows those are semantically close. That's the whole point of vector search over keyword search.
  • The wifi question retrieves three chunks anyway (top-k always returns k items), but none of them contain a password. The assistant falls back to the refusal line because the "Answer ONLY using the context below" rule in the system prompt tells it to. That's the guardrail from Part 1 doing its job, and it's exactly the bridge into Part 3.
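
Because top-k always returns k items, one optional addition is a minimum-score cutoff so weak matches never reach the prompt. This is not in the workshop code, and the threshold of 0.3 is a made-up placeholder: some embedding models produce scores clustered in a narrow band, so a usable cutoff has to be measured on real questions. The scores in the example are invented, not model output.

```python
def filter_by_score(scored_chunks, k=3, min_score=0.3):
    """Keep at most k chunks whose similarity is at least min_score.

    scored_chunks is a list of (score, text) pairs. min_score=0.3 is a
    placeholder; tune it against your own embedding model and questions.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [(score, text) for score, text in ranked[:k] if score >= min_score]

# Invented scores for illustration only.
wifi_like = [(0.12, 'GPU lab hours'), (0.10, 'AI Club meeting'), (0.08, 'Tutoring')]
club_like = [(0.71, 'AI Club meeting'), (0.35, 'Next workshop'), (0.15, 'GPU lab hours')]

print(filter_by_score(wifi_like))
print(filter_by_score(club_like))
```

An empty result is a useful signal in its own right: the caller can skip the LLM call and return the refusal line directly.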

Step 5: What you actually did

You replaced the hand-picked campus_info string from Part 1 with a real retrieval step. The model call is identical, and the system prompt follows the same guardrail pattern — answer only from the provided context, otherwise fall back. The only structural change is that {context} now comes from a function instead of a hardcoded constant.

That swap is the entire mental model behind RAG. Real production systems add chunking strategies, hybrid search, re-ranking, and a vector database — but the spine stays the same: embed once, embed query, compare, pass top-k to the LLM.
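
Of the production additions listed above, chunking is the one you hit first, because long documents have to be split before they can be embedded. Here is a minimal sketch of one common approach, fixed-size word windows with overlap. The function name and the default sizes are illustrative assumptions, not recommendations from this workshop.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk. The defaults are
    placeholders, not tuned values.
    """
    if chunk_size < 1:
        raise ValueError('chunk_size must be at least 1')
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be >= 0 and smaller than chunk_size')

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

sample = ' '.join(f'w{i}' for i in range(12))
sample_chunks = chunk_words(sample, chunk_size=5, overlap=2)
for chunk in sample_chunks:
    print(chunk)
```

Each returned string would become one 'text' entry in knowledge_base and be embedded as a passage.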

In your own work, the seven-entry knowledge_base becomes hundreds of paragraphs scraped from PDFs, lecture notes, club Slack archives, Notion pages, or a wiki. The retrieval logic doesn't change. The dict-with-vector storage gets replaced by something like pgvector, Qdrant, or Pinecone the moment you outgrow a Python list.


Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open the Part 1 notebook — paste the Part 2 cells underneath.

MIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.


Previously / next in this series

  • Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
  • Part 3 (next): Add Guardrails So It Doesn't Lie — a two-layer approach using prompt scope + a tiny verifier call. The fallback line that fired on the wifi question above is the foundation we build on.