97. Embeddings and Vector Search: Semantic Search That Works
Akhilesh · 2026-05-26 · via DEV Community

Traditional search works on keywords. You type "cheap hotel", it looks for documents containing those exact words.

Someone asks "affordable accommodation near the beach". Your documents say "budget-friendly lodging by the coast". Zero keyword overlap. Zero results. Search fails.
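
To make that failure concrete, here is a minimal sketch of keyword matching as plain word overlap. The `keyword_overlap` helper and the tiny stopword list are hypothetical, written only for illustration; real keyword engines are more sophisticated but fail on this pair for the same reason.

```python
# Hypothetical stopword list, just enough for this example
STOPWORDS = {"the", "a", "an", "by", "near", "on", "of"}

def keyword_overlap(query: str, document: str) -> set:
    """Return the non-stopword terms shared by the query and the document."""
    query_terms = set(query.lower().split()) - STOPWORDS
    doc_terms = set(document.lower().split()) - STOPWORDS
    return query_terms & doc_terms

query = "affordable accommodation near the beach"
document = "budget-friendly lodging by the coast"

shared = keyword_overlap(query, document)
print(shared)  # set() -> no shared keywords, so a keyword search returns nothing
```

The two texts mean nearly the same thing, yet they share no content words, so any scoring based on term overlap has nothing to work with.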

Embeddings fix this. They convert text into vectors of numbers where similar meanings end up geometrically close. "Cheap" and "affordable" land near each other in vector space. "Hotel" and "accommodation" land near each other. Semantic similarity becomes distance.

This idea underpins most modern semantic search and retrieval systems, including the retrieval layers behind many AI assistants and coding tools.


What You'll Learn Here

  • What embeddings are and how they encode meaning
  • Cosine similarity: measuring how close two vectors are
  • Sentence transformers: the right models for semantic search
  • Building a semantic search engine from scratch
  • FAISS: fast approximate nearest neighbor search at scale
  • ChromaDB: a vector database for production use
  • Practical patterns for document retrieval

What Embeddings Actually Are

An embedding is a dense vector of floating point numbers. Every piece of text maps to one vector.

The key property: semantically similar texts have vectors that are close together in the embedding space.

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a sentence embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Embed some sentences
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Dogs love to play fetch.",
    "Machine learning is a subset of AI.",
    "Artificial intelligence includes ML.",
]

embeddings = model.encode(sentences)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each sentence → {embeddings.shape[1]}-dimensional vector")
print(f"\nFirst embedding (first 8 dims): {embeddings[0][:8].round(4)}")


Output (exact values depend on the model version):

Embedding shape: (5, 384)
Each sentence → 384-dimensional vector

First embedding (first 8 dims): [ 0.0234 -0.1823  0.0912  0.3421 -0.0541  0.2134 -0.0823  0.1234]


384 numbers represent the meaning of an entire sentence. The model's weights were learned during training so that similar sentences produce similar vectors.


Cosine Similarity: Measuring Semantic Distance

Raw Euclidean distance is sensitive to vector magnitude as well as direction. Two texts on the same topic can produce vectors of different lengths, which pushes them apart even though they point the same way.

Cosine similarity measures the angle between vectors, not their magnitude. It ranges from -1 to 1. Same direction = 1. Perpendicular = 0. Opposite = -1.
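
A small worked example may help before using real embeddings. The 2-dimensional vectors below are hand-picked toy values (an assumption for illustration, not model output), chosen so that each case lands on one of the three reference points.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0]
same_direction = [10.0, 20.0]   # a scaled by 10
perpendicular = [-2.0, 1.0]
opposite = [-1.0, -2.0]

print(round(cosine(a, same_direction), 6))     # 1.0
print(round(cosine(a, perpendicular), 6))      # 0.0
print(round(cosine(a, opposite), 6))           # -1.0
print(round(euclidean(a, same_direction), 3))  # 20.125
```

The scaled vector is far away by Euclidean distance but identical by cosine similarity, which is exactly the magnitude-invariance described above.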

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare all pairs
print("Cosine similarity between sentences:")
print(f"{'Pair':<55} {'Similarity'}")
print("-" * 70)

pairs = [
    (0, 1, "cat on mat vs feline on rug"),
    (0, 2, "cat on mat vs dogs play fetch"),
    (3, 4, "ML subset AI vs AI includes ML"),
    (0, 3, "cat on mat vs ML is AI"),
]

for i, j, desc in pairs:
    sim = cosine_sim(embeddings[i], embeddings[j])
    print(f"{desc:<55} {sim:.4f}")


Output (exact scores depend on the model version):

Cosine similarity between sentences:
Pair                                                    Similarity
----------------------------------------------------------------------
cat on mat vs feline on rug                             0.8341
cat on mat vs dogs play fetch                           0.4123
ML subset AI vs AI includes ML                          0.8912
cat on mat vs ML is AI                                  0.1234


"Cat on mat" and "feline on rug" score 0.83. Same concept, different words. "ML subset AI" and "AI includes ML" score 0.89. Semantically equivalent.

"Cat on mat" and "ML is AI" score 0.12. Completely different topics.


Sentence Transformers: The Right Models

Word-level models like Word2Vec produce one vector per word, so a sentence vector is usually built by averaging its word vectors. That loses word order and sentence structure. Sentence transformers produce one embedding for the entire sentence and are trained on sentence-level tasks.
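
The cost of averaging can be shown with a toy sketch. The word vectors below are hypothetical, hand-picked 2-dimensional values (not real Word2Vec output), and `average_embedding` is an illustrative helper rather than a library function.

```python
# Hypothetical word vectors, hand-picked for illustration
word_vectors = {
    "dog":   [1.0, 0.0],
    "bites": [0.25, 0.75],
    "man":   [0.5, 0.5],
}

def average_embedding(sentence):
    """Sentence vector as the mean of its word vectors (word order is ignored)."""
    vectors = [word_vectors[word] for word in sentence.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

v1 = average_embedding("dog bites man")
v2 = average_embedding("man bites dog")
print(v1 == v2)  # True: two different meanings, one identical vector
```

Because a mean does not depend on order, the two sentences become indistinguishable. A sentence-level model reads the whole sequence, so it is not bound by this limitation.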

from sentence_transformers import SentenceTransformer

# Popular embedding models

models_info = {
    'all-MiniLM-L6-v2': {
        'dim': 384,
        'size': '80MB',
        'speed': 'very fast',
        'quality': 'good',
        'note': 'Best starting point. Fast and accurate.'
    },
    'all-mpnet-base-v2': {
        'dim': 768,
        'size': '420MB',
        'speed': 'medium',
        'quality': 'excellent',
        'note': 'Best quality for semantic search.'
    },
    'paraphrase-multilingual-MiniLM-L12-v2': {
        'dim': 384,
        'size': '470MB',
        'speed': 'fast',
        'quality': 'good',
        'note': 'Supports 50+ languages.'
    },
    'text-embedding-3-small (OpenAI API)': {
        'dim': 1536,
        'size': 'API',
        'speed': 'API latency',
        'quality': 'very high',
        'note': 'Hosted API. Strong quality, billed per token.'
    }
}

print(f"{'Model':<45} {'Dim':<6} {'Size':<10} {'Quality'}")
print("-" * 70)
for name, info in models_info.items():
    print(f"{name:<45} {info['dim']:<6} {info['size']:<10} {info['quality']}")

# Load the recommended default
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')



Building a Semantic Search Engine From Scratch

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# A knowledge base of documents
documents = [
    "Python is a high-level programming language known for its simplicity and readability.",
    "Machine learning algorithms learn patterns from data without being explicitly programmed.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "The transformer architecture uses self-attention mechanisms to process sequential data.",
    "BERT is a bidirectional transformer pretrained on masked language modeling.",
    "GPT uses a decoder-only transformer trained on next-token prediction.",
    "Fine-tuning adapts a pretrained model to a specific task using domain data.",
    "LoRA reduces the number of trainable parameters by using low-rank decomposition.",
    "Vector databases store embeddings and support fast nearest-neighbor search.",
    "RAG combines retrieval with generation to give LLMs access to external knowledge.",
    "Cosine similarity measures the angle between two vectors in embedding space.",
    "Tokenization breaks text into smaller units called tokens before feeding to a model.",
    "Backpropagation computes gradients by applying the chain rule backward through a network.",
    "Overfitting occurs when a model learns the training data too well and fails on new data.",
    "Cross-validation gives a more reliable estimate of model performance than a single split.",
]

class SemanticSearch:
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        self.model     = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents):
        self.documents  = documents
        print(f"Encoding {len(documents)} documents...")
        self.embeddings = self.model.encode(documents, show_progress_bar=True)
        print(f"Indexed {len(documents)} documents. Embedding shape: {self.embeddings.shape}")

    def search(self, query, top_k=3):
        # Encode the query
        query_embedding = self.model.encode([query])

        # Compute cosine similarity with all documents
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]

        # Get top-k results
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'score':    similarities[idx],
                'index':    idx
            })
        return results

# Build the search engine
search_engine = SemanticSearch()
search_engine.index(documents)

# Test queries
queries = [
    "How do transformers work?",
    "What is the difference between BERT and GPT?",
    "How can I make training more efficient?",
    "What happens when a model memorizes training data?",
]

for query in queries:
    print(f"\nQuery: '{query}'")
    print("-" * 60)
    results = search_engine.search(query, top_k=3)
    for i, r in enumerate(results):
        print(f"  {i+1}. [{r['score']:.3f}] {r['document'][:80]}...")


Output (exact scores depend on the model version):

Query: 'How do transformers work?'
------------------------------------------------------------
  1. [0.712] The transformer architecture uses self-attention mechanisms...
  2. [0.634] BERT is a bidirectional transformer pretrained on masked...
  3. [0.601] GPT uses a decoder-only transformer trained on next-token...

Query: 'What is the difference between BERT and GPT?'
------------------------------------------------------------
  1. [0.823] BERT is a bidirectional transformer pretrained on masked...
  2. [0.798] GPT uses a decoder-only transformer trained on next-token...
  3. [0.612] The transformer architecture uses self-attention mechanisms...

Query: 'How can I make training more efficient?'
------------------------------------------------------------
  1. [0.651] LoRA reduces the number of trainable parameters by using...
  2. [0.589] Fine-tuning adapts a pretrained model to a specific task...
  3. [0.534] Machine learning algorithms learn patterns from data...

Query: 'What happens when a model memorizes training data?'
------------------------------------------------------------
  1. [0.714] Overfitting occurs when a model learns the training data...
  2. [0.543] Cross-validation gives a more reliable estimate of model...
  3. [0.498] Fine-tuning adapts a pretrained model to a specific task...


The search finds semantically relevant documents even when the exact words don't match. "How can I make training more efficient?" retrieves the LoRA document even though that document never uses the word "efficient".


FAISS: Fast Search at Scale

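The brute-force approach (compare the query to every document) works for thousands of documents. For millions, exhaustive comparison gets slow, and approximate nearest neighbor (ANN) search trades a little accuracy for much lower latency. FAISS (Facebook AI Similarity Search) is a widely used library for both exact and approximate search.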

pip install faiss-cpu   # or faiss-gpu for GPU support


import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Generate sample embeddings (simulating a large corpus)
model       = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
dimension   = 384   # all-MiniLM-L6-v2 embedding size

# Simulate 10,000 documents
np.random.seed(42)
fake_embeddings = np.random.randn(10000, dimension).astype('float32')
# Normalize for cosine similarity (FAISS uses inner product)
faiss.normalize_L2(fake_embeddings)

# Build FAISS index
# IndexFlatIP: exact inner product search (cosine similarity after L2 normalization)
index = faiss.IndexFlatIP(dimension)
index.add(fake_embeddings)
print(f"FAISS index size: {index.ntotal} vectors")

# Search
query_embedding = np.random.randn(1, dimension).astype('float32')
faiss.normalize_L2(query_embedding)

k = 5
distances, indices = index.search(query_embedding, k)

print(f"\nTop {k} nearest neighbors:")
for dist, idx in zip(distances[0], indices[0]):
    print(f"  Index {idx}: similarity={dist:.4f}")
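
The reason normalization matters here can be checked by hand. The sketch below uses two small made-up vectors (an assumption for illustration) to confirm that, once both are scaled to unit length, the inner product equals the cosine similarity, and squared L2 distance becomes 2 - 2 * cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Cosine similarity computed directly from the raw vectors
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Inner product and squared L2 distance after normalization
a_unit, b_unit = l2_normalize(a), l2_normalize(b)
inner_product = dot(a_unit, b_unit)
squared_l2 = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

print(round(cosine, 4))         # 0.7333
print(round(inner_product, 4))  # 0.7333
print(round(squared_l2, 4))     # 0.5333, i.e. 2 - 2 * cosine
```

So on unit vectors, ranking by inner product, by cosine similarity, or by L2 distance gives the same order of neighbors.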


# For very large datasets: use IVF index (approximate, faster)
# IVF = Inverted File Index, partitions space into clusters

n_clusters = 100   # number of partitions (sqrt of dataset size is a good rule)
quantizer  = faiss.IndexFlatIP(dimension)
ivf_index  = faiss.IndexIVFFlat(quantizer, dimension, n_clusters, faiss.METRIC_INNER_PRODUCT)

# Must train IVF index before adding vectors
ivf_index.train(fake_embeddings)
ivf_index.add(fake_embeddings)

# Tune nprobe: how many clusters to search (higher = more accurate, slower)
ivf_index.nprobe = 10

distances_ivf, indices_ivf = ivf_index.search(query_embedding, k)
print(f"\nIVF index results (approximate but faster):")
for dist, idx in zip(distances_ivf[0], indices_ivf[0]):
    print(f"  Index {idx}: similarity={dist:.4f}")

# Benchmark: exact vs approximate
import time

# Exact search (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
for _ in range(100):
    index.search(query_embedding, k)
exact_time = (time.perf_counter() - start) / 100

# Approximate search
start = time.perf_counter()
for _ in range(100):
    ivf_index.search(query_embedding, k)
approx_time = (time.perf_counter() - start) / 100

print(f"\nSearch time per query:")
print(f"  Exact (IndexFlatIP): {exact_time*1000:.2f}ms")
print(f"  Approximate (IVF):   {approx_time*1000:.2f}ms")
print(f"  Speedup: {exact_time/approx_time:.1f}x")
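
To see what an inverted file index does conceptually, here is a toy pure-Python sketch. It is not how FAISS is implemented: the helper names are hypothetical, the centroids are simply sampled from the data (FAISS trains them with k-means), and the data is random.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def build_buckets(vectors, centroids):
    """Assign every vector id to the bucket of its most similar centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        buckets[best].append(vid)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [vid for c in ranked[:nprobe] for vid in buckets[c]]
    candidates.sort(key=lambda vid: dot(query, vectors[vid]), reverse=True)
    return candidates[:k], len(candidates)

random.seed(0)
vectors = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(1000)]
centroids = [vectors[i] for i in random.sample(range(1000), 10)]  # assumption: sampled, not trained
buckets = build_buckets(vectors, centroids)
query = normalize([random.gauss(0, 1) for _ in range(8)])

exact = sorted(range(1000), key=lambda vid: dot(query, vectors[vid]), reverse=True)[:3]
approx, scanned = ivf_search(query, vectors, centroids, buckets, k=3, nprobe=3)
full, scanned_all = ivf_search(query, vectors, centroids, buckets, k=3, nprobe=10)

print(f"nprobe=3 scanned {scanned} of 1000 vectors")
print(full == exact)  # True: probing every bucket reduces to exact search
```

Lower `nprobe` scans fewer vectors and may miss a true neighbor that sits in an unprobed bucket; raising it moves the result toward the exact answer at the cost of more comparisons.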



ChromaDB: A Vector Database for Real Projects

FAISS is powerful but low-level. ChromaDB adds persistence, metadata filtering, and a clean API. Good for production use.

pip install chromadb


import chromadb
from sentence_transformers import SentenceTransformer

# Create a ChromaDB client
client = chromadb.Client()   # in-memory; use chromadb.PersistentClient('./chroma_db') for persistence

# Create a collection
collection = client.create_collection(
    name='ml_knowledge_base',
    metadata={'hnsw:space': 'cosine'}   # use cosine similarity
)

# Your documents with metadata
docs = [
    {
        'id': 'doc1',
        'text': 'Python is a high-level programming language known for simplicity.',
        'metadata': {'topic': 'programming', 'difficulty': 'beginner'}
    },
    {
        'id': 'doc2',
        'text': 'Machine learning algorithms learn patterns from data.',
        'metadata': {'topic': 'ml', 'difficulty': 'intermediate'}
    },
    {
        'id': 'doc3',
        'text': 'Neural networks are inspired by biological neural networks.',
        'metadata': {'topic': 'deep_learning', 'difficulty': 'intermediate'}
    },
    {
        'id': 'doc4',
        'text': 'BERT is a bidirectional transformer pretrained on MLM.',
        'metadata': {'topic': 'nlp', 'difficulty': 'advanced'}
    },
    {
        'id': 'doc5',
        'text': 'LoRA reduces trainable parameters using low-rank decomposition.',
        'metadata': {'topic': 'fine_tuning', 'difficulty': 'advanced'}
    },
]

# Add documents (ChromaDB can use its own embedding model or you provide embeddings)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode all texts in one batched call instead of one call per document
doc_embeddings = model.encode([d['text'] for d in docs]).tolist()

collection.add(
    ids       = [d['id'] for d in docs],
    documents = [d['text'] for d in docs],
    embeddings= doc_embeddings,
    metadatas = [d['metadata'] for d in docs]
)

print(f"Collection size: {collection.count()}")

# Basic query
results = collection.query(
    query_embeddings=[model.encode("How do transformers work?").tolist()],
    n_results=3
)

print("\nQuery: 'How do transformers work?'")
for i, (doc, dist) in enumerate(zip(results['documents'][0], results['distances'][0])):
    print(f"  {i+1}. [{1-dist:.3f}] {doc}")   # ChromaDB returns distance, convert to similarity


# Filter by metadata
results_filtered = collection.query(
    query_embeddings=[model.encode("machine learning concepts").tolist()],
    n_results=3,
    where={'difficulty': 'advanced'}   # only return advanced documents
)

print("\nQuery with filter (difficulty=advanced):")
for doc, meta in zip(results_filtered['documents'][0], results_filtered['metadatas'][0]):
    print(f"  [{meta['topic']}] {doc}")


# Update and delete
collection.update(
    ids=['doc1'],
    documents=['Python is a versatile high-level programming language.'],
    embeddings=[model.encode('Python is a versatile high-level programming language.').tolist()]
)

collection.delete(ids=['doc5'])
print(f"\nAfter update and delete: {collection.count()} documents")



Batch Encoding: Processing Large Datasets Efficiently

from sentence_transformers import SentenceTransformer
import numpy as np
import time

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Simulate a large dataset
large_corpus = [f"This is document number {i} about topic {i % 10}." for i in range(5000)]

# Efficient batch encoding
print("Encoding 5000 documents...")
start = time.time()

embeddings = model.encode(
    large_corpus,
    batch_size=64,           # process 64 at a time
    show_progress_bar=True,
    normalize_embeddings=True  # L2 normalize for cosine similarity
)

elapsed = time.time() - start
print(f"\nDone in {elapsed:.1f}s")
print(f"Speed: {len(large_corpus)/elapsed:.0f} docs/second")
print(f"Embeddings shape: {embeddings.shape}")
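
`model.encode` batches internally, but if the corpus is too large to hold in memory you may want to stream it in chunks yourself. The `batched` generator below is a hypothetical helper sketched for that purpose; each chunk it yields could be passed to `model.encode` and the resulting vectors appended to an index.

```python
from typing import Iterable, Iterator, List

def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# A generator stands in for a corpus read lazily from disk
corpus = (f"This is document number {i}." for i in range(150))
sizes = [len(chunk) for chunk in batched(corpus, batch_size=64)]
print(sizes)  # [64, 64, 22]
```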



Evaluating Embedding Quality

Not all embedding models perform equally on all tasks. Test before committing.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def evaluate_embeddings(model_name, test_pairs):
    """
    test_pairs: list of (sent1, sent2, label) where label=1 means similar, 0 means different
    """
    model = SentenceTransformer(model_name)

    sents1 = [p[0] for p in test_pairs]
    sents2 = [p[1] for p in test_pairs]
    labels = [p[2] for p in test_pairs]

    emb1 = model.encode(sents1)
    emb2 = model.encode(sents2)

    similarities = [cosine_similarity([e1], [e2])[0][0] for e1, e2 in zip(emb1, emb2)]

    # Threshold at 0.5 to predict similar/different
    preds = [1 if s > 0.5 else 0 for s in similarities]
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)

    return accuracy, similarities

test_pairs = [
    ("cheap hotel", "affordable accommodation", 1),
    ("machine learning", "artificial intelligence", 1),
    ("cat on the mat", "deep learning model", 0),
    ("how to code in python", "python programming tutorial", 1),
    ("stock market crash", "cooking recipes", 0),
    ("neural network", "deep learning", 1),
    ("fix bug in code", "debug software", 1),
    ("the weather today", "quantum physics research", 0),
]

for model_name in ['sentence-transformers/all-MiniLM-L6-v2',
                    'sentence-transformers/all-mpnet-base-v2']:
    acc, sims = evaluate_embeddings(model_name, test_pairs)
    print(f"\n{model_name.split('/')[-1]}:")
    print(f"  Accuracy on test pairs: {acc:.1%}")
    for (s1, s2, label), sim in zip(test_pairs, sims):
        status = 'correct' if (sim > 0.5) == label else 'WRONG'
        print(f"  [{status}] sim={sim:.3f} | '{s1[:25]}' vs '{s2[:25]}'")
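
The 0.5 cutoff above is a convenient default, not a tuned value. One way to choose it more carefully is to sweep candidate thresholds over labeled pairs, as in this sketch. The similarity scores and labels below are made up for illustration (they are not output from any model), and `accuracy_at` is a hypothetical helper.

```python
def accuracy_at(threshold, similarities, labels):
    """Fraction of pairs classified correctly when sim > threshold means 'similar'."""
    preds = [1 if s > threshold else 0 for s in similarities]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Hypothetical scores and labels, for illustration only
similarities = [0.82, 0.47, 0.12, 0.76, 0.05, 0.64, 0.58, 0.31]
labels       = [1,    1,    0,    1,    0,    1,    1,    0]

candidates = [t / 20 for t in range(1, 20)]  # 0.05, 0.10, ..., 0.95
best = max(candidates, key=lambda t: accuracy_at(t, similarities, labels))

print(f"accuracy at 0.50: {accuracy_at(0.5, similarities, labels):.3f}")   # 0.875
print(f"best threshold:   {best:.2f}")                                      # 0.35
print(f"accuracy at best: {accuracy_at(best, similarities, labels):.3f}")  # 1.000
```

A threshold tuned on the same pairs it is scored on will look better than it really is, so in practice pick it on one set of pairs and report accuracy on another.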



Common Embedding Patterns

# Pattern 1: Asymmetric search (short queries matched against longer passages)
# The same bi-encoder embeds both sides; MS MARCO models are trained on query-passage pairs

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

bi_encoder = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-v4')

# Documents
passages = [
    "LoRA stands for Low-Rank Adaptation and is used for efficient fine-tuning.",
    "The Eiffel Tower is a famous landmark in Paris, France.",
    "Python was created by Guido van Rossum and first released in 1991.",
]

# Short query
query = "What is LoRA?"

query_emb    = bi_encoder.encode(query)
passage_embs = bi_encoder.encode(passages)

sims = cosine_similarity([query_emb], passage_embs)[0]
top  = np.argmax(sims)
print(f"Query: '{query}'")
print(f"Best match [{sims[top]:.3f}]: '{passages[top]}'")


# Pattern 2: Clustering embeddings to find topics
from sklearn.cluster import KMeans

sentences = [
    "Python is great for data science.",
    "R is used for statistical computing.",
    "Machine learning requires lots of data.",
    "Deep learning uses neural networks.",
    "Java is widely used in enterprise software.",
    "JavaScript powers the web frontend.",
    "Supervised learning uses labeled data.",
    "Unsupervised learning finds hidden patterns.",
]

model      = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

print("\nClustered sentences:")
for cluster_id in range(3):
    print(f"\nCluster {cluster_id}:")
    for sent, label in zip(sentences, labels):
        if label == cluster_id:
            print(f"  - {sent}")



Quick Cheat Sheet

Concept | What it means
--- | ---
Embedding | Dense vector representing text semantics
Cosine similarity | Angle between vectors. 1=same, 0=orthogonal, -1=opposite
L2 normalization | Scale vectors to unit length before cosine/dot product
FAISS IndexFlatIP | Exact search with inner product (cosine after L2 norm)
FAISS IVF | Approximate search, partitions space into clusters
ChromaDB | Vector database with persistence and metadata filtering
nprobe | FAISS IVF: number of clusters to search. Higher=more accurate, slower
Batch encoding | Encode many texts at once for efficiency

Task | Code
--- | ---
Load model | SentenceTransformer('all-MiniLM-L6-v2')
Encode text | model.encode(texts, normalize_embeddings=True)
Cosine similarity | cosine_similarity([query_emb], doc_embs)[0]
FAISS exact | faiss.IndexFlatIP(dim)
FAISS approximate | faiss.IndexIVFFlat(quantizer, dim, n_clusters, faiss.METRIC_INNER_PRODUCT)
ChromaDB add | collection.add(ids, documents, embeddings, metadatas)
ChromaDB search | collection.query(query_embeddings, n_results=5)
Top-k results | np.argsort(similarities)[::-1][:k]

Practice Challenges

Level 1:
Build a semantic search engine on a topic you care about. Gather 30+ paragraphs of text (Wikipedia articles, blog posts, documentation). Encode them with all-MiniLM-L6-v2. Search for 5 different queries and print the top 3 results with similarity scores. Are the results actually relevant?

Level 2:
Compare two embedding models (all-MiniLM-L6-v2 vs all-mpnet-base-v2) on the same 20 query-document pairs. Which one finds more relevant results? Is the quality difference worth the size difference?

Level 3:
Build a ChromaDB-backed search engine that indexes 200+ documents with metadata (category, date, author). Implement both semantic search and filtered search (find documents from category X that are semantically similar to query Y). Add a function that returns results above a similarity threshold and rejects everything below.



Next up, Post 98: RAG: Give Your AI Access to Your Documents. Retrieval Augmented Generation combines semantic search with LLM generation. Ask questions about any document and get accurate, grounded answers.