GitHub - 0xchamin/mcptube: Transform YouTube videos into a compounding knowledge base with transcripts, vision analysis, and agentic search. Works as an MCP server for Claude, Copilot & more.
2026-04-14 · via Hacker News - Newest: "LLM"

YouTube video knowledge engine — transcripts, vision, and persistent wiki.

PyPI Python License: MIT

mcptube-vision transforms YouTube videos into a persistent, structured knowledge base using both transcripts and visual frame analysis. Built on the Karpathy LLM Wiki pattern: knowledge compounds with every video you add.

Evolved from mcptube v0.1 — mcptube-vision replaces semantic chunk search with a persistent wiki that gets smarter with every video ingested.


🧠 How It Works

Traditional video tools re-discover knowledge from scratch on every query. mcptube-vision is different:

           mcptube v0.1                    mcptube-vision
    ┌─────────────────────┐         ┌─────────────────────────┐
    │ Query → vector search│         │ Video ingested → LLM     │
    │ → raw chunks → LLM  │         │ extracts knowledge →     │
    │ → answer (from scratch│        │ wiki pages created →     │
    │   every time)        │         │ cross-references built   │
    └─────────────────────┘         │                         │
                                    │ Query → FTS5 + agent    │
                                    │ → reasons over compiled  │
                                    │   knowledge → answer     │
                                    └─────────────────────────┘

| | v0.1 (Video Search Engine) | vision (Video Knowledge Engine) |
|---|---|---|
| On ingest | Chunk transcript, embed in vector DB | LLM watches + reads, writes wiki pages |
| On query | Find similar chunks | Agent reasons over compiled knowledge |
| Frames | Timestamp or keyword extraction | Scene-change detection + vision model |
| Cross-video | Re-search all chunks each time | Connections already in the wiki |
| Over time | Library of isolated videos | Compounding knowledge base |

🏗️ Technical Architecture

mcptube-vision is built around a core insight: video knowledge should compound, not be re-discovered. Every architectural decision flows from this principle.

System Overview

flowchart TD
    YT[YouTube URL] --> EXT[YouTubeExtractor\ntranscript + metadata]
    EXT --> FRAMES[SceneFrameExtractor\nffmpeg scene-change detection]
    FRAMES --> VISION[VisionDescriber\nLLM vision model]
    VISION --> WIKI_EXT[WikiExtractor\nLLM knowledge extraction]
    EXT --> WIKI_EXT
    WIKI_EXT --> WIKI_ENG[WikiEngine\nmerge + update]
    WIKI_ENG --> FILE[FileWikiRepository\nJSON pages on disk]
    WIKI_ENG --> FTS[SQLite FTS5\nsearch index]
    FILE --> AGENT[Ask Agent\nFTS5 → LLM reasoning]
    FTS --> AGENT
    FILE --> CLI[CLI / MCP Server]
    FTS --> CLI

    subgraph Ingestion Pipeline
        EXT
        FRAMES
        VISION
        WIKI_EXT
    end

    subgraph Knowledge Store
        WIKI_ENG
        FILE
        FTS
    end

    subgraph Retrieval
        AGENT
    end

The system overview shows three distinct subsystems connected by a unidirectional data flow. The Ingestion Pipeline (left) transforms a raw YouTube URL into structured knowledge through four stages: transcript extraction, scene-change frame detection, vision-model description, and LLM-powered knowledge extraction. Each stage enriches the signal — raw video becomes text, text becomes typed knowledge objects.

The Knowledge Store (center) is the persistent layer. The WikiEngine applies merge semantics, deciding whether to create new pages or append to existing ones, then writes JSON files to disk and updates the FTS5 search index right after the write. These two stores serve different access patterns: files for full-page reads and exports, FTS5 for sub-millisecond keyword retrieval.

The Retrieval layer (right) combines both stores. The Ask Agent first narrows via FTS5, then loads full pages from disk, and finally reasons over candidates with structural awareness from the wiki TOC. The CLI and MCP Server sit alongside as thin presentation layers — they never contain business logic.


Ingestion Flow

sequenceDiagram
    participant User
    participant CLI
    participant YouTubeExtractor
    participant SceneFrameExtractor
    participant VisionDescriber
    participant WikiExtractor
    participant WikiEngine
    participant FileRepo
    participant FTS5

    User->>CLI: mcptube add <url>
    CLI->>YouTubeExtractor: fetch transcript + metadata
    YouTubeExtractor-->>CLI: segments, duration, channel

    CLI->>SceneFrameExtractor: extract scene frames (ffmpeg)
    SceneFrameExtractor-->>CLI: frame images (scene_000x.jpg)

    CLI->>VisionDescriber: describe frames (LLM vision)
    VisionDescriber-->>CLI: frame descriptions (prose)

    CLI->>WikiExtractor: extract knowledge\n(transcript + frame descriptions)
    WikiExtractor-->>CLI: entities, topics, concepts, video page

    CLI->>WikiEngine: merge into wiki
    WikiEngine->>FileRepo: write/update JSON pages\n(append entities, rewrite synthesis)
    WikiEngine->>FTS5: update search index
    FileRepo-->>WikiEngine: ✅
    FTS5-->>WikiEngine: ✅
    WikiEngine-->>CLI: wiki processed
    CLI-->>User: ✅ Added + Wiki: full_analysis

The ingestion flow is a write-once pipeline — LLM-heavy at ingest time, but never repeated for the same video. This is the key cost tradeoff: invest tokens upfront to build compiled knowledge, so retrieval is cheap.

The sequence shows two critical branching points. First, after transcript extraction, the pipeline forks into vision processing (scene frames → LLM vision descriptions) and feeds both streams into the WikiExtractor. This dual-signal approach means the LLM sees both what was said and what was shown — critical for content like coding tutorials or slide-based lectures where the transcript alone misses visual information.

Second, the WikiEngine merge step is where knowledge compounding happens. Rather than blindly writing new pages, it checks for existing entities, topics, and concepts — appending new video contributions to existing pages and rewriting synthesis summaries. This is why ingesting video #10 makes the wiki smarter about videos #1–9 too: shared concepts get richer synthesis with each new source.

The final FTS5 index update runs synchronously after the file write, ensuring search consistency. There is no eventual-consistency window — once add_video returns, all new knowledge is immediately searchable.


Retrieval Flow

sequenceDiagram
    participant User
    participant CLI
    participant FTS5
    participant FileRepo
    participant Agent

    User->>CLI: mcptube ask "What is RLHF?"

    CLI->>FTS5: keyword search (sanitized query)
    FTS5-->>CLI: candidate page slugs (ranked)

    CLI->>FileRepo: load candidate pages (JSON)
    FileRepo-->>CLI: wiki pages (entities, topics, concepts)

    CLI->>FileRepo: load wiki TOC
    FileRepo-->>CLI: table of contents (all page titles + types)

    CLI->>Agent: candidates + TOC + question
    Agent-->>CLI: reasoned answer with source citations

    CLI-->>User: answer + (source-slug) citations

The retrieval flow is deliberately two-stage to balance cost and intelligence. The first stage — FTS5 keyword search — runs entirely locally with zero LLM tokens, narrowing thousands of wiki pages to a ranked handful in milliseconds. Query sanitization strips special characters (e.g. ?, !) that would break FTS5 syntax, ensuring robustness for natural-language questions.
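The first stage can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the project's code: the `pages` table, its columns, the `sanitize_fts_query` helper, and the choice to OR the terms together are all assumptions made for the example.

```python
import re
import sqlite3


def sanitize_fts_query(question: str) -> str:
    """Turn a natural-language question into a safe FTS5 MATCH expression."""
    # Keep only word characters; punctuation such as ? ! * would otherwise
    # be parsed as FTS5 query syntax and raise an error.
    tokens = re.findall(r"\w+", question.lower())
    # Quote every token so reserved words (AND, OR, NOT, NEAR) are plain terms.
    return " OR ".join(f'"{t}"' for t in tokens)


def search(conn: sqlite3.Connection, question: str, limit: int = 5) -> list[str]:
    """Return ranked page slugs; no LLM call is involved."""
    query = sanitize_fts_query(question)
    if not query:
        return []
    rows = conn.execute(
        "SELECT slug FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical two-page wiki held in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(slug UNINDEXED, title, content)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("concept-rlhf", "RLHF", "Reinforcement learning from human feedback aligns models."),
        ("topic-scaling-laws", "Scaling Laws", "Loss falls predictably with compute."),
    ],
)
print(search(conn, "What is RLHF?"))
```

Joining terms with OR keeps recall high for conversational questions; a stricter variant could join with AND or drop stop words before querying.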

The second stage loads two types of context for the agent: the candidate pages (full detail: summaries, contributions, entity references) and the wiki TOC (a compact structural map of all knowledge). The TOC is critical because it tells the agent what the wiki does not cover. Without it, the agent is more likely to hallucinate answers from weak matches. With it, the agent can reason: "The wiki has pages on RLHF and scaling laws, but nothing on quantum computing, so I should say I don't have that information."

In CLI mode (BYOK), the agent is an LLM call that synthesizes the final answer with source citations. In MCP server mode (passthrough), this stage returns the raw candidates and TOC to the client — letting the client's own model (Copilot, Claude, Gemini) do the reasoning. This dual-mode design means the server never requires an API key when used via MCP.
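The two modes can be pictured as one function with an optional model callable. The sketch below is hypothetical: the `ask` function, the returned dictionary shape, and the prompt string are invented for illustration and are not mcptube's API.

```python
from typing import Callable, Optional


def ask(
    question: str,
    candidates: list[dict],
    toc: list[str],
    llm: Optional[Callable[[str], str]] = None,
) -> dict:
    """Answer with a local LLM call (BYOK) or hand the context back (passthrough)."""
    context = {"question": question, "candidates": candidates, "toc": toc}
    if llm is None:
        # Passthrough: no API key needed, the MCP client's model does the reasoning.
        return {"mode": "passthrough", **context}
    # BYOK: one LLM call synthesizes the answer from the same context.
    prompt = f"Question: {question}\nPages: {candidates}\nTOC: {toc}"
    return {"mode": "byok", "answer": llm(prompt)}


pages = [{"slug": "concept-rlhf", "content": "RLHF aligns models with human feedback."}]
toc = ["concept-rlhf", "topic-scaling-laws"]

# Stand-in for a real model client, used only to make the example runnable.
fake_llm = lambda prompt: "RLHF aligns models (concept-rlhf)"

print(ask("What is RLHF?", pages, toc)["mode"])
print(ask("What is RLHF?", pages, toc, llm=fake_llm)["answer"])
```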


Subsystem Breakdown

1. Ingestion Pipeline

YouTubeExtractor pulls transcript segments via youtube-transcript-api and video metadata via yt-dlp. Transcripts are chunked by natural segment boundaries, not fixed token windows — preserving semantic coherence.

SceneFrameExtractor uses ffmpeg's perceptual scene-change filter (select='gt(scene,{threshold})') rather than fixed-interval sampling. This is deliberate: fixed intervals waste tokens on static frames (slides held for 30s), while scene-change detection captures transitions — the moments of highest information density. The threshold (default 0.4) is configurable.
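As a rough sketch, the filter above could be wired into an ffmpeg invocation like this. Only the `select='gt(scene,...)'` expression and the 0.4 default come from the text; the `scene_frame_command` helper, the output file pattern, and the `-vsync vfr` flag are assumptions for illustration.

```python
from pathlib import Path


def scene_frame_command(video: Path, out_dir: Path, threshold: float = 0.4) -> list[str]:
    """Build an ffmpeg argv that writes one JPEG per detected scene change."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("scene threshold must be between 0 and 1")
    return [
        "ffmpeg",
        "-i", str(video),
        # Keep only frames whose scene-change score exceeds the threshold.
        "-vf", f"select='gt(scene,{threshold})'",
        # Emit only the selected frames instead of padding to a constant rate
        # (recent ffmpeg releases prefer the equivalent -fps_mode vfr).
        "-vsync", "vfr",
        str(out_dir / "scene_%04d.jpg"),
    ]


cmd = scene_frame_command(Path("talk.mp4"), Path("frames"))
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Lower thresholds capture more (and subtler) transitions at a higher vision-token cost; higher thresholds keep only hard cuts.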

VisionDescriber sends detected frames to a vision-capable LLM (GPT-4o, Claude, Gemini — auto-detected via API key priority). Frame descriptions are plain prose, not structured JSON, to maximise the LLM's descriptive latitude.

Why this matters: A transcript of a coding tutorial misses the code on screen. Scene-change vision capture recovers that signal without the token cost of dense fixed-interval sampling.


2. WikiEngine — The Novel Core ⭐

Inspired by the Karpathy LLM Wiki pattern, this is the most architecturally distinctive component.

WikiExtractor takes the combined transcript + frame descriptions and prompts an LLM to extract four typed knowledge objects:

| Type | Semantics | Update Policy |
|---|---|---|
| video | Immutable per-video summary + timestamps | Write-once |
| entity | People, tools, companies | Append-only: new references added, never overwritten |
| topic | Broad themes (e.g. "Scaling Laws") | Synthesis rewritten; per-video contributions immutable |
| concept | Specific ideas (e.g. "RLHF") | Synthesis rewritten; per-video contributions immutable |

WikiEngine handles merge semantics — when a new video references an existing entity or concept, it integrates the new evidence without destroying prior contributions. This is a CRDT-like append model for knowledge, not a vector store replacement index.

Why this matters: Vector stores are retrieval indexes — they don't synthesize. Two videos about "attention mechanisms" produce two isolated chunks. The WikiEngine merges them into a single concept-attention-mechanisms page with a synthesis that evolves as evidence accumulates. Knowledge compounds.

Version history is maintained for all mutable pages: every synthesis rewrite is snapshotted, enabling full auditability.
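The update policy for topic and concept pages can be illustrated with a small data model. This is a sketch under stated assumptions: the `ConceptPage` class, its field names, and the rule that re-adding the same video raises an error are hypothetical, and in the real pipeline the new synthesis would come from an LLM rather than being passed in.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptPage:
    slug: str
    synthesis: str = ""
    # video_id -> what that video contributed; entries are never edited.
    contributions: dict[str, str] = field(default_factory=dict)
    # Snapshots of earlier synthesis text, oldest first.
    history: list[str] = field(default_factory=list)


def merge(page: ConceptPage, video_id: str, contribution: str, new_synthesis: str) -> ConceptPage:
    """Append one video's contribution and replace the synthesis, keeping a snapshot."""
    if video_id in page.contributions:
        raise ValueError(f"contribution from {video_id} is immutable")
    page.contributions[video_id] = contribution
    if page.synthesis:
        page.history.append(page.synthesis)  # snapshot before rewriting
    page.synthesis = new_synthesis
    return page


page = ConceptPage(slug="concept-attention-mechanisms")
merge(page, "vid1", "Explains dot-product attention.", "Attention weighs tokens.")
merge(page, "vid2", "Shows multi-head attention.", "Attention weighs tokens; heads run in parallel.")
print(len(page.contributions), len(page.history))
```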


3. Storage Layer

FileWikiRepository stores wiki pages as JSON on disk, one file per page. Chosen over a document DB deliberately:

  • Human-readable and git-diffable
  • Trivially exportable to markdown/HTML
  • Schema evolution without migrations

SQLite FTS5 maintains a parallel search index over page titles, tags, and content. Chosen over a vector store because:

  • Zero embedding cost at query time
  • Deterministic, auditable results
  • Sub-millisecond latency at thousands of pages

Why not ChromaDB/Pinecone? At wiki scale, BM25-style keyword search over compiled knowledge pages outperforms semantic similarity over raw chunks — the wiki pages are already semantically rich by construction.
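A minimal sketch of the two stores working together, assuming a hypothetical page schema (`type`, `slug`, `title`, `tags`, `content`), a `<slug>.json` file name, and an in-memory index standing in for the on-disk SQLite file:

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def open_index() -> sqlite3.Connection:
    """Create an FTS5 index over titles, tags, and content (slug is stored, not indexed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE pages USING fts5(slug UNINDEXED, title, tags, content)"
    )
    return conn


def save_page(root: Path, conn: sqlite3.Connection, page: dict) -> Path:
    """Write one JSON file per page, then refresh its index row in the same call."""
    path = root / page["type"] / f"{page['slug']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and indentation keep the file stable and git-diffable.
    path.write_text(json.dumps(page, indent=2, sort_keys=True), encoding="utf-8")
    # Replace any previous row so an updated page is never indexed twice.
    conn.execute("DELETE FROM pages WHERE slug = ?", (page["slug"],))
    conn.execute(
        "INSERT INTO pages (slug, title, tags, content) VALUES (?, ?, ?, ?)",
        (page["slug"], page["title"], " ".join(page["tags"]), page["content"]),
    )
    conn.commit()
    return path


tmp = tempfile.TemporaryDirectory()
conn = open_index()
page = {
    "type": "concept",
    "slug": "concept-attention-mechanisms",
    "title": "Attention Mechanisms",
    "tags": ["AI"],
    "content": "Attention lets a model weigh tokens by relevance.",
}
path = save_page(Path(tmp.name), conn, page)
hits = conn.execute(
    "SELECT slug FROM pages WHERE pages MATCH ?", ('"attention"',)
).fetchall()
print(path.name, hits)
```

Because the index row is written in the same call as the file, a page is searchable as soon as `save_page` returns.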


4. Hybrid Retrieval Agent ⭐

The ask command uses a deliberate two-stage pattern:

  1. FTS5 keyword search — narrows the full wiki to a small candidate set (milliseconds, zero LLM cost)
  2. LLM agent — receives candidates + the wiki table of contents, reasons about relevance, synthesizes a grounded answer with source citations

Why this matters over RAG: Standard RAG retrieves chunks and generates. The agent here retrieves compiled knowledge pages and reasons. The wiki TOC gives the agent structural awareness of what knowledge exists — enabling it to correctly say "I don't have information about X" rather than hallucinating from weak chunk matches.
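One way to picture how the TOC reaches the agent is as plain prompt assembly. The `build_prompt` function, the page dictionary keys, and the instruction wording below are assumptions for illustration; the project's actual prompt is not shown in this document.

```python
def build_prompt(question: str, candidates: list[dict], toc: list[dict]) -> str:
    """Combine full candidate pages with a compact map of everything in the wiki."""
    toc_lines = [f"- {p['slug']} ({p['type']}): {p['title']}" for p in toc]
    page_blocks = [f"[{p['slug']}]\n{p['content']}" for p in candidates]
    return "\n\n".join(
        [
            "Answer using only the wiki pages below and cite page slugs in parentheses.",
            "If the table of contents shows no page covering the question, say so.",
            "Table of contents:\n" + "\n".join(toc_lines),
            "Candidate pages:\n" + "\n\n".join(page_blocks),
            f"Question: {question}",
        ]
    )


toc = [
    {"slug": "concept-rlhf", "type": "concept", "title": "RLHF"},
    {"slug": "topic-scaling-laws", "type": "topic", "title": "Scaling Laws"},
]
candidates = [{"slug": "concept-rlhf", "content": "RLHF aligns models with human feedback."}]
prompt = build_prompt("What is quantum computing?", candidates, toc)
print(prompt)
```

The TOC lists every page title even when only a few pages are loaded in full, so the model can see that nothing in the wiki covers the question.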


5. MCP Server

Exposes all subsystems as tools consumable by any MCP-compatible client. Report and synthesis tools use a passthrough pattern — returning structured data for the client's own LLM to analyse, rather than making a second LLM call server-side. This avoids double-billing and lets the client model apply its own reasoning style.


Key Design Decisions

| Decision | Alternative Considered | Reason |
|---|---|---|
| Scene-change frame extraction | Fixed-interval sampling | Higher signal/token ratio |
| Wiki knowledge model | Vector store chunks | Knowledge compounds; no re-discovery per query |
| FTS5 retrieval | Embedding similarity | Compiled wiki pages are already semantic |
| File-based wiki storage | SQLite/document DB | Human-readable, git-diffable, zero migrations |
| Append-only entity updates | Full rewrite | Source attribution preserved; full auditability |
| Passthrough MCP reports | Server-side LLM | Avoids double-billing; client model reasons |

✨ Features

Feature CLI MCP Server
Add/remove YouTube videos
Wiki knowledge base (auto-built)
Scene-change frame extraction + vision analysis
Full-text wiki search (FTS5)
Agentic Q&A over wiki
Browse wiki pages (entities, topics, concepts)
Wiki version history
Wiki export (markdown, HTML)
Illustrated reports (single & cross-video) ✅ (BYOK) ✅ (passthrough)
YouTube discovery + clustering ✅ (BYOK)
Cross-video synthesis ✅ (BYOK) ✅ (passthrough)
Text-only processing mode

BYOK = Bring Your Own Key (Anthropic, OpenAI, or Google)
Passthrough = The MCP client's own LLM does the analysis


📦 Installation

Prerequisites

  • Python 3.12 or 3.13
  • ffmpeg — required for frame extraction (install guide)

Recommended: pipx

pipx install mcptube --python python3.12

Alternative: pip

python3.12 -m venv venv
source venv/bin/activate
pip install mcptube

Verify installation

mcptube --help

🚀 Quick Start

# 1. Add a video (builds wiki automatically)
mcptube add "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# 2. Add with text-only processing (cheaper, faster)
mcptube add "https://www.youtube.com/watch?v=abc123" --text-only

# 3. Browse the wiki
mcptube wiki list
mcptube wiki show "video-dQw4w9WgXcQ"

# 4. Search the knowledge base
mcptube search "main topic"

# 5. Ask a question (agentic retrieval over wiki)
mcptube ask "What are the key ideas discussed?"

# 6. View the table of contents
mcptube wiki toc

💡 Always wrap multi-word arguments in double quotes.


📖 CLI Reference

Library Management

| Command | Description | Example |
|---|---|---|
| `mcptube add "<url>"` | Ingest video + build wiki (full analysis) | `mcptube add "https://youtu.be/dQw4w9WgXcQ"` |
| `mcptube add "<url>" --text-only` | Ingest without vision processing | `mcptube add "https://youtu.be/abc" --text-only` |
| `mcptube list` | List all videos with tags | `mcptube list` |
| `mcptube info <query>` | Show full video details (transcript, chapters) | `mcptube info 1` |
| `mcptube remove <query>` | Remove video + clean wiki references | `mcptube remove 1` |

<query> can be a video index number, video ID, or partial title.


Wiki Knowledge Base

| Command | Description | Example |
|---|---|---|
| `mcptube wiki list` | Browse all wiki pages | `mcptube wiki list` |
| `mcptube wiki list --type <type>` | Filter by type: video, entity, topic, concept | `mcptube wiki list --type concept` |
| `mcptube wiki list --tag <tag>` | Filter by tag | `mcptube wiki list --tag AI` |
| `mcptube wiki show <slug>` | Read a specific wiki page in full | `mcptube wiki show "entity-openai"` |
| `mcptube wiki search "<query>"` | Full-text search across all wiki pages | `mcptube wiki search "attention"` |
| `mcptube wiki toc` | Table of contents (all pages, compact) | `mcptube wiki toc` |
| `mcptube wiki history <slug>` | Version history for a wiki page | `mcptube wiki history "topic-ml"` |
| `mcptube wiki export` | Export all pages as markdown (default) | `mcptube wiki export -o wiki_export/` |
| `mcptube wiki export --format html` | Export all pages as single HTML file | `mcptube wiki export --format html -o wiki.html` |
| `mcptube wiki export --page <slug>` | Export a single page | `mcptube wiki export --page "entity-openai" -o openai.md` |

Search & Ask

| Command | Description | Example |
|---|---|---|
| `mcptube search "<query>"` | Full-text search, returns page list | `mcptube search "transformers"` |
| `mcptube ask "<question>"` | Agentic Q&A over wiki (BYOK) | `mcptube ask "What is self-attention?"` |

Frames

| Command | Description | Example |
|---|---|---|
| `mcptube frame <query> <timestamp>` | Extract frame at exact timestamp (seconds) | `mcptube frame 1 30.5` |
| `mcptube frame-query <query> "<description>"` | Extract frame by transcript match | `mcptube frame-query 1 "when they show the diagram"` |

Analysis & Reports (BYOK)

| Command | Description | Example |
|---|---|---|
| `mcptube classify <query>` | LLM classify + tag a video | `mcptube classify 1` |
| `mcptube report <query>` | Generate illustrated report for one video | `mcptube report 1` |
| `mcptube report <query> --focus "<topic>"` | Guide report with a focus query | `mcptube report 1 --focus "RLHF"` |
| `mcptube report <query> --format html -o <file>` | Save report as HTML | `mcptube report 1 --format html -o report.html` |
| `mcptube report-query "<topic>"` | Cross-video report on a topic | `mcptube report-query "scaling laws"` |
| `mcptube report-query "<topic>" --tag <tag>` | Cross-video report filtered by tag | `mcptube report-query "AI" --tag research` |
| `mcptube report-query "<topic>" -o <file>` | Save cross-video report | `mcptube report-query "AI" --format html -o report.html` |
| `mcptube synthesize-cmd "<topic>" -v <id> -v <id>` | Cross-video theme synthesis | `mcptube synthesize-cmd "RLHF" -v id1 -v id2` |
| `mcptube synthesize-cmd "<topic>" -v <id> --format html -o <file>` | Save synthesis as HTML | `mcptube synthesize-cmd "AI" -v id1 --format html -o out.html` |
| `mcptube discover "<topic>"` | Search YouTube, cluster results (no ingest) | `mcptube discover "prompt engineering"` |

Server

| Command | Description |
|---|---|
| `mcptube serve` | Start MCP server over HTTP (default 127.0.0.1:9093) |
| `mcptube serve --stdio` | Start MCP server over stdio (for Claude Desktop) |
| `mcptube serve --host <host> --port <port>` | Custom host/port |
| `mcptube serve --reload` | Hot-reload mode for development |

🧩 Wiki Page Types

When you ingest a video, mcptube-vision builds four types of wiki pages:

| Page Type | Created From | Update Policy |
|---|---|---|
| Video | Each ingested video | Write-once (immutable) |
| Entity | People, companies, tools mentioned | Append-only (new references added) |
| Topic | Broad themes (e.g., "Machine Learning") | Synthesis rewritten, per-video contributions immutable |
| Concept | Specific ideas (e.g., "Scaling Laws") | Synthesis rewritten, per-video contributions immutable |

Principle: Raw source content (what was said/shown in each video) is never modified. Only synthesis summaries evolve as new videos are added. Version history is maintained for all changes.


🔍 How Search Works (Hybrid Retrieval)

mcptube-vision uses a two-step hybrid approach:

  1. SQLite FTS5 — keyword search narrows thousands of wiki pages to a handful of candidates (milliseconds, zero LLM cost)
  2. LLM Agent — reads candidates + wiki table of contents, reasons about relevance, synthesizes an answer

This gives you the speed of keyword search with the intelligence of an LLM agent.


👁️ Vision Pipeline

When you ingest a video without --text-only, mcptube-vision:

  1. Extracts key frames using ffmpeg scene-change detection (select='gt(scene,0.4)')
  2. Sends frames to a vision-capable LLM (GPT-4o, Claude, Gemini) for description
  3. Combines frame descriptions with transcript in the knowledge extraction pass

This captures visual content (slides, code, diagrams, demos) that transcripts alone miss.


🔌 MCP Client Setup

mcptube exposes 25+ MCP tools via two transports:

| Transport | How it works | Used by |
|---|---|---|
| Streamable HTTP (/mcp) | Client connects to a running mcptube server | VS Code, Claude Code, Cursor, Windsurf, Codex, Gemini CLI |
| stdio | MCP client spawns mcptube as a child process | Claude Desktop |

ℹ️ The MCP server is currently available for local use only. You must run mcptube serve locally or let the client spawn it.


VS Code + GitHub Copilot ✅ Tested

Open the Command Palette (Cmd+Shift+P), run MCP: Open User Configuration, and add:

{
  "servers": {
    "mcptube": {
      "url": "http://127.0.0.1:9093/mcp"
    }
  }
}

Then start the server in a terminal:

mcptube serve

Claude Code ✅ Tested

claude mcp add mcptube --transport http http://127.0.0.1:9093/mcp

Then start the server in a separate terminal:

mcptube serve

Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

If installed via pipx (recommended):

{
  "mcpServers": {
    "mcptube": {
      "command": "mcptube",
      "args": ["serve", "--stdio"]
    }
  }
}

If installed in a virtual environment:

{
  "mcpServers": {
    "mcptube": {
      "command": "/full/path/to/.venv/bin/mcptube",
      "args": ["serve", "--stdio"]
    }
  }
}

No separate server needed — Claude Desktop spawns the process automatically.


Cursor

Create or edit ~/.cursor/mcp.json (global) or .cursor/mcp.json (project-scoped):

{
  "mcpServers": {
    "mcptube": {
      "url": "http://127.0.0.1:9093/mcp"
    }
  }
}

Then start the server:

mcptube serve

Windsurf

Edit ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "mcptube": {
      "serverUrl": "http://127.0.0.1:9093/mcp"
    }
  }
}

Then start the server:

mcptube serve

OpenAI Codex

Edit ~/.codex/config.toml:

[mcp_servers.mcptube]
url = "http://127.0.0.1:9093/mcp"

Then start the server:

mcptube serve

Gemini CLI

Edit ~/.gemini/settings.json:

{
  "mcpServers": {
    "mcptube": {
      "httpUrl": "http://127.0.0.1:9093/mcp"
    }
  }
}

Then start the server:

mcptube serve

Verify Connection

Once connected, ask your MCP client:

use mcptube. list all videos in my library

It should call the list_videos tool and return results.

MCP Tools

| Tool | Description |
|---|---|
| add_video | Ingest video + build wiki |
| list_videos | List library |
| remove_video | Remove video + clean wiki |
| wiki_list | Browse wiki pages |
| wiki_show | Read a wiki page |
| wiki_search | Full-text search |
| wiki_toc | Table of contents |
| wiki_ask | Agentic Q&A |
| wiki_history | Version history |
| get_frame | Extract frame (inline image) |
| get_frame_by_query | Frame by transcript match |
| classify_video | Get metadata for classification |
| generate_report | Get data for single-video report |
| generate_report_from_query | Get data for cross-video report |
| synthesize | Get data for theme synthesis |
| discover_videos | Search YouTube |
| ask_video | Single-video Q&A data |
| ask_videos | Multi-video Q&A data |

⚙️ Configuration

All settings can be overridden via environment variables prefixed with MCPTUBE_:

| Setting | Default | Env Var |
|---|---|---|
| Data directory | ~/.mcptube | MCPTUBE_DATA_DIR |
| Server host | 127.0.0.1 | MCPTUBE_HOST |
| Server port | 9093 | MCPTUBE_PORT |
| Default LLM model | gpt-4o | MCPTUBE_DEFAULT_MODEL |
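For readers who want to mirror these settings in their own scripts, the lookup could be sketched as below. The variable names and defaults come from the table above; the `Settings` class and `load_settings` function are hypothetical and are not part of mcptube.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    data_dir: Path
    host: str
    port: int
    default_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read MCPTUBE_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        data_dir=Path(env.get("MCPTUBE_DATA_DIR", "~/.mcptube")).expanduser(),
        host=env.get("MCPTUBE_HOST", "127.0.0.1"),
        port=int(env.get("MCPTUBE_PORT", "9093")),
        default_model=env.get("MCPTUBE_DEFAULT_MODEL", "gpt-4o"),
    )


defaults = load_settings({})
custom = load_settings({"MCPTUBE_PORT": "8000", "MCPTUBE_DATA_DIR": "/tmp/mcptube"})
print(defaults.port, custom.port, custom.data_dir)
```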

BYOK API Keys

Set one or more to enable LLM features:

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AI..."

Auto-detection priority: Anthropic → OpenAI → Google.
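The priority order amounts to a first-match scan over the three variables. A minimal sketch, assuming a hypothetical `detect_provider` helper and provider labels chosen for the example:

```python
import os
from typing import Mapping, Optional

# Checked in order; the first variable that is set and non-empty wins.
PRIORITY = [
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("google", "GOOGLE_API_KEY"),
]


def detect_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the highest-priority provider with a key, or None if no key is set."""
    env = os.environ if env is None else env
    for provider, variable in PRIORITY:
        if env.get(variable):
            return provider
    return None


print(detect_provider({"OPENAI_API_KEY": "sk-...", "GOOGLE_API_KEY": "AI..."}))
print(detect_provider({}))
```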


📁 Data Layout

~/.mcptube/
├── mcptube.db          # Video metadata (SQLite)
├── wiki.db             # FTS5 search index (SQLite)
├── wiki/
│   ├── video/          # Video pages (JSON)
│   ├── entity/         # Entity pages (JSON)
│   ├── topic/          # Topic pages (JSON)
│   ├── concept/        # Concept pages (JSON)
│   └── _history/       # Version history
└── frames/
    ├── <id>_<ts>.jpg   # Single extracted frames
    └── <id>_scenes/    # Scene-change frames + metadata

🧪 Development

git clone https://github.com/0xchamin/mcptube.git
cd mcptube
git checkout vision
python3.12 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
pytest

🗺️ Roadmap

  • Wiki knowledge engine (entities, topics, concepts)
  • Scene-change frame extraction + vision analysis
  • Hybrid retrieval (FTS5 + agentic)
  • CLI + MCP server
  • Playlist/series support
  • Web app with early access sign-up
  • Token-based payment integration

📄 License

MIT — see LICENSE for details.