GitHub - Ethereal-Agents/arxiv-scholar: Retrieval system over the arXiv corpus
dubeyaayush0 · 2026-06-16 · via Hacker News - Newest: "AI"
title: Arxiv Scholar
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false

A high-performance Retrieval-Augmented Generation (RAG) system for AI Engineering research.

ArXiv Scholar is an end-to-end pipeline that ingests, parses, chunks, and embeds academic papers from arXiv into a hybrid vector database — enabling fast semantic search over scientific documents. Built from scratch without high-level abstraction frameworks (no LangChain) for full architectural control and transparent failure modes.

Status: ~5,600 AI Engineering papers indexed on Qdrant Cloud. Streaming API live. Scaling via agent-driven ingestion planned.


Table of Contents

  • Architecture
  • Key Features
  • Project Structure
  • Tech Stack
  • Getting Started
  • Usage
  • Evaluation
  • Contributing
  • License

Architecture

flowchart TD
    subgraph Ingestion Pipeline
        A[ArxivUnifiedEngine] -->|Downloads PDFs from GCS| B(LocalDirectoryReader / GCSBucketReader)
        B -->|Yields Documents| C(LayoutAwareChunker)
        C -->|Produces Chunks| D{Embedding Service}
        D -->|Dense Vectors| E[FastEmbedEmbedder<br/><small>BAAI/bge-m3</small>]
        D -->|Sparse Vectors| F[SparseBM25Embedder<br/><small>Qdrant/bm25</small>]
        E --> G[(Qdrant Cloud)]
        F --> G
    end

    subgraph Retrieval Pipeline
        H[User Query] --> I{ML Query Router}
        I -->|Simple| J[Direct Hybrid Search]
        I -->|Complex/Comparative| K[LLM Decomposition + Metadata Extraction]
        I -->|Conceptual| L[HyDE - Hypothetical Document Embedding]
        K -->|Sub-queries + Filters| J
        L -->|Generated Abstract| J
        J -->|Dense + Sparse Fetch| G
        G -->|Min-Max Normalized + Weighted Fusion| N[Final Results]
    end

Note on reranking: A cross-encoder reranker (jina-reranker-v1-tiny-en) is implemented in the codebase but is disabled by default (USE_RERANKER=False). During benchmarking, the reranker caused performance degradation on the current corpus size and was turned off. The code is retained for future evaluation at larger scale.

Pipeline Stages

| Stage | Component | Description |
| --- | --- | --- |
| Download | ArxivUnifiedEngine | Streams PDFs from the public arxiv-dataset GCS bucket in configurable batches. Maintains a JSON cursor (current_month, last_file) for resumable, crash-safe ingestion across YYMM folders. |
| Parsing | LocalDirectoryReader / GCSBucketReader | Extracts raw text from PDFs via PyMuPDF. Computes SHA-256 hashes for deduplication and extracts arXiv IDs from filenames using regex. GCS reader operates fully in-memory for serverless deployments. |
| Chunking | LayoutAwareChunker | Uses Docling to visually parse PDF layouts (headers, paragraphs, tables) and produce semantically grouped chunks. Falls back to SlidingWindowChunker for oversized blocks or when Docling is unavailable. |
| Embedding | SentenceTransformerEmbedder + SparseBM25Embedder | Generates dense vectors (BAAI/bge-m3) via SentenceTransformers (PyTorch) and sparse BM25 vectors via FastEmbed (ONNX) concurrently. |
| Storage | QdrantVectorStore | Upserts chunks with deterministic UUID-v5 point IDs to Qdrant Cloud. Supports both cloud mode (URL + API key) and in-memory mode for testing. |
| Retrieval | HybridRetriever | Fetches dense and sparse results independently, applies min-max normalization, and fuses scores with configurable weights (default: dense=1.0, sparse=0.3). |
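
The fusion step in the Retrieval stage can be illustrated with a minimal sketch. This is not the project's HybridRetriever; the function names (`min_max_normalize`, `fuse`) and the choice to score a document missing from one result list as 0.0 on that side are assumptions made for illustration. The default weights mirror the dense=1.0 and sparse=0.3 values stated above.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to [0, 1]. If all scores are equal, every hit maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    dense_weight: float = 1.0,
    sparse_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Weighted sum of independently normalized dense and sparse scores.

    Assumption: a document absent from one result list contributes 0.0
    for that side rather than being dropped.
    """
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {
        doc_id: dense_weight * dense_n.get(doc_id, 0.0)
        + sparse_weight * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Example: cosine-like dense scores and unbounded BM25 scores share one scale after normalization.
ranked = fuse({"a": 0.9, "b": 0.5, "c": 0.1}, {"b": 12.0, "d": 4.0})
```

Normalizing each list separately matters because BM25 scores are unbounded while dense similarity scores usually are not, so a raw weighted sum would let the sparse side dominate.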

Key Features

  • Hybrid Search — Combines dense semantic embeddings with sparse BM25 keyword matching, fused via weighted min-max normalization for superior recall over either method alone.
  • Intelligent Query Routing — A hybrid ML + heuristic router (<1ms) classifies incoming queries into Direct, Decompose, or HyDE paths. Includes regex-based Hard Overrides for guaranteed metadata routing (e.g., year filtering) without ML hallucinations, plus short-query detection that auto-routes to HyDE.
  • LLM-Powered Query Decomposition — Complex queries are split into atomic sub-queries with strict metadata filters (e.g., publication year) extracted via JSON from an LLM. Filters are applied natively at the Qdrant Prefetch level.
  • Dynamic Compute Budgeting — Sub-queries from decomposition are fetched concurrently and pooled before global deduplication and scoring. The fetch budget is dynamically allocated across sub-queries.
  • Layout-Aware PDF Parsing — Docling-based visual document understanding preserves the semantic structure of academic papers (sections, tables, equations) instead of naive text splitting.
  • Crash-Safe Batch Ingestion — Cursor-based state management allows the pipeline to resume from the exact point of failure across large ingestion runs.
  • Streaming API — FastAPI endpoint with Server-Sent Events (SSE) streams retrieved sources and LLM-synthesized answers token-by-token.
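
The ordering of hard overrides, short-query detection, and ML classification described under Intelligent Query Routing might look roughly like the sketch below. Everything here is hypothetical: the `Route` type, the three-token threshold, the year regex, and the decision to send year-filtered queries down the decompose path are assumptions, and `classifier` stands in for whatever scikit-learn model the router actually loads.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Assumed pattern: a four-digit year between 1900 and 2099.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")
# Assumed threshold for what counts as a "short" query.
SHORT_QUERY_MAX_TOKENS = 3


@dataclass
class Route:
    path: str  # "direct", "decompose", or "hyde"
    filters: Dict[str, int] = field(default_factory=dict)


def route_query(query: str, classifier: Optional[Callable[[str], str]] = None) -> Route:
    """Apply deterministic rules first, then fall back to the classifier."""
    # Hard override: an explicit year always yields a metadata filter,
    # regardless of what the classifier would have predicted.
    year_match = YEAR_PATTERN.search(query)
    if year_match:
        return Route("decompose", {"year": int(year_match.group())})
    # Short queries carry little signal, so they are expanded via HyDE.
    if len(query.split()) <= SHORT_QUERY_MAX_TOKENS:
        return Route("hyde")
    # Everything else goes to the (hypothetical) ML classifier, if provided.
    if classifier is not None:
        return Route(classifier(query))
    return Route("direct")
```

Running the regex rules before the classifier keeps the metadata path deterministic: a query that names a year cannot be misrouted by a low-confidence prediction.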

Project Structure

arxiv-scholar/
├── main.py                          # Full ingestion pipeline orchestrator
├── app.py                           # Streamlit chat UI
├── configs/
│   └── config.py                    # Centralized env-var-backed configuration
├── src/arxiv_scholar/
│   ├── schema.py                    # Core data models (Document, Chunk)
│   ├── api/
│   │   ├── schema.py                # REST API request/response models (SSE events)
│   │   └── server.py                # FastAPI streaming endpoint (POST /api/v1/query)
│   ├── chunking/
│   │   ├── base.py                  # Abstract BaseChunker interface
│   │   ├── layout.py                # Docling-based layout-aware chunker
│   │   └── sliding_window.py        # Fixed-size sliding window fallback chunker
│   ├── download/
│   │   └── arxiv_ingestion.py       # GCS-backed PDF downloader with cursor state
│   ├── embedding/
│   │   ├── base.py                  # Abstract BaseEmbedder interface
│   │   ├── fastembed_embedder.py    # ONNX CPU embedder (dense + sparse BM25)
│   │   └── st_embedder.py           # SentenceTransformer embedder (GPU)
│   ├── ingestion/
│   │   ├── base.py                  # Abstract DocumentReader interface
│   │   ├── local.py                 # Local filesystem PDF reader (PyMuPDF)
│   │   └── gcs.py                   # In-memory GCS bucket reader (serverless)
│   ├── llm/
│   │   └── service.py               # LLM client (decomposition, HyDE, synthesis)
│   ├── retrieval/
│   │   ├── retrieval.py             # Hybrid retriever with weighted fusion
│   │   ├── orchestrator.py          # Query orchestrator (routes → retrieves → fuses)
│   │   └── router.py                # ML + heuristic query router
│   └── storage/
│       ├── base.py                  # Abstract BaseVectorStore interface
│       └── qdrant_store.py          # Qdrant client (upsert, search, hybrid search)
├── scripts/
│   ├── generate_arxiv_manifest.py   # Paper selection criteria & manifest generator
│   ├── download_qdrant.sh           # Qdrant binary installer
│   ├── generate_eval_dataset.py     # Evaluation dataset generator
│   ├── run_benchmarks.py            # Retrieval benchmark runner
│   └── ...                          # Various ingestion and import utilities
├── colab/
│   ├── batch_gcs_to_drive.py        # Colab script: batch download from GCS
│   └── generate_embedded_dataset.py # Colab script: embed and push to Qdrant Cloud
├── notebooks/                       # Jupyter notebooks for development & testing
├── tests/                           # Unit and integration tests
├── docs/                            # GitHub Pages website
├── Dockerfile                       # Production container (HF Spaces / Cloud Run)
├── docker-compose.yml               # Local Qdrant service definition
└── pyproject.toml                   # Project metadata and dependencies

Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Dense Embedding | BAAI/bge-m3 via SentenceTransformers (PyTorch) | Semantic vectors |
| Sparse Embedding | Qdrant/bm25 via FastEmbed | BM25 term-frequency vectors for keyword matching |
| Vector Database | Qdrant Cloud | Hybrid storage with server-side query batching |
| PDF Parsing | PyMuPDF + Docling | Text extraction and layout-aware chunking |
| API | FastAPI + Uvicorn | Streaming SSE endpoint |
| LLM | Configurable (OpenAI-compatible API) | Query decomposition, HyDE generation, answer synthesis |
| Query Router | scikit-learn + regex heuristics | Sub-millisecond query classification |
| Orchestration | Pure Python (no LangChain) | Full architectural control |

Getting Started

Prerequisites

  • Python ≥ 3.10
  • uv (recommended) or pip

Installation

# Clone the repository
git clone https://github.com/dubeyaayush07/arxiv-scholar.git
cd arxiv-scholar

# Create a virtual environment and install dependencies
uv venv && source .venv/bin/activate
uv pip install -e .

Environment Variables

Create a .env file or export the following:

# Required — Qdrant Cloud connection
export QDRANT_URL="your_qdrant_cloud_url"
export QDRANT_API_KEY="your_qdrant_api_key"
export QDRANT_COLLECTION="Arxiv-Scholar"

# Required for LLM features (decomposition, HyDE, answer synthesis)
export LLM_API_KEY="your_key_here"
export LLM_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"  # or any OpenAI-compatible endpoint
export LLM_MODEL="claude-haiku-4-5"

# Optional overrides (defaults shown)
export EMBEDDING_BACKEND="fastembed"            # or "sentence-transformers"
export EMBEDDING_MODEL="BAAI/bge-m3"
export SPARSE_EMBEDDING_MODEL="Qdrant/bm25"
export USE_RERANKER="False"                     # Disabled — causes performance degradation
export DENSE_WEIGHT="0.6"
export SPARSE_WEIGHT="0.4"

# For local Qdrant (alternative to cloud)
export QDRANT_HOST="localhost"
export QDRANT_PORT="6333"

Usage

Ingestion Pipeline

The full ingestion pipeline is implemented in main.py. It downloads PDFs from the arXiv GCS bucket, parses them with Docling, chunks them, generates dual embeddings, and upserts to Qdrant.

# Trial run (downloads 2 PDFs and processes them against an in-memory Qdrant)
python main.py --trial

# Production run (continuous batch ingestion)
python main.py

How we actually ingested data: Due to local compute constraints, the initial ~5,600 paper corpus was ingested via Google Colab. The colab/batch_gcs_to_drive.py script batches PDFs from GCS, and colab/generate_embedded_dataset.py generates embeddings and pushes them to Qdrant Cloud. See scripts/generate_arxiv_manifest.py for the exact paper selection criteria (keyword groups, domain exclusions, golden terms).
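
Two ideas make a re-run after a crash safe: a cursor that records the last file processed, and point IDs derived deterministically from the content's identity so that repeated upserts overwrite rather than duplicate. The sketch below illustrates both under stated assumptions. The cursor keys (current_month, last_file) come from the Pipeline Stages description, but the helper names, the namespace URL, and the `arxiv_id:chunk_index` ID scheme are hypothetical and may differ from what main.py does.

```python
import json
import uuid
from pathlib import Path
from typing import Dict, List, Optional

# Hypothetical namespace; any fixed UUID works as long as it never changes.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/arxiv-scholar")


def load_cursor(path: Path) -> Dict[str, Optional[str]]:
    """Read the cursor, or return an empty one on the first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"current_month": None, "last_file": None}


def save_cursor(path: Path, current_month: str, last_file: str) -> None:
    """Write to a temp file and rename, so a crash never leaves a half-written cursor."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"current_month": current_month, "last_file": last_file}))
    tmp.replace(path)


def pending_files(files: List[str], last_file: Optional[str]) -> List[str]:
    """Return files that sort after the cursor position (assumes lexicographic processing order)."""
    ordered = sorted(files)
    if last_file is None:
        return ordered
    return [name for name in ordered if name > last_file]


def point_id(arxiv_id: str, chunk_index: int) -> str:
    """Deterministic UUID-v5: the same chunk always maps to the same point ID."""
    return str(uuid.uuid5(POINT_NAMESPACE, f"{arxiv_id}:{chunk_index}"))
```

With this arrangement the cursor only needs to be approximately right: if a batch is replayed because the cursor was saved before the crash, the repeated upserts land on the same point IDs.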

API Server

The API server is implemented in src/arxiv_scholar/api/server.py. It exposes a streaming SSE endpoint that routes queries, retrieves results, and synthesizes answers via LLM.

Live hosted endpoint (Hugging Face Spaces):

curl -N -X POST "https://trinetra-dev-arxiv-scholar.hf.space/api/v1/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is contrastive learning?",
    "limit": 5,
    "use_reranker": false
  }'

Run locally:

# Start the server
uvicorn src.arxiv_scholar.api.server:app --reload

# Or via Docker
docker build -t arxiv-scholar .
docker run -p 7860:7860 --env-file .env arxiv-scholar

# Query your local instance (uvicorn serves on port 8000 by default; use port 7860 for the Docker container)
curl -N -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "attention mechanisms in transformers", "limit": 10, "use_reranker": false}'
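
For callers that prefer Python over curl, the stream can be consumed with the standard library alone. The sketch below is an assumption-laden illustration: the request body fields match the curl examples above, but the event names and payload shapes the server emits are not documented here, so the parser only handles the generic `event:` and `data:` fields of the Server-Sent Events format. `iter_sse_events` and `stream_query` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Dict, Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group SSE lines into events; a blank line terminates each event."""
    event_type, data_parts = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # Blank line: dispatch the buffered event, if it carried any data.
            if data_parts:
                yield {"event": event_type, "data": "\n".join(data_parts)}
            event_type, data_parts = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space after the colon
        if name == "event":
            event_type = value
        elif name == "data":
            data_parts.append(value)
    # An event not terminated by a blank line is discarded, as the SSE format specifies.


def stream_query(url: str, query: str, limit: int = 5) -> Iterator[Dict[str, str]]:
    """POST a query and yield SSE events as they arrive."""
    body = json.dumps({"query": query, "limit": limit, "use_reranker": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        yield from iter_sse_events(line.decode("utf-8") for line in response)
```

A caller would loop over `stream_query("http://localhost:8000/api/v1/query", "attention mechanisms in transformers")` and handle each event as it arrives.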

Running Tests


Evaluation

The retrieval pipeline is evaluated using programmatic metrics via scripts/run_benchmarks.py.

# Generate evaluation dataset
python scripts/generate_eval_dataset.py

# Run benchmarks
python scripts/run_benchmarks.py

Paper Selection Criteria

Papers are curated using a multi-dimensional filtering system defined in scripts/generate_arxiv_manifest.py:

  • Target Categories: cs.CL, cs.AI, cs.IR, cs.LG, cs.SE
  • Date Filter: Papers updated after 2022-01-01
  • Keyword Groups: RAG & Retrieval, Large Language Models, Agents & Reasoning, Training & Alignment, Safety & Quality, Inference & Systems, AI Developer Tools
  • Inclusion Logic:
    • Golden Term Bypass: Ultra-high-signal terms (vLLM, SWE-bench, FlashAttention, etc.) → auto-include
    • Regular: Matches ≥ 3 keyword groups
    • Rescued: Matches exactly 2 groups + ≥ 2 AI engineering meta-terms (pipeline, benchmark, framework, etc.)
  • Domain Exclusions: Medical, physics, quantum, agriculture, autonomous driving, pure math theory, etc.
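
The inclusion logic above reduces to a short decision function. The sketch below is a simplified illustration, not the code in scripts/generate_arxiv_manifest.py: the term sets contain only the examples named in the list, matching is plain case-insensitive substring search, and the category, date, and domain-exclusion filters are assumed to have run beforehand. `classify_paper` is a hypothetical name.

```python
from typing import Set

# Only the examples named above; the real lists are longer.
GOLDEN_TERMS = {"vllm", "swe-bench", "flashattention"}
META_TERMS = {"pipeline", "benchmark", "framework"}


def classify_paper(matched_groups: Set[str], text: str) -> str:
    """Return "golden", "regular", "rescued", or "excluded" for one paper.

    matched_groups: names of the keyword groups the paper already matched.
    text: title plus abstract, used for golden-term and meta-term checks.
    """
    lowered = text.lower()
    # Golden Term Bypass: a single high-signal term is enough.
    if any(term in lowered for term in GOLDEN_TERMS):
        return "golden"
    # Regular: broad coverage across at least three keyword groups.
    if len(matched_groups) >= 3:
        return "regular"
    # Rescued: exactly two groups, backed by at least two engineering meta-terms.
    meta_hits = sum(1 for term in META_TERMS if term in lowered)
    if len(matched_groups) == 2 and meta_hits >= 2:
        return "rescued"
    return "excluded"
```

Checking golden terms first means a paper on, say, vLLM is kept even when it matches only one keyword group.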

Contributing

Contributions are welcome. Please open an issue first to discuss proposed changes.

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/your-feature).
  3. Commit your changes (git commit -m "feat: add your feature").
  4. Push to your fork and open a Pull Request.

License

This project is open-source under the MIT License.