pypdf vs PdfPig: Text Extraction at Scale
Milliseconds.dev · 2026-06-01 · via DEV Community

Overview

PDF text extraction is a common pre-processing step in data pipelines — ingesting research papers, legal documents, or reports before embedding or indexing. Both pypdf and PdfPig are pure managed-code parsers: no native binaries, no OCR, no system PDF renderer. They implement the same PDF specification operations in their respective languages.

This makes the benchmark unusually clean: with no native code on either side, the performance difference comes mainly from language execution speed and from how each parser handles memory, not from an external rendering engine.

Benchmark Setup

200 recent arXiv PDFs (mixed technical papers, 5–40 pages each). Tested on subsets of 10, 50, 100, and 200 files. Both libraries extract all text from all pages; output is validated for page-count agreement and character-count agreement within 15% (pypdf and PdfPig decode whitespace and encoding tables slightly differently).
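The agreement check described above could look roughly like the following. This is a minimal sketch, not the benchmark's actual code: the function name `outputs_agree`, its parameters, and the decision to measure the character difference relative to the larger of the two counts are all assumptions made for illustration.

```python
def outputs_agree(pages_a: int, chars_a: int,
                  pages_b: int, chars_b: int,
                  tolerance: float = 0.15) -> bool:
    """Return True if two extraction results are close enough to compare.

    Hypothetical helper: page counts must match exactly, and character
    counts may differ by at most `tolerance`, measured against the larger one.
    """
    if pages_a != pages_b:
        return False
    larger = max(chars_a, chars_b)
    if larger == 0:
        # Neither extractor found any text (for example, a scanned PDF).
        return True
    return abs(chars_a - chars_b) / larger <= tolerance


# 3,500 characters apart on 40,000 is 8.75%, inside the 15% band.
print(outputs_agree(12, 40_000, 12, 36_500))
# 10,000 characters apart on 40,000 is 25%, outside the band.
print(outputs_agree(12, 40_000, 12, 30_000))
```

Measuring against the larger count keeps the check symmetric, so swapping the two libraries does not change the verdict.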

Results

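| PDFs | Pages | Python (pypdf) | .NET (PdfPig) | Speedup |
|------|--------|----------------|---------------|---------|
| 10 | ~120 | ~0.9 s | ~230 ms | 3.9× |
| 50 | ~600 | ~4.2 s | ~810 ms | 5.2× |
| 100 | ~1,200 | ~8.5 s | ~1.4 s | 6.1× |
| 200 | ~2,400 | ~17 s | ~2.7 s | 6.2× |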

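The speedup grows with corpus size, from 3.9× at 10 files to roughly 6× at 100 and 200 files. A fixed per-document cost on the pypdf side would scale with the corpus and leave the ratio unchanged, so the more likely explanation is on the .NET side: PdfPig pays a one-time JIT compilation cost that is amortised over more documents as the batch grows.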

Why PdfPig Is Faster

PDF parsing is byte-heavy: every page is a stream of PostScript-like operators (move, show text, set font, etc.). Each operator must be lexed, looked up in a dispatch table, and executed against a graphics state machine.
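The lex, look up, execute cycle can be illustrated with a toy interpreter. This is a sketch under heavy simplifying assumptions, not code from pypdf or PdfPig: it handles only five text operators, ignores nested or escaped strings, and treats `Td` as a plain offset. `TextState`, `DISPATCH`, and `interpret` are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# Simplified lexer: names (/F1), literal strings without nesting or escapes,
# numbers, and operator keywords.
TOKEN_RE = re.compile(
    rb"(?P<name>/[^\s/()\[\]<>]+)"
    rb"|(?P<string>\([^()]*\))"
    rb"|(?P<number>[-+]?\d*\.?\d+)"
    rb"|(?P<op>[A-Za-z][A-Za-z*]*|['\"])"
)


@dataclass
class TextState:
    """Toy graphics state: current font, size, text position, shown strings."""
    font: str = ""
    size: float = 0.0
    x: float = 0.0
    y: float = 0.0
    shown: list = field(default_factory=list)


def op_begin_text(state, operands):   # BT: reset the text position
    state.x, state.y = 0.0, 0.0


def op_set_font(state, operands):     # Tf: font name and size
    state.font, state.size = operands[0], operands[1]


def op_move(state, operands):         # Td: offset the text position (simplified)
    state.x += operands[0]
    state.y += operands[1]


def op_show_text(state, operands):    # Tj: show one string
    state.shown.append(operands[0])


def op_end_text(state, operands):     # ET: nothing to do in this toy model
    pass


DISPATCH = {
    b"BT": op_begin_text,
    b"Tf": op_set_font,
    b"Td": op_move,
    b"Tj": op_show_text,
    b"ET": op_end_text,
}


def interpret(stream: bytes) -> TextState:
    """Lex the stream, collect operands, and dispatch on each operator."""
    state = TextState()
    operands = []
    for match in TOKEN_RE.finditer(stream):
        kind, token = match.lastgroup, match.group()
        if kind == "name":
            operands.append(token[1:].decode("ascii"))
        elif kind == "string":
            operands.append(token[1:-1].decode("latin-1"))
        elif kind == "number":
            operands.append(float(token))
        else:
            handler = DISPATCH.get(token)   # unknown operators are skipped
            if handler is not None:
                handler(state, operands)
            operands = []                   # operands belong to one operator only
    return state


SAMPLE = b"BT /F1 12 Tf 72 712 Td (Hello) Tj 0 -14 Td (World) Tj ET"
state = interpret(SAMPLE)
print(state.font, state.size, (state.x, state.y), state.shown)
```

Every operator in the stream costs one dictionary lookup and one function call here, which is the per-operator work the next paragraph is concerned with.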

In Python, each operator dispatch is a Python method call — the CPython bytecode interpreter has overhead per call regardless of what the method does. In .NET, the JIT compiles the dispatch loop to native code the first time it runs; subsequent pages pay only the cost of the actual work.
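A rough way to observe the per-call cost in CPython is to time the same arithmetic done inline and done through one method call per item. The sketch below is a micro-benchmark under assumptions: `Machine`, `run_dispatched`, and `run_inlined` are hypothetical names, the workload is trivial on purpose, and the timings depend on the machine and interpreter version. It does not reproduce pypdf's internals.

```python
import timeit


class Machine:
    """Hypothetical stand-in for a state machine with one trivial operator."""

    def __init__(self):
        self.total = 0

    def op_add(self, value):
        self.total += value


def run_dispatched(values):
    machine = Machine()
    dispatch = {"add": machine.op_add}    # operator name -> bound method
    for value in values:
        dispatch["add"](value)            # one dict lookup and one call per item
    return machine.total


def run_inlined(values):
    total = 0
    for value in values:
        total += value                    # same arithmetic, no call
    return total


values = list(range(100_000))

# Take the best of three runs to reduce noise from other processes.
dispatched_s = min(timeit.repeat(lambda: run_dispatched(values), number=5, repeat=3))
inlined_s = min(timeit.repeat(lambda: run_inlined(values), number=5, repeat=3))
print(f"dispatched: {dispatched_s:.4f} s  inlined: {inlined_s:.4f} s")
```

Both functions compute the same sum, so any gap between the two timings is the cost of the lookup and the call rather than of the work itself.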

Additionally, PdfPig's content-stream parser operates on ReadOnlySpan<byte> — zero-copy slicing through the raw page bytes with no intermediate string allocations. pypdf builds Python string objects for each token.
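`ReadOnlySpan<byte>` is a .NET type, but the idea of a zero-copy slice can be shown in Python with `memoryview`, which is the closest standard-library analogue. The snippet below only illustrates the difference between a copying slice and a view; it makes no claim about how either library is implemented, and the sample bytes are made up.

```python
page_bytes = bytearray(b"BT /F1 12 Tf 72 712 Td (Hello) Tj ET")

# Slicing the buffer and converting to bytes allocates a new object
# and copies the three bytes into it.
copied = bytes(page_bytes[3:6])

# Slicing a memoryview yields a window onto the same buffer; nothing is copied.
window = memoryview(page_bytes)[3:6]

# Change the underlying buffer: the view sees the change, the copy does not.
page_bytes[4] = ord("G")

print(copied)            # b'/F1'
print(window.tobytes())  # b'/G1'
```

A parser that hands out views like `window` avoids one allocation per token, at the price of having to keep the underlying buffer alive for as long as any view is in use.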

Key Code

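// Requires the PdfPig NuGet package.
using UglyToad.PdfPig;

// PdfPig: page extraction (the span-based parsing happens inside the library).
// Result is the benchmark's own record type; its definition is not shown here.
public Result Extract(string path)
{
    using var doc = PdfDocument.Open(path);
    long chars = 0;
    foreach (var page in doc.GetPages())
        chars += page.Text.Length;
    return new Result(chars, doc.NumberOfPages, Ok: true);
}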


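# pypdf: one Python object per token
import pypdf

reader = pypdf.PdfReader(path)  # path: location of the PDF file
chars = 0
for page in reader.pages:
    # Fall back to "" so a page without extractable text counts as zero
    chars += len(page.extract_text() or "")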


The calling code is structurally identical in both languages. The measured difference therefore comes from what happens inside each library and its runtime, not from how the benchmark drives them.

Diagrams

Figure: PDF extraction time by corpus size. Both curves are close to linear, with pypdf rising much more steeply.

Both libraries scale roughly linearly with page count, but PdfPig's slope is about 6× shallower. At 200 files (a typical daily batch job), .NET finishes in under 3 seconds, while Python takes about 17 seconds.

Figure: Speedup multiplier by subset size. The gap widens as the JIT warms up.

The 10-file speedup (3.9×) is lower because .NET's JIT incurs one-time compilation overhead for the PDF parser code paths. At 50 files the speedup has already reached 5.2×, and from 100 files onward it stabilises at around 6×.
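One way to check the one-time overhead reading is to fit a straight line, time = fixed cost + per-file cost × files, to the figures in the results table. The sketch below does this with an ordinary least-squares fit; the helper name `fit_line` is hypothetical, and because the table values are approximate the fitted numbers are only indicative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


files = [10, 50, 100, 200]
pypdf_s = [0.9, 4.2, 8.5, 17.0]     # seconds, from the results table
pdfpig_s = [0.23, 0.81, 1.4, 2.7]   # seconds, from the results table

py_fixed, py_per_file = fit_line(files, pypdf_s)
net_fixed, net_per_file = fit_line(files, pdfpig_s)

print(f"pypdf:  {py_fixed * 1000:6.0f} ms fixed + {py_per_file * 1000:5.1f} ms/file")
print(f"PdfPig: {net_fixed * 1000:6.0f} ms fixed + {net_per_file * 1000:5.1f} ms/file")
print(f"per-file ratio: {py_per_file / net_per_file:.1f}x")
```

On the table's numbers the fit gives PdfPig a fixed cost of about 0.13 s plus about 13 ms per file, and pypdf almost no fixed cost plus about 85 ms per file. That pattern is consistent with a one-time startup cost on the .NET side, although the fit alone cannot say whether that cost is JIT compilation or something else.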