From Pixels to Prescriptions: Engineering OCR Pipelines for Medical Report Simplification Using MongoDB
Kotha Deepak · 2026-04-28 · via DEV Community

Team Members

@k_sidharthareddy_15 | @k-deepak-544 | @nupur_madhrey_07 | @avika_kashyap | @dheerajkumar08 | @chanda_rajkumar


Introduction

So here's the thing: when we started working on MediSimplify, a project that takes medical reports and converts them into patient-friendly language, we thought the hard part would be the NLP simplification. Turns out, just getting the text out of the document was already a mini-nightmare.

Medical reports come as everything: clean PDFs, scanned images, ancient faxed documents that someone scanned and emailed. OCR tools are finicky. Tesseract might not be installed on the deployment machine. A "PDF" might be a text-selectable document or a rasterized scan — and you can't tell which until you open it. We needed something that handled all of this gracefully, without crashing or silently returning garbage.

This post walks through how we built ocr.py — the dedicated OCR service layer inside MediSimplify — and the specific decisions that made it actually reliable in a messy real-world setting.


The Problem

Medical documents are inconsistent by nature. They arrive in formats that no single extraction strategy can handle cleanly.

Images (JPG/PNG) always need OCR. PDFs might have selectable text embedded, or they might be 300 DPI scans of printed pages — you don't know until you try. And raw OCR output is noisy: double spaces, broken newlines, garbled characters everywhere. That noise degrades everything downstream, especially the simplification model.

On top of that, Tesseract isn't guaranteed to be installed wherever the backend runs. If you just call pytesseract.image_to_string() directly, any deployment without a configured Tesseract surfaces a cryptic Python exception to the user, which isn't useful at all.

That led to our guiding principle: always try the cheap path first. Extracting embedded PDF text is near-instant and introduces no recognition errors, while OCR is slow and error-prone. Call Tesseract only when you have to, and when you do, render pages at a proper DPI so the results are actually good.


Our Solution

Rather than scattering OCR logic across the codebase, we built a single ocr.py service that all file upload endpoints call through. It has one job: accept raw bytes, return clean text. Here's what it does:

  • Automatically resolves Tesseract's path from config or system PATH
  • Raises a clear, user-readable error when OCR is unavailable
  • Tries embedded PDF text first (fast path via PyMuPDF)
  • Falls back to Tesseract OCR only for scanned pages
  • Normalises whitespace so downstream NLP gets clean input


Tech Stack

  • Frontend: Next.js (App Router) + TypeScript + Tailwind CSS
  • Backend: FastAPI + PyMongo
  • Database: MongoDB
  • OCR: pytesseract + PyMuPDF
  • AI Simplification: flan-t5-small transformer with medical-term fallback
  • Auth: JWT (login/signup/logout)

Key Features

1. Resolve Tesseract's path from config or system PATH

Tesseract can be installed anywhere — a system binary, a virtualenv, a custom path set by ops. Rather than hardcoding where to look, the service checks your config first, then falls back to a system path search.

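import shutil
from pathlib import Path

# `settings` is the application's config object; the function name is illustrative.
def _resolve_tesseract_cmd() -> str | None:
    configured = (settings.tesseract_cmd or "").strip()

    if configured:
        # The configured value may be a path to an existing binary...
        if Path(configured).exists():
            return configured
        # ...or a bare command name that has to be looked up on PATH.
        resolved = shutil.which(configured)
        if resolved:
            return resolved

    # Nothing usable in config: fall back to a plain PATH search.
    return shutil.which("tesseract")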

This means the same code works on a developer's Mac, a Docker container, and a cloud VM without any changes.


2. Fail fast when OCR isn't available

If Tesseract couldn't be found, we don't want a low-level exception (TesseractNotFoundError in current pytesseract releases) surfacing from deep inside a request. We raise a custom, user-readable exception immediately, before any processing starts.

def _require_tesseract() -> None:
    if not TESSERACT_CMD:
        raise OCRUnavailableError(
            "OCR engine is not available. Install Tesseract "
            "or upload a text-based PDF."
        )

OCRUnavailableError is caught by FastAPI and returned as a clean 422 or 503 response with a message the user can actually act on. No stack traces leaking to the frontend.
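
To see the gate and the error response side by side, here is a minimal sketch. The post does not show the exception class or the handler, so the class body, the `tesseract_cmd` parameter, the `to_error_payload` helper, and the choice of 503 over 422 are all assumptions made for illustration.

```python
from typing import Optional


class OCRUnavailableError(RuntimeError):
    """Raised when no usable Tesseract binary was found."""


def require_tesseract(tesseract_cmd: Optional[str]) -> None:
    # Gate: refuse to start OCR work when the binary path is missing or empty.
    if not tesseract_cmd:
        raise OCRUnavailableError(
            "OCR engine is not available. Install Tesseract "
            "or upload a text-based PDF."
        )


def to_error_payload(exc: OCRUnavailableError) -> tuple[int, dict]:
    # Hypothetical mapping: 503 signals a missing server-side dependency.
    return 503, {"detail": str(exc)}


try:
    require_tesseract(None)
except OCRUnavailableError as exc:
    status, body = to_error_payload(exc)
    print(status, body["detail"])
```

In a FastAPI app, a function registered with `@app.exception_handler(OCRUnavailableError)` could return a payload like this one, which keeps the traceback on the server and sends only the actionable sentence to the client.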


3. Image OCR path — clean and direct

For plain image uploads (JPG, PNG, TIFF), the path is simple: convert to RGB, run Tesseract, normalise whitespace.

def _extract_text_from_image(file_bytes: bytes) -> str:
    _require_tesseract()
    image = Image.open(io.BytesIO(file_bytes)).convert("RGB")
    text = pytesseract.image_to_string(image)
    return " ".join(text.split())

Converting to RGB first avoids issues with RGBA PNGs — the alpha channel confuses some Tesseract versions. The " ".join(text.split()) at the end collapses all whitespace variants into single spaces.
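
The normalisation idiom is easy to try in isolation. The sketch below wraps it in a helper (the name `normalise_whitespace` and the sample string are ours, not from ocr.py) and feeds it the kind of mixed whitespace an OCR engine might emit, including a trailing form feed.

```python
def normalise_whitespace(text: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines, form feeds) and drops empty pieces.
    return " ".join(text.split())


raw = "Hemoglobin:  13.5\n\n g/dL\t(normal)\x0c"
clean = normalise_whitespace(raw)
print(repr(clean))
```

The trade-off is that line and paragraph boundaries are discarded along with the noise, which is fine for a model that consumes flat text but would not suit a layout-preserving export.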


4. PDF dual-path strategy — embedded text wins

This is the most important design decision in the whole service. PDFs are a spectrum, not a type. When a PDF has selectable text baked in, PyMuPDF can extract it in milliseconds with perfect fidelity. Only when that fails do we fall back to the expensive render-and-OCR route.

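# Excerpt from inside the PDF extraction function; `doc` is the opened PyMuPDF document.
embedded_text_pages = []
for page in doc:
    page_text = page.get_text("text")
    # Keep only pages that yield real content, not empty or whitespace-only text.
    if page_text and page_text.strip():
        embedded_text_pages.append(
            " ".join(page_text.split())
        )

# Any embedded text means we can return without touching Tesseract.
if embedded_text_pages:
    return "\n".join(embedded_text_pages)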

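PyMuPDF's get_text("text") typically returns an empty or whitespace-only string for scanned pages, which contain only an image, so we check for actual content before appending. In this document-level form, if even one page has embedded text, we return early. If all pages come back empty, that's a scanned document, and we move to the OCR fallback. Documents that mix typed and scanned pages need a page-level variant of this check, which we cover under Challenge 3.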


5. OCR fallback for scanned PDFs — render at proper DPI

For scanned PDFs, we render each page to a pixmap and run Tesseract on it. The DPI setting matters more than most people realise.

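# Excerpt from the OCR fallback; assumes io, PIL.Image and pytesseract are imported.
pages = []
for page in doc:
    # Render the page to a bitmap at 220 DPI instead of the 72 DPI default.
    pix = page.get_pixmap(dpi=220)
    image = Image.open(
        io.BytesIO(pix.tobytes("png"))
    ).convert("RGB")
    page_text = pytesseract.image_to_string(image)
    pages.append(page_text)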

Why 220 DPI? Below 150, Tesseract struggles with small medical font sizes. Above 300, you're burning memory for minimal accuracy gain on typical scan quality. 220 is a practical sweet spot for medical documents. Each page is rendered independently so we never hold the full document in memory at once.
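
The memory side of that trade-off can be estimated with simple arithmetic, since PDF page sizes are given in points at 72 per inch. The helper below is a rough sketch of ours, not part of ocr.py: it assumes an uncompressed 3-byte-per-pixel RGB bitmap and a US Letter page, and the real footprint inside PyMuPDF and Pillow will differ somewhat.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def pixmap_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int, int]:
    """Return (width_px, height_px, raw_rgb_bytes) for one rendered page."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px, width_px * height_px * 3  # 3 bytes per RGB pixel


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (72, 150, 220, 300):
    w, h, size = pixmap_size(612, 792, dpi)
    print(f"{dpi:>3} DPI: {w} x {h} px, {size / 1_000_000:.1f} MB uncompressed")
```

Under these assumptions a Letter page costs about 13.6 MB at 220 DPI and about 25.2 MB at 300 DPI, so pixel count (and memory) grows with the square of the DPI.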


Overall Workflow

Step 1 — Secure upload: The user uploads a file (image or PDF) through the web dashboard. It lands at the FastAPI /reports/upload endpoint, authenticated via JWT.

Step 2 — OCR service called: The raw bytes are handed to ocr.py. The service detects the file type and chooses a path.

Step 3 — Fast path (text PDF): PyMuPDF checks each page for embedded text. If found, it's extracted and normalised immediately — no OCR needed.

Step 4 — Fallback path (scanned PDF or image): Tesseract availability is verified first. If missing, an OCRUnavailableError is raised with a clear message. If present, pages are rendered at 220 DPI and OCR'd one by one.

Step 5 — Simplification: The clean extracted text is passed to the flan-t5-small pipeline, which generates a patient-friendly explanation and highlights important medical terms.

Step 6 — Result stored: The simplified result and key terms are saved to MongoDB under the user's account. The user sees it on their dashboard.
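
Step 2 says the service detects the file type, but the post doesn't show how. One common approach, sketched here as an assumption rather than as MediSimplify's actual logic, is to look at the leading magic bytes instead of trusting the filename or the client-supplied content type. The function name and return labels are hypothetical.

```python
def detect_file_type(file_bytes: bytes) -> str:
    """Classify an upload as 'pdf', 'image', or 'unknown' from its first bytes."""
    if file_bytes.startswith(b"%PDF-"):
        return "pdf"
    image_signatures = (
        b"\x89PNG\r\n\x1a\n",  # PNG
        b"\xff\xd8\xff",        # JPEG
        b"II*\x00",             # TIFF, little-endian
        b"MM\x00*",             # TIFF, big-endian
    )
    # bytes.startswith accepts a tuple and matches any of its entries.
    if file_bytes.startswith(image_signatures):
        return "image"
    return "unknown"


print(detect_file_type(b"%PDF-1.7\n..."))
print(detect_file_type(b"\xff\xd8\xff\xe0 rest of a JPEG"))
```

A strict prefix check will reject the occasional PDF that carries junk bytes before its header, so a production version might scan the first kilobyte instead.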


Challenges We Faced

Challenge 1: OCR reliability depends on system setup

The problem: Tesseract behaves differently across OS, installation method, and locale settings. On macOS via Homebrew it just works. In a Debian Docker image you need specific language packs. On some cloud VMs it's not installed at all. We wasted hours debugging before realising the issue was never our code.

The fix: The _require_tesseract() gate runs before any OCR call. If Tesseract isn't there, the user gets a clean error message telling them exactly what to do — not a Python traceback. We also added Tesseract installation as a required step in our README and Dockerfile.


Challenge 2: Raw OCR output broke the simplification model

The problem: Early versions piped raw Tesseract output directly into the language model. The model kept getting confused by double newlines, hyphenated line-breaks from PDF column layouts, and random whitespace characters. Accuracy on medical term identification dropped noticeably.

The fix: Instead of cleaning up in the simplification layer, we normalise immediately in the OCR service. Every text string goes through " ".join(text.split()) before being returned. Simple, but it eliminated the most common noise patterns and gave the NLP model clean, consistent input.
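
One caveat: whitespace collapsing alone turns a word split as "hemo-" / "globin" into "hemo- globin". If hyphenated line breaks matter for your documents, a small extra pass before normalising can rejoin them. This is a hypothetical extension, not something the post says ocr.py does; the function name and regex are ours.

```python
import re

# A hyphen directly after a word character, followed by a line break
# (with optional spaces or tabs), followed by another word character.
_HYPHEN_BREAK = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)")


def join_hyphenated_breaks(text: str) -> str:
    # Rejoin words split across lines, then collapse the remaining whitespace.
    joined = _HYPHEN_BREAK.sub("", text)
    return " ".join(joined.split())


print(join_hyphenated_breaks("elevated hemo-\nglobin  levels\nwere noted"))
```

The heuristic also merges genuine compounds that happen to break at the hyphen ("follow-" / "up" becomes "followup"), so it is worth checking against real reports before enabling it.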


Challenge 3: Some PDFs have both embedded text AND scanned pages

The problem: Hospital discharge summaries sometimes have a typed cover page and scanned test result attachments in the same PDF. Our initial strategy of "embedded text OR OCR" missed the scanned pages entirely.

The fix: We shifted from document-level to page-level decisions. Each page is checked for embedded text independently. Pages with content use fast extraction; pages without content fall through to OCR. The final result merges all pages in order — and handles hybrid documents cleanly.
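
The page-level decision can be sketched independently of PyMuPDF and Tesseract. In the sketch below, `embedded_texts` stands in for the per-page results of get_text("text") and `ocr_page` for a callback that renders and OCRs one page; both names, and the fake OCR callback, are assumptions so that the example runs on its own.

```python
from typing import Callable, Sequence


def merge_pages(
    embedded_texts: Sequence[str],
    ocr_page: Callable[[int], str],
) -> str:
    """Use embedded text where a page has it; otherwise OCR that page."""
    merged = []
    for index, embedded in enumerate(embedded_texts):
        # Decide per page: non-blank embedded text wins, blank pages go to OCR.
        text = embedded if embedded.strip() else ocr_page(index)
        merged.append(" ".join(text.split()))
    return "\n".join(merged)


# Stand-in OCR callback so the sketch runs without Tesseract installed.
def fake_ocr(index: int) -> str:
    return f"[ocr text for page {index + 1}]"


pages = ["Discharge  Summary\nPatient: J. Doe", "", "   "]
result = merge_pages(pages, fake_ocr)
print(result)
```

Because the OCR callback only fires for blank pages, a fully text-based PDF never touches Tesseract, and page order is preserved for hybrid documents.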


What We Learned

1. OCR reliability is a system problem, not a code problem. You can write perfect Python and still get garbage output if Tesseract isn't configured right. Explicit dependency checks are essential — don't trust that the environment is set up correctly.

2. Prioritise embedded text over OCR, always. Mixed-document pipelines should attempt native extraction first. OCR is the last resort, not the default. This alone cut our average extraction time by 70% for the most common document type.

3. Normalise early, normalise once. Whitespace normalisation in the extraction layer is far better than doing it later. By the time text reaches the NLP model, it should already be clean. Downstream components shouldn't have to defend against upstream noise.

4. Error messages are UX. OCRUnavailableError with a sentence explaining what to do is far more useful than a raw "tesseract not found" exception with a stack trace. The extra 10 minutes to write a good exception class saves everyone hours of confusion later.

5. DPI isn't a detail — it's a quality lever. Rendering scanned PDFs at 72 DPI (screen resolution) gives terrible OCR results on small fonts. 220 DPI was the sweet spot between quality and memory usage for medical documents specifically. Always benchmark for your document type.


Conclusion

MediSimplify’s OCR layer is surprisingly small—around 80 lines of Python—but it handles a lot of real-world complexity. Decisions like dynamically resolving the Tesseract path, failing fast with clear errors, checking for embedded text before running OCR, and cleaning up output early all came from practical issues during development.

If you're working with user-uploaded documents, it’s worth treating OCR as a core part of your system, with proper error handling and dependency management—it makes debugging much easier later. The next step is adding preprocessing, like fixing skewed pages and reducing noise, which should noticeably improve results, especially for older, low-quality medical scans.


Try It Yourself

GitHub: https://github.com/K-Sidhartha-Reddy/MediSimplify

Demo: