Self-Hosted Document AI: How to Run Document Intelligence On Your Own Infrastructure (2026)
DokuBrain · 2026-05-25 · via DEV Community

Cloud-based document AI services are convenient — you send documents to an API, get structured data back, and pay by the page. They are also a non-starter for a significant portion of organizations whose work involves sensitive, confidential, or regulated documents that cannot leave their controlled environments.

Healthcare organizations covered by HIPAA cannot route patient records through third-party cloud services without negotiating a business associate agreement (BAA) and auditing the vendor's security, a bar that many SMB-focused cloud services do not clear. Law firms operating under attorney-client privilege have clients who explicitly require that their documents never be processed by external cloud services. Government contractors working with controlled unclassified information face federal restrictions on external data processing. Finance teams handling M&A deal documents work under confidentiality agreements that prohibit third-party cloud processing.

For these teams, the choice is not "cloud vs. self-hosted" based on cost or convenience. It is "self-hosted or no AI at all."

This guide covers why self-hosted document AI exists, what it requires to deploy, and which platforms actually support it — because the options are considerably more limited than the market would suggest.


Why Most Document AI Platforms Don't Support Self-Hosting

The dominant document AI platforms — Docsumo, Nanonets, Rossum, LlamaParse — are cloud-only. Your documents are processed on their infrastructure. This is not a technical limitation; it is a business model choice. Cloud processing enables per-page pricing, easy updates, and centralized model improvement.

Enterprise platforms like Hyperscience and UiPath Document Understanding offer on-premise deployment, but only at enterprise contract pricing: annual fees that start in the tens of thousands of dollars and often reach six figures, with dedicated implementation teams. This is out of reach for a 50-person law firm or a 100-person healthcare practice.

The gap this creates: organizations with genuine data sovereignty requirements and budgets under $50K/year have almost no viable options. They either run legacy OCR tools (Tesseract, ABBYY on-premise at high per-seat cost), build custom Python pipelines that require engineering teams, or simply do not automate document processing.

DokuBrain's self-hosted deployment mode is specifically designed to fill this gap — a full intelligent document processing platform that runs on your infrastructure via Docker Compose, accessible without an enterprise contract.


What Self-Hosted Document AI Actually Includes

A capable self-hosted document AI deployment needs several components. Knowing what each one does helps you evaluate whether a platform covers your requirements or requires you to assemble the stack yourself.

Document ingestion layer. Accepts files via upload, email, watched folder, or API. Stores raw documents in object storage. In the DokuBrain stack, MinIO provides S3-compatible object storage that runs locally.

Text extraction service. Converts documents to machine-readable text. For machine-generated PDFs, direct text extraction is fast and highly accurate. For scanned documents and photos, OCR is required. DokuBrain's Python extractor service supports multiple backends: IBM Docling and Marker for local, on-premise OCR; LlamaParse and LLMWhisperer as optional cloud augmentation if you choose to enable them.

AI extraction and classification. Identifies document type and extracts structured fields. This requires a language model with document understanding capability. DokuBrain uses transformer-based models that run locally — the extraction does not require sending documents to OpenAI or any external LLM provider unless you configure it to.

Vector database for semantic search. Enables RAG queries ("show me all contracts with auto-renewal clauses in Q2") and hybrid search across your document library. Qdrant is an open-source vector database that runs in Docker and requires no cloud connectivity.

Relational database. Stores document metadata, extracted fields, workflow state, audit logs. PostgreSQL 16 in Docker.

Queue and cache. Redis handles background job queuing (extraction jobs, email processing, webhook delivery) and caching.

Frontend and API. DokuBrain provides a Next.js web interface and Fastify REST API, both running as Docker services.

The full stack runs via a single docker compose up command. On a properly sized server, initial setup takes 30-60 minutes for a technical user.


Infrastructure Requirements

Minimum viable (development / low volume):

  • 8 CPU cores, 16GB RAM, 100GB SSD
  • Handles machine-generated PDFs at moderate volume
  • Not recommended for production with scanned documents or high-frequency processing

Recommended (production, SMB scale):

  • 16 CPU cores, 32-64GB RAM, 500GB+ NVMe SSD
  • Handles mixed document types at up to 10,000 pages/day
  • Supports concurrent users on the web interface

High-volume or GPU-accelerated:

  • 16+ CPU cores, 64GB+ RAM, NVIDIA GPU with 8GB+ VRAM (for on-device LLM inference and GPU-accelerated OCR)
  • Handles 50,000+ pages/day, reduces OCR latency on scanned documents from seconds to sub-second

Storage sizing: Plan for 5-10x the raw document storage in system storage. A 1GB PDF library grows to 5-10GB when you account for extracted text, embeddings, thumbnails, and database overhead.
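
As a rough illustration of the 5-10x rule, the following shell sketch turns a raw library size into a planning range. The function name estimate_storage is made up for this example and is not part of DokuBrain; the multipliers come straight from the rule of thumb above.

```shell
# estimate_storage RAW_GB
# Prints the planning range implied by the 5-10x rule of thumb.
# (Hypothetical helper for illustration; not a DokuBrain command.)
estimate_storage() {
  local raw_gb="$1"
  echo "$(( raw_gb * 5 ))-$(( raw_gb * 10 ))GB"
}

estimate_storage 1     # prints 5-10GB, matching the example in the text
estimate_storage 200   # prints 1000-2000GB
```

Sizing the disk toward the upper end of the range leaves room for re-extraction runs and database growth.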

Network: Self-hosted deployments do not need internet connectivity for document processing once a local LLM provider is configured. Outbound internet is optional: it is used only for LLM API calls if you configure cloud LLM providers, and for email ingestion if you use IMAP. Air-gapped deployments work with local models only.


Deployment Guide: DokuBrain on Docker Compose

The following covers the standard deployment path for a production DokuBrain self-hosted instance.

Step 1: Server preparation. Install Docker and Docker Compose on Ubuntu 22.04 LTS or equivalent. Recommended: create a dedicated user for the deployment and configure the firewall to allow only ports 80/443 (web) and 22 (SSH).

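# Install Docker using the official convenience script
curl -fsSL https://get.docker.com | bash
# Allow the current user to run docker without sudo (log out and back in for this to take effect)
sudo usermod -aG docker $USER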

# Clone the repository
git clone https://github.com/dokubrain/doku-engine.git
cd doku-engine

Step 2: Environment configuration. Copy .env.example to .env. The critical variables to configure:

# Database
DATABASE_URL=postgres://postgres:STRONG_PASSWORD@postgres:5432/dokuengine

# Object storage (local MinIO)
S3_ENDPOINT=http://minio:9000
S3_ACCESS_KEY=your-access-key
S3_SECRET_KEY=STRONG_SECRET

# JWT tokens
JWT_SECRET=generate-with-openssl-rand-base64-32
NEXTAUTH_SECRET=separate-strong-secret

# LLM provider: set exactly one. Cloud (OpenAI):
LLM_PROVIDER=openai
LLM_API_KEY=your-key

# OR local inference (Ollama): comment out the two lines above and uncomment these
# LLM_PROVIDER=ollama
# OLLAMA_BASE_URL=http://ollama:11434

# Frontend URL (your server's domain or IP)
FRONTEND_URL=https://documents.yourcompany.com
NEXTAUTH_URL=https://documents.yourcompany.com
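
The JWT_SECRET placeholder above hints at openssl rand -base64 32. A minimal sketch of generating both secrets that way follows; using the same command for NEXTAUTH_SECRET is an assumption, since the configuration only asks for a separate strong value.

```shell
# Generate two independent 32-byte random values, base64-encoded.
JWT_SECRET="$(openssl rand -base64 32)"
NEXTAUTH_SECRET="$(openssl rand -base64 32)"

# Print them in .env syntax so they can be pasted over the placeholders.
printf 'JWT_SECRET=%s\nNEXTAUTH_SECRET=%s\n' "$JWT_SECRET" "$NEXTAUTH_SECRET"
```

Generating each value with its own call keeps the two secrets unrelated, so rotating one does not force a change to the other.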

Step 3: Start the stack.

# Production stack
docker compose -f docker-compose.prod.yml up -d

# Verify all services are running (pass the same -f file so Compose inspects the production stack)
docker compose -f docker-compose.prod.yml ps

Step 4: Initialize the database.

make db-migrate
make db-seed

Step 5: Configure reverse proxy. For HTTPS (required in production), place Caddy or nginx in front of the DokuBrain web service. Caddy handles automatic certificate provisioning from Let's Encrypt with a single configuration line.

# Caddyfile
documents.yourcompany.com {
  reverse_proxy localhost:3000
}

Step 6: Test the deployment. Access https://documents.yourcompany.com, register the first admin account, upload a test document, and verify extraction runs successfully.


Keeping Documents Off External LLMs

The default DokuBrain configuration uses OpenAI for document understanding and embedding generation. For fully air-gapped or privacy-constrained deployments, you need to replace this with local inference.

Option 1: Ollama (recommended for most self-hosted deployments). Ollama runs open-source LLMs locally — Llama 3, Mistral, Qwen, and others. Configure in .env:

LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://ollama:11434
LLM_MODEL=llama3.1:8b
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text

Local models are slower than OpenAI API calls and require substantial RAM (roughly 7-13GB for 8B models and 30GB+ for 70B models, depending on quantization). For document extraction tasks, 8B models like Llama 3.1 8B perform adequately on structured document types. Complex reasoning tasks (contract clause analysis, multi-document comparison) benefit from larger models.

Option 2: vLLM or LM Studio. Higher-performance local inference options for organizations with GPU capacity.

Option 3: Private Azure OpenAI or AWS Bedrock. If your organization uses Azure or AWS with private endpoints, you can route LLM calls through your cloud provider's private network rather than the public OpenAI API. Documents stay within your cloud environment. Configure the appropriate endpoint URLs in .env.


The Self-Hosting Landscape in 2026: What Your Options Actually Are

Beyond DokuBrain, the self-hosted document AI landscape is thin.

IBM Docling is an open-source Python library for document extraction — it handles PDF parsing, table extraction, and text chunking. It is not a complete platform: no web interface, no multi-user access, no workflow automation, no search. It is a component that developers use to build pipelines. Excellent for the extraction layer in a custom stack.

Marker is an open-source PDF-to-Markdown converter that runs locally. Similar scope to Docling — excellent extraction quality, no platform features.

Tesseract OCR is the dominant open-source OCR engine. Accurate for clean documents, falls behind commercial alternatives on degraded scans and handwriting. Widely integrated as the fallback OCR layer in many open-source stacks.

Mayan EDMS is an open-source document management system with basic workflow features. More document management than document intelligence — limited AI extraction capability.

Enterprise on-premise options (Hyperscience, UiPath, ABBYY Vantage, Kofax) all offer on-premise deployment but exclusively through enterprise contracts with dedicated implementation and annual fees starting at $50,000-150,000. Not a realistic option for organizations under 500 employees.

The practical conclusion: for organizations that need a full document intelligence platform with AI extraction, classification, search, RAG, and workflow automation — self-hosted — DokuBrain is currently the only accessible option. Organizations willing to build their own stack can assemble components (Docling for extraction, Qdrant for vectors, PostgreSQL for storage), but this requires significant engineering investment to maintain.


Security Considerations for Self-Hosted Deployments

Running document AI on your own infrastructure does not automatically mean it is secure. The following configurations are non-negotiable for production deployments handling sensitive documents.

Network isolation. The DokuBrain services (PostgreSQL, Redis, Qdrant, MinIO) should not be exposed to the internet. Only the web frontend and API require external access. Use Docker networks to isolate internal services.

Authentication. Configure SSO/SAML integration for organizations with existing identity providers. Enable two-factor authentication on all admin accounts. Rotate JWT secrets on a schedule.

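Encryption at rest. Enable disk encryption on the server. Stock PostgreSQL 16 has no built-in transparent data encryption, so protect the database files with full-disk or volume-level encryption (for example LUKS), or use a PostgreSQL distribution or extension that adds TDE. MinIO supports server-side encryption for stored documents.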

Audit logging. DokuBrain logs all document access, extraction runs, and administrative actions. Ensure logs ship to a separate system so they cannot be modified in the event of a security incident.

Backup. Daily automated backups of the PostgreSQL database and MinIO document storage to a separate location. Test restore procedures quarterly. A document intelligence system that loses your extracted data and document library is worse than not having one.
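
A daily PostgreSQL dump can be sketched as below. The service name postgres, user postgres, and database dokuengine are read off the DATABASE_URL example earlier and may differ in your compose file; backup_name and backup_postgres are hypothetical helper names, not DokuBrain commands. MinIO data needs its own copy (for example with MinIO's mc mirror), which is not shown here.

```shell
# Hypothetical helpers for illustration; names and paths are assumptions.

# backup_name DATE: file name for that day's dump, e.g. dokuengine-2026-05-25.sql
backup_name() {
  printf 'dokuengine-%s.sql\n' "$1"
}

# backup_postgres BACKUP_DIR: dump the database, then compress it.
# Run from the doku-engine directory so the compose file is found.
backup_postgres() {
  local dump
  dump="$1/$(backup_name "$(date +%F)")"
  # -T disables the pseudo-TTY so the dump can be redirected to a file.
  docker compose -f docker-compose.prod.yml exec -T postgres \
    pg_dump -U postgres dokuengine > "$dump" || { rm -f "$dump"; return 1; }
  gzip -f "$dump" && echo "$dump.gz"
}

# Example: backup_postgres /mnt/backups
```

Calling backup_postgres from a scheduled job and copying the resulting file to another host would cover the "separate location" requirement for the database. Restoring one of these dumps into a scratch instance is a simple way to run the quarterly restore test.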

Updates. Pull updated Docker images regularly. DokuBrain's components (the web app, API, and underlying libraries) receive updates when vulnerabilities are patched, and a deployment only benefits once the new images are pulled and the stack is restarted. Air-gapped deployments need a separate procedure to receive and verify image updates.
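
For air-gapped servers, one possible procedure is to export images on a connected machine, carry the archive across, and check its SHA-256 digest before loading it. The sketch below assumes this workflow; verify_archive is a hypothetical helper, and the archive and image names are placeholders.

```shell
# verify_archive FILE EXPECTED_SHA256
# Succeeds only if the file's SHA-256 digest matches the expected value.
# (Hypothetical helper for illustration; not a DokuBrain command.)
verify_archive() {
  local actual
  actual="$(sha256sum "$1" | cut -d' ' -f1)"
  [ "$actual" = "$2" ]
}

# On a connected machine (image names are placeholders):
#   docker save -o dokubrain-images.tar IMAGE[:TAG] ...
#   sha256sum dokubrain-images.tar     # record the digest and send it separately
#
# On the air-gapped server:
#   verify_archive dokubrain-images.tar "<digest>" && docker load -i dokubrain-images.tar
```

Sending the digest over a different channel than the archive is what makes the check meaningful: a tampered archive would then fail verification before docker load runs.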


When Self-Hosted Is and Isn't the Right Call

Self-host if:

  • Your documents are covered by HIPAA, GDPR, attorney-client privilege, or industry regulations prohibiting third-party cloud processing
  • You have client confidentiality requirements that preclude cloud processing
  • You operate in an air-gapped or restricted network environment
  • Your document volumes are large enough that self-hosted infrastructure costs less than per-page cloud pricing (typically 50,000+ pages/month)
  • You have a technical team capable of managing Docker deployments and Linux servers
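
Where the volume break-even lands depends on your own numbers. A minimal sketch of the arithmetic follows, with prices in cents to stay within shell integer math. The $500/month server cost and 1 cent/page cloud price are illustrative assumptions, not quotes from any vendor, and break_even_pages is a hypothetical helper name.

```shell
# break_even_pages MONTHLY_SERVER_COST_CENTS CLOUD_PRICE_PER_PAGE_CENTS
# Prints the monthly page count at which cloud spend reaches the
# self-hosted cost, rounded up to a whole page.
break_even_pages() {
  local server_cents="$1" page_cents="$2"
  echo $(( (server_cents + page_cents - 1) / page_cents ))
}

break_even_pages 50000 1   # $500/month vs 1 cent/page: prints 50000
break_even_pages 50000 3   # $500/month vs 3 cents/page: prints 16667
```

Staff time for maintenance is not included, so the real break-even volume is higher than this figure suggests.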

Use cloud if:

  • Your documents do not have data sovereignty requirements
  • You have no technical staff available for infrastructure management
  • Your volume is low and per-page costs are not material
  • You need to get started in hours rather than days

For most SMBs without compliance-driven requirements, DokuBrain's cloud deployment is simpler and immediately available. The self-hosted path is for teams where data sovereignty is non-negotiable, not a preference.


Frequently Asked Questions

What is self-hosted document AI?

Self-hosted document AI refers to running document intelligence software on your own servers rather than sending documents to a third-party cloud service. Your documents never leave your environment. All processing — OCR, extraction, classification, search — runs on infrastructure you control.

Why would a company choose self-hosted over cloud?

The primary drivers are data sovereignty (documents never leave your control), compliance requirements (HIPAA, GDPR, legal privilege), client confidentiality, and air-gapped environments where external internet connectivity is restricted. Cost at scale is a secondary factor.

What infrastructure do you need to self-host document AI?

At minimum: 8 CPU cores, 16GB RAM, 100GB SSD. For production: 16+ cores, 32-64GB RAM, 500GB+ NVMe. The DokuBrain stack runs on Docker Compose and requires PostgreSQL, Redis, Qdrant, and MinIO — all containerized.

Which document AI platforms support self-hosting?

DokuBrain supports full self-hosting via Docker Compose with accessible pricing. Enterprise platforms (Hyperscience, UiPath, ABBYY) offer on-premise but at $50K+ annual fees. Most commercial platforms (Docsumo, Nanonets, Rossum) are cloud-only. Open-source tools like Docling and Marker cover extraction components but are not complete platforms.

Is self-hosted document AI harder to maintain than cloud?

Self-hosted requires infrastructure management: monitoring, updates, backups. Cloud handles this invisibly. Docker Compose deployments are manageable for technical teams without dedicated DevOps staff — updates are single commands and backups follow standard procedures. The operational burden is real but not prohibitive for organizations with a developer or IT generalist.



Originally published on DokuBrain Blog. DokuBrain is an intelligent document processing platform for SMBs, legal teams, and compliance teams.