Self-Hosted Document AI: How to Run Document Intelligence On Your Own Infrastructure (2026)

Cloud-based document AI services are convenient — you send documents to an API, get structured data back, and pay by the page. They are also a non-starter for a significant portion of organizations whose work involves sensitive, confidential, or regulated documents that cannot leave their controlled environments.

Healthcare organizations covered by HIPAA cannot route patient records through a third-party cloud service without a signed business associate agreement (BAA) and, in practice, a vendor security audit, which many SMB-oriented cloud services cannot pass. Law firms operating under attorney-client privilege have clients who explicitly require that their documents never be processed by external cloud services. Government contractors working with controlled unclassified information face federal restrictions on external data processing. Finance teams handling M&A deal documents work under confidentiality agreements that prohibit third-party cloud processing.

For these teams, the choice is not "cloud vs. self-hosted" based on cost or convenience. It is "self-hosted or no AI at all."

This guide covers why self-hosted document AI exists, what it requires to deploy, and which platforms actually support it — because the options are considerably more limited than the market would suggest.

Why Most Document AI Platforms Don't Support Self-Hosting

The dominant document AI platforms — Docsumo, Nanonets, Rossum, LlamaParse — are cloud-only. Your documents are processed on their infrastructure. This is not a technical limitation; it is a business model choice. Cloud processing enables per-page pricing, easy updates, and centralized model improvement.

Enterprise platforms like Hyperscience and UiPath Document Understanding offer on-premise deployment, but at enterprise contract pricing — six-figure annual fees with dedicated implementation teams. This is not accessible to a 50-person law firm or a 100-person healthcare practice.

The gap this creates: organizations with genuine data sovereignty requirements and budgets under $50K/year have almost no viable options. They either run legacy OCR tools (Tesseract, ABBYY on-premise at high per-seat cost), build custom Python pipelines that require engineering teams, or simply do not automate document processing.

DokuBrain's self-hosted deployment mode is specifically designed to fill this gap — a full intelligent document processing platform that runs on your infrastructure via Docker Compose, accessible without an enterprise contract.

What Self-Hosted Document AI Actually Includes

A capable self-hosted document AI deployment needs several components. Knowing what each one does helps you evaluate whether a platform covers your requirements or requires you to assemble the stack yourself.

Document ingestion layer. Accepts files via upload, email, watched folder, or API. Stores raw documents in object storage. In the DokuBrain stack, MinIO provides S3-compatible object storage that runs locally.

Text extraction service. Converts documents to machine-readable text. For machine-generated PDFs, direct text extraction is fast and highly accurate. For scanned documents and photos, OCR is required. DokuBrain's Python extractor service supports multiple backends: IBM Docling and Marker for local, on-premise OCR; LlamaParse and LLMWhisperer as optional cloud augmentation if you choose to enable them.
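The split between direct extraction and OCR can be decided per page. The following is a minimal sketch of that routing decision, assuming the text layer of each page has already been read by whatever PDF library you use. The `Page` class, the `pages_needing_ocr` helper, and the 20-character threshold are hypothetical illustrations, not part of DokuBrain, Docling, or Marker.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Text recovered from a page's embedded text layer (empty for pure scans)."""
    number: int
    text_layer: str


def pages_needing_ocr(pages: list[Page], min_chars: int = 20) -> list[int]:
    """Return the page numbers whose text layer is too thin to trust.

    min_chars is an illustrative threshold, not a tuned value.
    """
    return [p.number for p in pages if len(p.text_layer.strip()) < min_chars]


pages = [
    Page(1, "Invoice 2026-001\nTotal due: 1,240.00 EUR\nPayment terms: 30 days"),
    Page(2, ""),       # scanned page with no text layer
    Page(3, "  \n "),  # whitespace-only layer, treated like a scan
]
print(pages_needing_ocr(pages))  # [2, 3]
```

Routing per page rather than per file means a mostly digital PDF with a few scanned attachments only pays the OCR cost for the pages that need it.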

AI extraction and classification. Identifies document type and extracts structured fields. This requires a language model with document understanding capability. DokuBrain can run transformer-based models locally, in which case extraction does not send documents to OpenAI or any other external LLM provider. Because the default configuration described later in this guide uses OpenAI, local inference has to be enabled explicitly.

Vector database for semantic search. Enables RAG queries ("show me all contracts with auto-renewal clauses in Q2") and hybrid search across your document library. Qdrant is an open-source vector database that runs in Docker and requires no cloud connectivity.
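Hybrid search means merging a keyword ranking with a vector-similarity ranking. One common way to merge them is reciprocal rank fusion (RRF), sketched below in plain Python. This illustrates the general technique only: the article does not say which fusion method DokuBrain or its Qdrant configuration uses, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in. k = 60 is the
    constant commonly used with RRF, not a tuned value.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["contract-17", "contract-03", "contract-42"]  # e.g. full-text match on "auto-renewal"
vector_hits = ["contract-03", "contract-88", "contract-17"]   # e.g. nearest neighbours of the query embedding

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['contract-03', 'contract-17', 'contract-88', 'contract-42']
```

Documents that appear in both lists rise to the top, which is the behaviour you want when a query mixes exact terms ("Q2") with a fuzzy concept ("auto-renewal clauses").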

Relational database. Stores document metadata, extracted fields, workflow state, audit logs. PostgreSQL 16 in Docker.

Queue and cache. Redis handles background job queuing (extraction jobs, email processing, webhook delivery) and caching.

Frontend and API. DokuBrain provides a Next.js web interface and Fastify REST API, both running as Docker services.

The full stack runs via a single docker compose up command. On a properly sized server, initial setup takes 30-60 minutes for a technical user.

Infrastructure Requirements

Minimum viable (development / low volume):

8 CPU cores, 16GB RAM, 100GB SSD
Handles machine-generated PDFs at moderate volume
Not recommended for production with scanned documents or high-frequency processing

Recommended (production, SMB scale):

16 CPU cores, 32-64GB RAM, 500GB+ NVMe SSD
Handles mixed document types at up to 10,000 pages/day
Supports concurrent users on the web interface

High-volume or GPU-accelerated:

16+ CPU cores, 64GB+ RAM, NVIDIA GPU with 8GB+ VRAM (for on-device LLM inference and GPU-accelerated OCR)
Handles 50,000+ pages/day and reduces per-page OCR latency on scanned documents from seconds to sub-second
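Daily page counts are easier to reason about as a sustained rate. The sketch below converts the tier figures above into pages per second, assuming processing is spread evenly over a window you choose. The function name and the 8-hour window are illustrative assumptions.

```python
def required_pages_per_second(pages_per_day: int, window_hours: float = 24.0) -> float:
    """Sustained throughput needed to clear a daily volume within a processing window."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return pages_per_day / (window_hours * 3600)


for label, pages in [("SMB production", 10_000), ("high volume", 50_000)]:
    around_the_clock = required_pages_per_second(pages)
    business_hours = required_pages_per_second(pages, window_hours=8)
    print(f"{label}: {around_the_clock:.2f} pages/s over 24h, {business_hours:.2f} pages/s over 8h")
```

If documents arrive in bursts during office hours, size for the shorter window rather than the 24-hour average.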

Storage sizing: Plan for total system storage of 5-10x the size of the raw documents. A 1GB PDF library grows to roughly 5-10GB once you account for extracted text, embeddings, thumbnails, and database overhead.
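The 5-10x rule is simple enough to turn into a quick capacity check. This is a sketch of that arithmetic only. The helper names are hypothetical, and the multipliers are the article's rule of thumb rather than measured values.

```python
def storage_estimate_gb(raw_gb: float, low: float = 5.0, high: float = 10.0) -> tuple[float, float]:
    """Return the (low, high) total storage estimate for a raw document library."""
    return raw_gb * low, raw_gb * high


def fits_on_disk(raw_gb: float, disk_gb: float) -> bool:
    """Conservative check: compare the disk against the upper end of the estimate."""
    _, high = storage_estimate_gb(raw_gb)
    return high <= disk_gb


print(storage_estimate_gb(1))     # (5.0, 10.0)
print(fits_on_disk(40, 500))      # True: at most 400GB needed
print(fits_on_disk(80, 500))      # False: could need up to 800GB
```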

Network: Self-hosted deployments do not require internet connectivity for document processing. Outbound internet is optional — used only for LLM API calls if you configure cloud LLM providers, and for email ingestion if you use IMAP. Air-gapped deployments work with local LLM models only.

Deployment Guide: DokuBrain on Docker Compose

The following covers the standard deployment path for a production DokuBrain self-hosted instance.

Step 1: Server preparation. Install Docker and Docker Compose on Ubuntu 22.04 LTS or equivalent. Recommended: create a dedicated user for the deployment, configure firewall to allow only ports 80/443 (web) and 22 (SSH).

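# Install Docker (review the convenience script before piping it to a shell)
curl -fsSL https://get.docker.com | bash
sudo usermod -aG docker "$USER"
# Log out and back in so the new docker group membership takes effect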

# Clone the repository
git clone https://github.com/dokubrain/doku-engine.git
cd doku-engine

Step 2: Environment configuration. Copy .env.example to .env. The critical variables to configure:

# Database
DATABASE_URL=postgres://postgres:STRONG_PASSWORD@postgres:5432/dokuengine

# Object storage (local MinIO)
S3_ENDPOINT=http://minio:9000
S3_ACCESS_KEY=your-access-key
S3_SECRET_KEY=STRONG_SECRET

# JWT tokens
JWT_SECRET=generate-with-openssl-rand-base64-32
NEXTAUTH_SECRET=separate-strong-secret

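# LLM provider (set LLM_PROVIDER exactly once; with duplicate keys the effective value depends on the parser)
# Cloud option:
LLM_PROVIDER=openai
LLM_API_KEY=your-key

# OR use local inference (Ollama): comment out the two lines above and uncomment these
# LLM_PROVIDER=ollama
# OLLAMA_BASE_URL=http://ollama:11434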

# Frontend URL (your server's domain or IP)
FRONTEND_URL=https://documents.yourcompany.com
NEXTAUTH_URL=https://documents.yourcompany.com
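Before starting the stack, it is worth confirming that no example placeholder survived into the real .env. The sketch below is a standalone check, assuming a simple KEY=VALUE file. The parser is deliberately minimal and is not a full dotenv implementation, and the helper names are hypothetical. `secrets.token_urlsafe` from the Python standard library is one way to generate the secret values.

```python
import secrets

# Placeholder strings taken from the example configuration above.
PLACEHOLDERS = {
    "STRONG_PASSWORD", "STRONG_SECRET", "your-access-key", "your-key",
    "generate-with-openssl-rand-base64-32", "separate-strong-secret",
}


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and full-line comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop trailing inline comments such as "openai   # cloud".
        env[key.strip()] = value.split(" #", 1)[0].strip()
    return env


def find_placeholders(env: dict[str, str]) -> list[str]:
    """Return the keys whose values still contain an example placeholder."""
    return sorted(k for k, v in env.items() if any(p in v for p in PLACEHOLDERS))


def new_secret(nbytes: int = 32) -> str:
    """Generate a URL-safe random secret from nbytes of OS randomness."""
    return secrets.token_urlsafe(nbytes)


sample = """
DATABASE_URL=postgres://postgres:STRONG_PASSWORD@postgres:5432/dokuengine
JWT_SECRET=generate-with-openssl-rand-base64-32
LLM_PROVIDER=openai         # cloud
S3_ENDPOINT=http://minio:9000
"""
env = parse_env(sample)
print(find_placeholders(env))  # ['DATABASE_URL', 'JWT_SECRET']
print(new_secret())            # a fresh random value on every call
```

Running a check like this in CI or as a pre-start step catches the most common deployment mistake, which is shipping the example password.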

Step 3: Start the stack.

# Production stack
docker compose -f docker-compose.prod.yml up -d

# Verify all services are running (pass the same -f file so Compose inspects the production project)
docker compose -f docker-compose.prod.yml ps
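The status check can be automated by parsing the JSON form of the same command. The sketch below assumes `docker compose ps --format json` emits objects with `Service`, `State`, and `Health` fields. Depending on the Compose version the output is either a JSON array or one object per line, so the parser accepts both. Verify the field names against your Compose version before relying on this.

```python
import json


def parse_compose_ps(output: str) -> list[dict]:
    """Parse `docker compose ps --format json` output (array or one object per line)."""
    output = output.strip()
    if not output:
        return []
    if output.startswith("["):
        return json.loads(output)
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def unhealthy_services(output: str) -> list[str]:
    """Return services that are not running, or that report a failing healthcheck."""
    bad = []
    for container in parse_compose_ps(output):
        running = container.get("State") == "running"
        health = container.get("Health") or ""  # empty when no healthcheck is defined
        if not running or health not in ("", "healthy"):
            bad.append(container.get("Service", container.get("Name", "?")))
    return sorted(bad)


# Hand-written sample in the assumed format, not captured from a real deployment.
sample = "\n".join([
    '{"Service": "postgres", "State": "running", "Health": "healthy"}',
    '{"Service": "redis", "State": "running", "Health": ""}',
    '{"Service": "qdrant", "State": "exited", "Health": ""}',
])
print(unhealthy_services(sample))  # ['qdrant']
```

In practice you would feed it the output of `subprocess.run([...], capture_output=True, text=True).stdout` and fail the deployment script if the returned list is not empty.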

Step 4: Initialize the database.

make db-migrate
make db-seed

Step 5: Configure reverse proxy. For HTTPS (required in production), place Caddy or nginx in front of the DokuBrain web service. Caddy handles automatic certificate provisioning from Let's Encrypt with a single configuration line.

# Caddyfile
documents.yourcompany.com {
  reverse_proxy localhost:3000
}

Step 6: Test the deployment. Access https://documents.yourcompany.com, register the first admin account, upload a test document, and verify extraction runs successfully.

Keeping Documents Off External LLMs

The default DokuBrain configuration uses OpenAI for document understanding and embedding generation. For fully air-gapped or privacy-constrained deployments, you need to replace this with local inference.

Option 1: Ollama (recommended for most self-hosted deployments). Ollama runs open-source LLMs locally — Llama 3, Mistral, Qwen, and others. Configure in .env:

LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://ollama:11434
LLM_MODEL=llama3.1:8b
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text

Local models are slower than OpenAI API calls and require substantial memory: roughly 7-13GB for 8B models, and around 40GB or more for 70B models even at 4-bit quantization. For document extraction tasks, 8B models like Llama 3.1 8B perform adequately on structured document types. Complex reasoning tasks (contract clause analysis, multi-document comparison) benefit from larger models.
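Memory figures for local models follow from a simple rule of thumb: weight memory is roughly parameter count times bits per weight divided by eight, plus runtime overhead for the KV cache and buffers. The sketch below applies that arithmetic. The 1.2 overhead factor is an illustrative assumption, since real usage depends on context length and the inference runtime.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimated_ram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Weights plus a rough allowance for KV cache and runtime buffers.

    overhead = 1.2 is an assumed factor, not a measured one.
    """
    return weight_memory_gb(params_billion, bits_per_weight) * overhead


for params in (8, 70):
    for bits in (4, 8, 16):
        print(f"{params}B at {bits}-bit: about {estimated_ram_gb(params, bits):.1f} GB")
```

The same arithmetic shows why quantization matters for self-hosting: an 8B model drops from about 16GB of weights at 16-bit to about 4GB at 4-bit, which is the difference between needing a dedicated GPU server and fitting on a modest one.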

Option 2: vLLM or LM Studio. Higher-performance local inference options for organizations with GPU capacity.

Option 3: Private Azure OpenAI or AWS Bedrock. If your organization uses Azure or AWS with private endpoints, you can route LLM calls through your cloud provider's private network rather than the public OpenAI API. Documents stay within your cloud environment. Configure the appropriate endpoint URLs in .env.
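Whichever option you choose, it helps to audit the resulting configuration for settings that would still send document content to an external provider. The sketch below checks the `LLM_PROVIDER` and `EMBEDDING_PROVIDER` variables shown above. It treats a missing value as `openai` because the article states that the default configuration uses OpenAI. The `LOCAL_PROVIDERS` set and the function name are hypothetical and would need extending for vLLM or a private endpoint.

```python
# Providers considered local for this check (an assumption: extend as needed).
LOCAL_PROVIDERS = {"ollama"}


def external_egress(env: dict[str, str]) -> list[str]:
    """List provider settings that would route document content off the server."""
    findings = []
    for key in ("LLM_PROVIDER", "EMBEDDING_PROVIDER"):
        # An unset variable is treated as the OpenAI default described in the article.
        provider = env.get(key, "openai").strip().lower()
        if provider not in LOCAL_PROVIDERS:
            findings.append(f"{key}={provider}")
    return findings


print(external_egress({"LLM_PROVIDER": "ollama"}))
# ['EMBEDDING_PROVIDER=openai']  (embeddings would still leave the server)

print(external_egress({"LLM_PROVIDER": "ollama", "EMBEDDING_PROVIDER": "ollama"}))
# []
```

The first case is the one that is easy to miss: switching the LLM to a local model while embeddings continue to go to a cloud API.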

The Self-Hosting Landscape in 2026: What Your Options Actually Are

Beyond DokuBrain, the self-hosted document AI landscape is thin.

IBM Docling is an open-source Python library for document extraction — it handles PDF parsing, table extraction, and text chunking. It is not a complete platform: no web interface, no multi-user access, no workflow automation, no search. It is a component that developers use to build pipelines. Excellent for the extraction layer in a custom stack.

Marker is an open-source PDF-to-Markdown converter that runs locally. Similar scope to Docling — excellent extraction quality, no platform features.

Tesseract OCR is the dominant open-source OCR engine. It is accurate on clean documents but falls behind commercial alternatives on degraded scans and handwriting. It is widely integrated as the fallback OCR layer in open-source stacks.

Mayan EDMS is an open-source document management system with basic workflow features. More document management than document intelligence — limited AI extraction capability.

Enterprise on-premise options (Hyperscience, UiPath, ABBYY Vantage, Kofax, now Tungsten Automation) all offer on-premise deployment, but only through enterprise contracts with dedicated implementation and annual fees typically starting in the $50,000-150,000 range. They are not a realistic option for most organizations under 500 employees.

The practical conclusion: for organizations that need a full document intelligence platform with AI extraction, classification, search, RAG, and workflow automation — self-hosted — DokuBrain is currently the only accessible option. Organizations willing to build their own stack can assemble components (Docling for extraction, Qdrant for vectors, PostgreSQL for storage), but this requires significant engineering investment to maintain.

Security Considerations for Self-Hosted Deployments

Running document AI on your own infrastructure does not automatically mean it is secure. The following configurations are non-negotiable for production deployments handling sensitive documents.

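Network isolation. The backing services (PostgreSQL, Redis, Qdrant, MinIO) should not be exposed to the internet. Only the web frontend and API require external access. Use Docker networks to isolate internal services and avoid publishing their ports on the host: ports published by Docker bypass ufw rules by default, so a host firewall alone does not protect them.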
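A quick way to catch accidental exposure is to inspect the resolved Compose configuration for published ports on internal services. The sketch below works on the dictionary you would get from parsing `docker compose config --format json`. The service names and the port entry fields (`published`, `host_ip`) are assumptions about that output, so confirm them against your own compose file and Compose version.

```python
# Services that should never publish a port on a public interface (assumed names).
INTERNAL_SERVICES = {"postgres", "redis", "qdrant", "minio"}
LOOPBACK = {"127.0.0.1", "::1"}


def exposed_internal_services(config: dict, internal: set[str] = INTERNAL_SERVICES) -> dict[str, list[str]]:
    """Map each internal service to the host ports it publishes beyond loopback."""
    exposed: dict[str, list[str]] = {}
    for name, service in config.get("services", {}).items():
        if name not in internal:
            continue
        ports = [
            str(port.get("published"))
            for port in service.get("ports", [])
            if port.get("host_ip") not in LOOPBACK  # loopback-only bindings are fine
        ]
        if ports:
            exposed[name] = ports
    return exposed


# Hand-written sample in the assumed shape, not output from a real deployment.
sample_config = {
    "services": {
        "web": {"ports": [{"target": 3000, "published": "3000", "protocol": "tcp"}]},
        "postgres": {"ports": [{"target": 5432, "published": "5432", "protocol": "tcp"}]},
        "redis": {},
        "minio": {"ports": [{"target": 9000, "published": "9000", "host_ip": "127.0.0.1"}]},
    }
}
print(exposed_internal_services(sample_config))  # {'postgres': ['5432']}
```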

Authentication. Configure SSO/SAML integration for organizations with existing identity providers. Enable two-factor authentication on all admin accounts. Rotate JWT secrets on a schedule.

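Encryption at rest. Enable disk encryption on the server. Community PostgreSQL does not include transparent data encryption, so protect the database with full-disk or volume-level encryption (for example LUKS), or use a PostgreSQL distribution or extension that provides TDE. MinIO supports server-side encryption for stored documents.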

Audit logging. DokuBrain logs all document access, extraction runs, and administrative actions. Ensure logs ship to a separate system so they cannot be modified in the event of a security incident.

Backup. Daily automated backups of the PostgreSQL database and MinIO document storage to a separate location. Test restore procedures quarterly. A document intelligence system that loses your extracted data and document library is worse than not having one.
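A daily backup job needs two pieces of logic: the dump command and a retention rule. The sketch below builds a `pg_dump` invocation that runs inside the database container and selects expired backups for deletion. The service name, database name, user, and 30-day retention are assumptions based on the example configuration in this guide. The MinIO bucket needs its own copy step, which is not shown.

```python
from datetime import date, timedelta


def dump_filename(day: date, db: str = "dokuengine") -> str:
    """Name a dump file by database and ISO date, e.g. dokuengine-2026-03-31.dump."""
    return f"{db}-{day.isoformat()}.dump"


def pg_dump_command(db: str = "dokuengine", user: str = "postgres", service: str = "postgres") -> list[str]:
    """Build a command that writes a custom-format dump to stdout.

    -T disables TTY allocation so the binary output can be redirected to a file.
    """
    return ["docker", "compose", "exec", "-T", service, "pg_dump", "-U", user, "-Fc", db]


def backups_to_delete(backup_dates: list[date], today: date, keep_days: int = 30) -> list[date]:
    """Return backup dates strictly older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in backup_dates if d < cutoff)


today = date(2026, 3, 31)
existing = [date(2026, 2, 1), date(2026, 3, 1), date(2026, 3, 30)]
print(dump_filename(today))                 # dokuengine-2026-03-31.dump
print(backups_to_delete(existing, today))   # [datetime.date(2026, 2, 1)]
```

To run the dump, pass the command to `subprocess.run` with `stdout` set to an open binary file, then copy the result to storage outside the server. A backup that lives on the same disk as the database does not satisfy the "separate location" requirement.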

Updates. Pull updated Docker images regularly. The web app, API, and their underlying libraries receive security updates as vulnerabilities are patched, and those fixes only reach you when you pull new images. Air-gapped deployments need a separate procedure to receive and verify image updates.

When Self-Hosted Is and Isn't the Right Call

Self-host if:

Your documents are covered by HIPAA, GDPR, attorney-client privilege, or industry regulations, and your compliance requirements rule out third-party cloud processing
You have client confidentiality requirements that preclude cloud processing
You operate in an air-gapped or restricted network environment
Your document volumes are large enough that self-hosted infrastructure costs less than per-page cloud pricing (typically 50,000+ pages/month)
You have a technical team capable of managing Docker deployments and Linux servers

Use cloud if:

Your documents do not have data sovereignty requirements
You have no technical staff available for infrastructure management
Your volume is low and per-page costs are not material
You need to get started in hours rather than days

For most SMBs without compliance-driven requirements, DokuBrain's cloud deployment is simpler and immediately available. The self-hosted path is for teams where data sovereignty is non-negotiable, not a preference.
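The volume criterion above comes down to break-even arithmetic: the monthly page count at which a fixed self-hosted cost equals per-page cloud pricing. The sketch below shows the calculation. The dollar figures are made-up inputs chosen only to illustrate it, not quotes from any vendor, so substitute your own server, staff-time, and per-page numbers.

```python
def break_even_pages(monthly_self_hosted_cost: float, cloud_price_per_page: float) -> float:
    """Monthly page volume above which self-hosting is cheaper than per-page pricing."""
    if cloud_price_per_page <= 0:
        raise ValueError("cloud_price_per_page must be positive")
    return monthly_self_hosted_cost / cloud_price_per_page


def cheaper_option(pages_per_month: int, monthly_self_hosted_cost: float, cloud_price_per_page: float) -> str:
    """Compare the two monthly costs at a given volume."""
    cloud_cost = pages_per_month * cloud_price_per_page
    return "self-hosted" if monthly_self_hosted_cost < cloud_cost else "cloud"


# Hypothetical inputs: 900 per month for server and admin time, 0.03 per page in the cloud.
print(round(break_even_pages(900, 0.03)))   # 30000
print(cheaper_option(10_000, 900, 0.03))    # cloud
print(cheaper_option(60_000, 900, 0.03))    # self-hosted
```

Remember to include staff time for updates, monitoring, and backups in the self-hosted figure. Leaving it out makes the break-even point look lower than it is.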

Frequently Asked Questions

What is self-hosted document AI?

Self-hosted document AI refers to running document intelligence software on your own servers rather than sending documents to a third-party cloud service. Your documents never leave your environment. All processing — OCR, extraction, classification, search — runs on infrastructure you control.

Why would a company choose self-hosted over cloud?

The primary drivers are data sovereignty (documents never leave your control), compliance requirements (HIPAA, GDPR, legal privilege), client confidentiality, and air-gapped environments where external internet connectivity is restricted. Cost at scale is a secondary factor.

What infrastructure do you need to self-host document AI?

At minimum: 8 CPU cores, 16GB RAM, 100GB SSD. For production: 16+ cores, 32-64GB RAM, 500GB+ NVMe. The DokuBrain stack runs on Docker Compose and requires PostgreSQL, Redis, Qdrant, and MinIO — all containerized.

Which document AI platforms support self-hosting?

DokuBrain supports full self-hosting via Docker Compose with accessible pricing. Enterprise platforms (Hyperscience, UiPath, ABBYY) offer on-premise but at $50K+ annual fees. Most commercial platforms (Docsumo, Nanonets, Rossum) are cloud-only. Open-source tools like Docling and Marker cover extraction components but are not complete platforms.

Is self-hosted document AI harder to maintain than cloud?

Self-hosted requires infrastructure management: monitoring, updates, backups. Cloud handles this invisibly. Docker Compose deployments are manageable for technical teams without dedicated DevOps staff — updates are single commands and backups follow standard procedures. The operational burden is real but not prohibitive for organizations with a developer or IT generalist.


Originally published on DokuBrain Blog. DokuBrain is an intelligent document processing platform for SMBs, legal teams, and compliance teams.
