How I bypassed Vercel Serverless timeouts to build a decoupled document ingestion pipeline

If you’ve ever tried to build an asynchronous document-processing or RAG pipeline on Next.js API routes hosted on Vercel, you know that even after raising the maxDuration limit, keeping compute-heavy work inline in serverless routes gets messy.

When a user uploads large PDFs or batches of data sources, parsing the text layers, chunking them semantically, and running batch embedding requests takes serious time. Running that much I/O inside a synchronous, request-bound function window leaves you managing brittle state: if the function times out halfway through a batch, the document is left partially processed.

To fix this, I decoupled my stack into an asynchronous architecture built around a background worker. Here is how the data flows.

The Stack Architecture
Ingress Layer (Next.js): The API endpoints handle only request validation, file storage organization, and idempotency keys (managed via Upstash Redis).
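
Here is a minimal sketch of the idempotency check. It assumes the client sends its key in a request header such as Idempotency-Key, and that the store exposes a set call with Redis SET ... NX EX semantics. The Upstash @upstash/redis client offers set(key, value, { nx: true, ex }) for this. The IdempotencyStore interface, claimIdempotencyKey, and the in-memory stand-in are hypothetical names chosen so the example runs without a Redis instance.

```typescript
// Hypothetical interface mirroring Redis `SET key value NX EX <seconds>`.
// Resolves to "OK" when the key was set, or null when it already existed.
interface IdempotencyStore {
  set(key: string, value: string, opts: { nx: true; ex: number }): Promise<"OK" | null>;
}

// In-memory stand-in for local testing; production would use a Redis client.
class InMemoryStore implements IdempotencyStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async set(key: string, value: string, opts: { nx: true; ex: number }): Promise<"OK" | null> {
    const now = Date.now();
    const existing = this.entries.get(key);
    // NX: refuse to overwrite a key that exists and has not expired yet.
    if (existing && existing.expiresAt > now) return null;
    this.entries.set(key, { value, expiresAt: now + opts.ex * 1000 });
    return "OK";
  }
}

// Returns true if this request is the first to use `key` within the TTL window,
// false if it is a retry/duplicate that should not enqueue a second job.
async function claimIdempotencyKey(
  store: IdempotencyStore,
  userId: string,
  key: string,
  ttlSeconds = 24 * 60 * 60,
): Promise<boolean> {
  // Scope keys per user so two tenants can't collide on the same key string.
  const result = await store.set(`idem:${userId}:${key}`, "1", { nx: true, ex: ttlSeconds });
  return result === "OK";
}
```

In a route handler, you would call this before enqueuing anything. On false, you could return the earlier response or a 409 rather than starting duplicate work.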

The Worker Queue (BullMQ + TCP Redis): Instead of processing files inline, the Next.js route enqueues the task into a BullMQ queue. Because BullMQ needs a persistent, low-latency TCP connection to Redis, this Redis instance is hosted on Railway alongside the workers.
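
Here is one way the route could hand work to BullMQ, sketched under some assumptions. The job name, the payload fields, and the buildIngestJob helper are hypothetical. The option names (jobId, attempts, backoff, removeOnComplete, removeOnFail) are standard BullMQ job options. Reusing the idempotency key in the jobId means BullMQ will not add a second job while one with that ID is still stored. The helper only builds the arguments, so it runs without Redis. In the route, you would pass them to queue.add(job.name, job.data, job.opts) on a Queue from bullmq that was created with connection: { host, port } pointing at the Railway Redis.

```typescript
// Shape of a document-ingest job; field names are hypothetical.
interface IngestJobData {
  userId: string;
  objectKey: string; // location of the uploaded file in object storage (e.g. R2)
  passthrough: boolean;
  webhookUrl?: string;
}

// Subset of BullMQ's JobsOptions used here.
interface IngestJobOptions {
  jobId: string;
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnComplete: { count: number };
  removeOnFail: { count: number };
}

function buildIngestJob(
  data: IngestJobData,
  idempotencyKey: string,
): { name: string; data: IngestJobData; opts: IngestJobOptions } {
  return {
    name: "ingest-document",
    data,
    opts: {
      // Deterministic ID: a retried HTTP request maps to the same job, and BullMQ
      // skips adding a job whose ID already exists in the queue.
      // ":" is avoided because BullMQ uses it as a Redis key separator.
      jobId: `ingest-${data.userId}-${idempotencyKey}`,
      // Retry transient failures (network, rate limits) with exponential backoff.
      attempts: 3,
      backoff: { type: "exponential", delay: 5_000 },
      // Keep a bounded history so deduplication still works for recent jobs.
      removeOnComplete: { count: 1000 },
      removeOnFail: { count: 5000 },
    },
  };
}
```

One trade-off is that deduplication by jobId only holds while the job record exists. Aggressive removeOnComplete settings shorten that window, which is why the idempotency key is also checked at the ingress layer.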

Persistent Worker Instance: A standalone Node.js process running on Railway consumes jobs from the BullMQ queue. Because it runs as a long-lived server process, serverless execution limits no longer apply. It streams files from Cloudflare R2, runs semantic paragraph chunking, and generates text embeddings in parallel batches.
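
The worker's chunking and batching steps can be approximated as follows. This is a simplified stand-in rather than the actual implementation. It chunks on paragraph boundaries (blank lines) under a character budget instead of doing true semantic, embedding-similarity chunking. embedBatch is a hypothetical placeholder for the embeddings API call. The OpenAI embeddings endpoint accepts an array of strings as input, which is what makes one request per batch possible. chunkParagraphs and embedInBatches are illustrative names.

```typescript
// Split text on blank lines, then greedily pack paragraphs into chunks of at most
// `maxChars` characters. A paragraph longer than `maxChars` becomes its own chunk.
function chunkParagraphs(text: string, maxChars = 2000): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (candidate.length <= maxChars || current === "") {
      current = candidate;
    } else {
      chunks.push(current);
      current = p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Embed chunks in batches of `batchSize`, running at most `concurrency` batch
// requests at once. Results come back in the same order as the input chunks.
async function embedInBatches(
  chunks: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 100,
  concurrency = 4,
): Promise<number[][]> {
  const batches: string[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    batches.push(chunks.slice(i, i + batchSize));
  }

  const results = new Array<number[][]>(batches.length);
  let next = 0;
  // Each runner claims the next unprocessed batch index until none remain.
  // The check and increment happen without an await in between, so no two
  // runners can claim the same index.
  async function runner(): Promise<void> {
    while (next < batches.length) {
      const index = next++;
      results[index] = await embedBatch(batches[index]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(concurrency, batches.length) }, runner));
  return results.flat();
}
```

The concurrency cap keeps a large upload from firing hundreds of simultaneous requests into the embeddings provider's rate limits. Writing results by batch index preserves chunk order even when batches finish out of order.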

Concurrency Safety and Data Privacy
Handling high-volume API requests means designing for multi-tenant safety. To prevent race conditions on quota meters when several files for the same user are processed concurrently, the platform locks the user's quota row with a Postgres SELECT ... FOR UPDATE inside an explicit transaction, then updates the token balance before committing.
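
Here is a sketch of that quota debit. It assumes a hypothetical user_quotas(user_id, tokens_remaining) table and a client whose query(text, params) resolves to { rows }, which is roughly the shape of a node-postgres client checked out with pool.connect(). The whole transaction must run on that single connection. pool.query() may send each statement to a different pooled connection, which would silently defeat the lock. debitTokens and QuotaExceededError are illustrative names.

```typescript
// Minimal client shape, roughly matching a `pg` PoolClient.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

class QuotaExceededError extends Error {
  name = "QuotaExceededError";
}

// Atomically check and decrement a user's token balance. SELECT ... FOR UPDATE
// locks the user's row until COMMIT/ROLLBACK, so concurrent jobs for the same
// user wait for each other instead of both reading the same stale balance.
async function debitTokens(client: SqlClient, userId: string, tokens: number): Promise<number> {
  await client.query("BEGIN");
  try {
    const { rows } = await client.query(
      "SELECT tokens_remaining FROM user_quotas WHERE user_id = $1 FOR UPDATE",
      [userId],
    );
    if (rows.length === 0) throw new Error(`no quota row for user ${userId}`);
    const remaining = Number(rows[0].tokens_remaining);
    if (remaining < tokens) throw new QuotaExceededError(`needs ${tokens}, has ${remaining}`);

    await client.query(
      "UPDATE user_quotas SET tokens_remaining = tokens_remaining - $2 WHERE user_id = $1",
      [userId, tokens],
    );
    await client.query("COMMIT");
    return remaining - tokens;
  } catch (err) {
    // Release the row lock and discard any partial changes.
    await client.query("ROLLBACK");
    throw err;
  }
}
```

With node-postgres, you would obtain the client via await pool.connect() and call client.release() in a finally block. A single conditional UPDATE ... WHERE tokens_remaining >= $2 RETURNING ... would also be atomic. The explicit row lock is mainly useful when the check involves more logic than one comparison.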

Furthermore, for teams with data privacy or compliance requirements, or who simply want to avoid third-party database lock-in, I implemented a strict Stateless Pass-Through Mode. When a request sets passthrough: true, the background worker processes the document, generates the raw 1,536-dimension float arrays with OpenAI, sends the payload back via an asynchronous webhook, and then drops it from memory without writing it to any storage. Zero data retention.
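
Here is a sketch of the pass-through delivery step, with several assumptions. The payload shape, the X-Signature header name, and HMAC-SHA256 signing with a shared secret are illustrative choices, not ContextFlow AI's documented webhook schema. The code uses Node's built-in crypto module, and the caller passes in the fetch implementation (for example, the global fetch in Node 18+) so the step can be tested offline. It also checks that every vector has the expected 1,536 dimensions before sending. sign, verify, and deliverPassthrough are hypothetical names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical payload shape for a pass-through result.
interface PassthroughPayload {
  event: "document.embedded";
  documentId: string;
  chunks: { index: number; text: string; embedding: number[] }[];
}

const EXPECTED_DIMENSIONS = 1536;

// HMAC-SHA256 over the exact body string, hex-encoded.
function sign(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Receiver-side check; timingSafeEqual avoids leaking how many bytes matched.
function verify(body: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(sign(body, secret), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

async function deliverPassthrough(
  url: string,
  secret: string,
  payload: PassthroughPayload,
  fetchImpl: FetchLike,
): Promise<void> {
  for (const c of payload.chunks) {
    if (c.embedding.length !== EXPECTED_DIMENSIONS) {
      throw new Error(`chunk ${c.index}: expected ${EXPECTED_DIMENSIONS} dims, got ${c.embedding.length}`);
    }
  }
  const body = JSON.stringify(payload);
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-Signature": sign(body, secret) },
    body,
  });
  // Throwing lets the queue's retry/backoff policy redeliver on failure.
  if (!res.ok) throw new Error(`webhook responded ${res.status}`);
  // Nothing is written to disk or a database here; once the caller drops its
  // references, the payload becomes eligible for garbage collection.
}
```

JavaScript has no way to force memory to be wiped. In practice, "flushing" means never persisting the data and not holding references to it after delivery.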

Open Beta & Feedback
I've packaged this decoupled pipeline into a developer tool called ContextFlow AI.

The public beta is live and free to try. You can check out the landing page, read the documentation, and inspect the JSON payload schemas directly at https://usecontextflow.com.

I'd love to get your thoughts on the webhook event schemas or how you are structuring background queues for your own AI applications!
