Serverless Research Paper Intelligence: Docling, Lambda Containers, and Amazon Bedrock
Romina Elena · 2026-05-27 · via DEV Community

1. 🚀 Introduction

Processing scientific PDFs is not as simple as extracting text.

Many papers include tables, multiple columns, formulas, figures, and structures that can easily break when we use traditional extractors.
The problem becomes even bigger when those documents are private. We do not always want to depend completely on multimodal models to analyze them, and the cost can also grow quickly when we work with many files.

A few months ago, I attended PyData Berlin and during one of the talks I discovered IBM Docling, an open source project focused on intelligent document processing. What caught my attention the most was its ability to extract structured information from complex PDFs, especially scientific documents with tables, multiple columns, formulas, and layouts that are difficult to process with traditional tools.

From that moment, I started thinking about how to bring this type of processing to the cloud in a simple and scalable way, while also keeping costs under control. Some current solutions for analyzing complex documents with generative AI rely heavily on multimodal models, but in scenarios where we work with large volumes of papers or private documents, cost and privacy can quickly become a problem.

If you have read some of my previous articles, you have probably seen that I like to build content around a real use case. In this tutorial, I decided to work with scientific papers related to research on GLP-1 receptor agonists, a class of medications widely studied for type 2 diabetes and obesity.

These treatments are currently very popular because many people use them for weight loss purposes.


The objective of the tutorial

The idea is not to build a generic search engine over the internet, but something much more interesting: a private knowledge base where you can query only your own research documents in a secure environment.
To solve this, we are going to build an architecture based on:

  • 📦 AWS Lambda Containers
  • 📑 Amazon Bedrock Knowledge Bases
  • 🐣 PDF processing with Docling
  • 🪣 Storage in Amazon S3
  • ✂️ Chunking strategies to improve information retrieval

During the tutorial, I will also show several real problems that I found while implementing this solution:

  • 〰️ size limits in Lambda,
  • 〰️ timeouts caused by model downloads,
  • 〰️ Docker image optimization,
  • 〰️ scientific document processing,
  • 〰️ and architecture decisions to keep a serverless and low cost approach.

The final objective will be to transform a set of scientific papers into a knowledge base that can be queried using natural language. This will allow us to ask questions about adverse effects, clinical criteria, study results, and comparisons between different research papers.


2. 🧪 Use case

In this tutorial, we are going to work with a set of scientific papers related to research on GLP-1 receptor agonists. These medications mimic glucagon-like peptide-1 (GLP-1), a natural hormone involved in glucose regulation, insulin secretion, and the feeling of fullness.

In recent years, different treatments based on this family of molecules have appeared, and a large number of clinical studies, academic papers, and research documents have been published. These documents are related to cardiovascular outcomes, weight loss, adverse effects, and inclusion or exclusion criteria in clinical trials.

The objective of this use case is not to build a search engine over the internet or use public information in real time. The idea is to work with a private and curated set of scientific documents, simulating a scenario where researchers, medical teams, or research areas need to query only their own papers in a secure environment.

For this MVP, I am going to use 10 public papers as an example dataset, but the architecture is designed for scenarios where the documents can be private or belong to internal research processes.
From these documents, we are going to build a knowledge base that allows queries using natural language, for example:

  • 〰️ identify adverse effects reported in different studies,
  • 〰️ compare results between treatments,
  • 〰️ validate exclusion criteria in clinical trials,
  • 〰️ analyze cardiovascular outcomes,
  • 〰️ retrieve specific information across multiple scientific papers.

3. 🏗️ Solution Architecture

Before going into the theoretical concepts, we are going to describe the solution that we will build.

This solution is based on a serverless architecture that processes scientific papers in PDF format and later uses them as input for an Amazon Bedrock Knowledge Base to build a RAG system.

The architecture clearly separates the ingestion and processing flow from the intelligent query flow, while keeping the solution simple and scalable.

The following blueprint shows how each component connects inside the complete pipeline.

In summary, this pipeline processes PDF files using a Python Docker image with Docling, running inside an AWS Lambda function based on a container image.

This Lambda function transforms the files into structured documents in Markdown.

Then, these documents are stored in Amazon S3 and indexed by Amazon Bedrock, which generates embeddings and allows semantic queries over the content.
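To make the flow above more concrete, here is a minimal sketch of what the processing Lambda could do for a single paper: receive an S3 URL, download the PDF, convert it, and upload the Markdown. The `s3_url`, `output`, and `status` fields match the orchestration code shown later in this article; everything else (the function names, the `markdown/` output prefix, and the injected `download`, `convert`, and `upload` callables) is hypothetical and is not taken from the repository.

```python
from pathlib import PurePosixPath


def parse_s3_url(s3_url):
    """Split 's3://bucket/path/file.pdf' into (bucket, key)."""
    prefix = "s3://"
    if not s3_url.startswith(prefix):
        raise ValueError(f"Not an S3 URL: {s3_url!r}")
    bucket, _, key = s3_url[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URL needs a bucket and a key: {s3_url!r}")
    return bucket, key


def markdown_key_for(pdf_key, prefix="markdown"):
    """Map 'papers/study.pdf' to 'markdown/study.md' (hypothetical layout)."""
    return f"{prefix}/{PurePosixPath(pdf_key).stem}.md"


def process_paper(event, download, convert, upload):
    """Run one PDF through download -> convert -> upload.

    download(bucket, key) returns a local path, convert(path) returns a
    Markdown string, and upload(bucket, key, text) stores the result.
    The three callables are injected so the flow can be tested without
    AWS or Docling.
    """
    try:
        bucket, key = parse_s3_url(event["s3_url"])
        local_path = download(bucket, key)
        markdown = convert(local_path)
        out_key = markdown_key_for(key)
        upload(bucket, out_key, markdown)
    except Exception as exc:
        # Report the failure to the caller instead of crashing the invocation
        return {"status": "error", "output": None, "error": str(exc)}
    return {"status": "ok", "output": f"s3://{bucket}/{out_key}"}
```

In a real function, `download` and `upload` would wrap boto3 S3 calls and `convert` would wrap Docling's `DocumentConverter`. Keeping them as parameters is only a design choice for this sketch.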


4. 📑 Docling: structured document extraction

One of the main challenges when working with scientific PDFs is that they are not “simple” documents. They are full of tables, columns, formulas, figures, and complex layouts that are not always preserved correctly when text is extracted.

IBM Docling is an open source library designed for PDF extraction and document structuring. Its goal is not only to extract text, but also to convert complex documents into a structured representation that can be used in artificial intelligence pipelines and RAG systems.

Instead of returning messy plain text, Docling tries to preserve the structure of the document, including the reading order, tables, formulas, images, and other key elements of the content.

The following image summarizes some of the key benefits of using Docling for complex document processing.


Why use Docling?

Traditional tools like PyPDF, PDFPlumber, or classic OCR are usually enough for simple documents, but they can struggle when working with scientific papers that have complex layouts.

In these cases, important information can be lost, such as:

  • 〰️ table structure
  • 〰️ column separation
  • 〰️ relationship between text and figures
  • 〰️ mathematical formulas

Docling appears as an alternative that tries to solve exactly these problems, generating a much more consistent output for later analysis.


Docling features

Below, you can find the main features published by the library on Hugging Face:

  1. 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
  2. 🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
  3. 📐 Layout and Localization – Preserves document structure and document element bounding boxes.
  4. 💻 Code Recognition – Detects and formats code blocks, including indentation.
  5. 🔢 Formula Recognition – Identifies and processes mathematical expressions.
  6. 📊 Chart Recognition – Extracts and interprets chart data.
  7. 📑 Table Recognition – Supports column and row headers for structured table extraction.
  8. 🖼️ Figure Classification – Differentiates figures and graphical elements.
  9. 📝 Caption Correspondence – Links captions to relevant images and figures.
  10. 📜 List Grouping – Organizes and structures list elements correctly.
  11. 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion including all page elements (code, equations, tables, charts etc.)
  12. 🔲 OCR with Bounding Boxes – OCR regions using a bounding box.
  13. 📂 General Document Processing – Trained for both scientific and non-scientific documents.

🏥 Practical example: processing a medical record with Docling

In this example, we will use a synthetically generated clinical record in PDF format to show how Docling can extract and structure information from a healthcare document.

All patient data, medical records, and clinical findings are completely fictional and were created only for educational purposes. No real patient information was used.

This example represents a common use case in the healthcare industry, where medical documents need to be processed, structured, and prepared for analysis with AI.

In the next steps, we will use Docling to:

  • 〰️ load and convert the PDF
  • 〰️ explore the document structure and identify sections
  • 〰️ extract structured patient data into a pandas DataFrame

📌The following image shows part of the clinical record that we will process in this example.


📑 Loading and converting the PDF

In this step, we load the clinical record PDF using Docling's DocumentConverter.

Docling automatically detects the document structure and exports the result in two formats:

  • 〰️ Markdown: a human readable output to preview the content
  • 〰️ Dictionary: programmatic access to text, tables, images, and metadata

This structured output is what makes Docling more powerful than a basic PDF text extractor.

from docling.document_converter import DocumentConverter, PdfFormatOption
import pandas as pd
converter = DocumentConverter()
result = converter.convert("clinical_history_structured.pdf")
# export markdown
data_markdown = result.document.export_to_markdown()
# export dict
data_dict = result.document.export_to_dict()
texts = data_dict['texts']



🗂️ Exploring document sections

Every clinical record is organized into sections. Here, we extract all the section headers detected by Docling, such as Patient Identification, Chief Complaint, and Laboratory Results.

This gives us:

  • 〰️ A map of the document structure
  • 〰️ The ability to target specific sections for downstream processing

[item['text'] for item in data_dict['texts'] if item['label'] == 'section_header']


['CITYVIEW MEDICAL CENTER CLINICAL HISTORY AND RECORD',
 '1. PATIENT IDENTIFICATION',
 '4. PAST MEDICAL HISTORY',
 '5. MEDICATIONS',
 '6. ALLERGIES',
 '2. CHIEF COMPLAINT',
 '3. HISTORY OF PRESENT ILLNESS',
 '7. FAMILY HISTORY',
 '8. SOCIAL HISTORY',
 '9. REVIEW OF SYSTEMS',
 '10. PHYSICAL EXAMINATION',
 '12. LABORATORY RESULTS',
 '13. ASSESSMENT',
 '14. PLAN',
 '11. IMAGING']
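Notice that the headers in the output above are not in numeric order (for example, section 4 appears before section 2). If a downstream step needs them in document order, one option is to sort by the leading section number. The following is a small sketch using only the standard library; the `section_number` helper is a hypothetical name, and the list is a shortened copy of the output above.

```python
import re


def section_number(header):
    """Return the leading number of a header such as '12. LABORATORY RESULTS'.

    Headers without a number (for example the document title) get 0,
    so they sort first.
    """
    match = re.match(r"\s*(\d+)\.", header)
    return int(match.group(1)) if match else 0


headers = [
    "CITYVIEW MEDICAL CENTER CLINICAL HISTORY AND RECORD",
    "1. PATIENT IDENTIFICATION",
    "4. PAST MEDICAL HISTORY",
    "2. CHIEF COMPLAINT",
    "12. LABORATORY RESULTS",
    "11. IMAGING",
]

# sorted() is stable, so headers with the same number keep their relative order
ordered = sorted(headers, key=section_number)
for header in ordered:
    print(header)
```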



🧩 Extracting patient data as a structured table

Now we extract the content of the first section, Patient Identification, by filtering the items that belong to #/groups/0.

Docling preserves the key value layout of the original PDF, so we can split the flat list into field names and values using Python slice notation.

The result is a clean pandas DataFrame ready for:

  • 〰️ Analysis
  • 〰️ Storage
  • 〰️ Downstream AI processing

# Filter group 0
group_0 = [item['orig'] for item in texts 
           if item.get('parent', {}).get('$ref') == '#/groups/0']
# Convert the flat list into key value pairs
keys   = group_0[0::2]  
values = group_0[1::2]  
df = pd.DataFrame({
    'field': [k.replace(':', '').strip() for k in keys],
    'value': values
})
df
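The slicing above assumes that the flat list strictly alternates between a label and its value. As a hedged sketch, the same pairing can be written without pandas and with a guard for the case where that assumption breaks (for example, when a field has an empty value and the list ends up with an odd length). The `pairs_from_flat` helper and the sample values are made up for illustration; they are not the fields of the PDF used in this article.

```python
def pairs_from_flat(items):
    """Pair alternating label/value strings into a dict.

    Raises ValueError when the list has an odd length, because the
    label/value alternation can no longer be trusted in that case.
    """
    if len(items) % 2 != 0:
        raise ValueError(f"Expected an even number of items, got {len(items)}")
    labels = (label.replace(":", "").strip() for label in items[0::2])
    values = (value.strip() for value in items[1::2])
    return dict(zip(labels, values))


# Fictional sample shaped like the flat list extracted from group 0
sample = ["Patient Name:", "Jane Doe", "Date of Birth:", "1980-01-01"]
print(pairs_from_flat(sample))
```

The resulting dictionary can be passed to `pd.DataFrame(list(d.items()), columns=["field", "value"])` if a DataFrame is still needed.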



result.document.tables[0].export_to_dataframe()



5. ⚡ AWS Lambda

AWS Lambda is a serverless service that allows you to run code without managing infrastructure. It scales automatically and you only pay for what you use.
It is commonly used for file processing, service integration, scheduled tasks, and real time event processing.

However, even though it is one of the most used services in the serverless ecosystem, some limitations appear quickly when we start working with heavier workloads or complex dependencies.
Some of the main limitations are:

  • 〰️ memory and CPU limits
  • 〰️ maximum execution timeout
  • 〰️ deployment package size restrictions
  • 〰️ the need to use ZIP files or Layers for dependencies
  • 〰️ cold starts in heavier workloads

These restrictions mean that, in some cases, traditional Lambda is not enough to run workloads such as intensive PDF processing or libraries with large dependencies.

6. 🐳 AWS Lambda Containers

To solve part of these limitations, AWS Lambda allows you to run functions using container images instead of ZIP packages.
This approach allows you to package the function as a Docker image, push it to Amazon Elastic Container Registry, and run it directly from Lambda.
The main advantage is that it raises the size limit significantly: a container image can be up to 10 GB, compared with 250 MB (unzipped) for a ZIP deployment package. This makes it possible to include heavy dependencies, predownloaded models, or complex libraries like Docling without needing workarounds with Layers.
In this project, this option is key because it allows us to run Docling inside Lambda without compromising dependencies or the runtime.
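As a rough illustration of that decision, the sketch below compares a dependency footprint against the two size limits. The limits are the Lambda quotas as I understand them at the time of writing (250 MB unzipped for ZIP packages including layers, 10 GB for container images); the `choose_packaging` helper and the example sizes are hypothetical.

```python
ZIP_UNZIPPED_LIMIT_MB = 250            # ZIP package plus layers, unzipped
CONTAINER_IMAGE_LIMIT_MB = 10 * 1024   # container image limit (10 GB)


def choose_packaging(size_mb):
    """Suggest a Lambda packaging type for a given footprint in MB."""
    if size_mb <= ZIP_UNZIPPED_LIMIT_MB:
        return "zip"
    if size_mb <= CONTAINER_IMAGE_LIMIT_MB:
        return "container"
    return "too large for Lambda"


# Hypothetical footprints: a small utility function and an ML-heavy image
print(choose_packaging(120))
print(choose_packaging(3500))
```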
The following image summarizes the key benefits of using Lambda Containers for this type of workload.


Deploying a Docling Lambda Container to AWS

As we saw in the previous section, the limitations of traditional Lambda make it difficult to run heavy libraries like Docling using ZIP packages or Layers.
To solve this, we are going to run AWS Lambda from a container image. This allows us to package Docling, its dependencies, and its models inside a Docker image, and deploy it using Amazon Elastic Container Registry (ECR).
In this section, we are going to build the image, push it to AWS, and use it inside Lambda to process our scientific papers.
The following image shows the deployment flow that we will follow step by step.


Prerequisites

Before starting, you need to have:

  1. AWS CLI installed and configured
  2. Docker installed with buildx support
  3. Repository cloned locally
  4. Amazon S3 bucket named docling-papers-tutorial, with the PDFs that we are going to process already uploaded
  5. You also need an IAM user with permissions to create images in ECR and deploy Lambda functions. In the repository, you will find JSON files with the required policies inside iam/user_policies.

GitHub repository



Building the Docker image

Once the repository is cloned, we start by configuring the environment variables required for the deployment.

•••••

Setup

Create a .env file with your credentials:

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_DEFAULT_REGION=us-east-1


Then export the variables:

export $(grep -v '^#' .env | xargs)
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity \
--query Account --output text)
export ECR_REPO_NAME=docling-lambda
export LAMBDA_FUNCTION_NAME=docling-lambda
export IMAGE_NAME=docling-lambda


⚠️ Remember to add .env to your .gitignore.

•••••

Step 1: Verify your AWS identity

Before deploying, verify which AWS account and IAM user are currently configured in your environment.

aws sts get-caller-identity


•••••

Step 2: Authenticate Docker with Amazon ECR

This command generates a temporary ECR authentication token and passes it to docker login, so Docker can push images to your private ECR registry.

aws ecr get-login-password --region $AWS_DEFAULT_REGION | \
  docker login --username AWS --password-stdin \
  $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com


⚠️ This token expires after 12 hours. Run this step again if you get authentication errors.

•••••

Step 3: Build the Docker image

Now we build the Docker image from the Dockerfile.

docker buildx build \
  --platform linux/amd64 \
  --provenance=false \
  --sbom=false \
  --no-cache \
  --load \
  -t $IMAGE_NAME .


The most important flags are:

| Flag | Description |
| --- | --- |
| --platform linux/amd64 | Builds for the x86_64 architecture, which must match the architecture configured on the Lambda function (x86_64 is the default). This flag is needed when building on an Apple Silicon Mac, such as M1, M2, or M3. |
| --provenance=false | Disables build attestation metadata, which can cause issues with Lambda image deployments. |
| --sbom=false | Disables Software Bill of Materials generation, which can also cause issues with Lambda deployments. |
| --no-cache | Builds the image from scratch, ignoring cached layers. |
| --load | Loads the image into your local Docker daemon after building. |
| -t $IMAGE_NAME | Tags the image with the selected image name. |
•••••

Step 4: Tag the image for ECR

Before pushing the image, we need to create a new tag that points to the full ECR repository URI.
Docker requires the image name to match the complete ECR URI before it can push the image to the registry.

docker tag $IMAGE_NAME:latest \
$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$ECR_REPO_NAME:latest
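If you prefer to script this step, the same URI can be assembled in Python from the environment variables exported earlier. This is a minimal sketch: the function names are hypothetical, and the only rule checked is that an AWS account ID has 12 digits.

```python
import os


def ecr_image_uri(account_id, region, repo_name, tag="latest"):
    """Build '<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>'."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"


def ecr_image_uri_from_env(env=None):
    """Read the variables exported in the Setup step and build the URI."""
    env = os.environ if env is None else env
    return ecr_image_uri(
        env["AWS_ACCOUNT_ID"],
        env["AWS_DEFAULT_REGION"],
        env["ECR_REPO_NAME"],
    )
```

A missing variable raises a `KeyError`, which is usually easier to diagnose than a `docker tag` command that silently receives an empty value.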


•••••

Step 5: Verify that the image exists locally

Before pushing the image to ECR, confirm that it exists in your local Docker environment.

docker images


The image should appear with both tags: the local tag and the ECR tag.

•••••

Step 6: Push the image to ECR

Now we push the image to your private ECR repository.

This step may take several minutes because the Docling image is large due to the ML models included inside the container.

docker push \
$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$ECR_REPO_NAME:latest


•••••

Step 7: Update the Lambda function

Run this step only if you need to update an existing Lambda function with a new image version.

aws lambda update-function-code \
--function-name $LAMBDA_FUNCTION_NAME \
--image-uri $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$ECR_REPO_NAME:latest


This command tells AWS Lambda to use the new image that you just pushed to ECR.
Lambda will pull the image from ECR and deploy it automatically.
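The same update can be done from Python with boto3. The sketch below is one possible version, not the script from the repository: it calls `update_function_code` and then uses the `function_updated_v2` waiter, because the call returns before the new image is actually live. The `deploy_image` name is hypothetical, and the client is passed in as a parameter so the function can be tested without AWS credentials.

```python
def deploy_image(lambda_client, function_name, image_uri):
    """Point an existing Lambda function at a new image and wait for it.

    lambda_client is expected to behave like boto3.client("lambda").
    Returns the CodeSha256 reported by Lambda for the new code.
    """
    response = lambda_client.update_function_code(
        FunctionName=function_name,
        ImageUri=image_uri,
    )
    # The update is asynchronous: block until Lambda reports it as finished
    waiter = lambda_client.get_waiter("function_updated_v2")
    waiter.wait(FunctionName=function_name)
    return response.get("CodeSha256")


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   deploy_image(boto3.client("lambda"), "docling-lambda", image_uri)
```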


7. 🧯 Real problems during the deployment

What I had to solve to run Docling on Lambda

Up to this point, the flow looks relatively simple: build the image, push it to ECR, and deploy the Lambda function.
However, when working with heavy libraries like Docling, several problems started to appear. These problems were related to the image size, the Lambda runtime, and the download of models during execution.
This section summarizes some of the real problems I found during the implementation and the solutions I finally applied.

•••••

Reducing the image size

One of the first problems I ran into was related to the Docker image size. When working with libraries like Docling, which include ML models and multiple heavy dependencies, the final image can grow considerably.
To avoid issues during the build and push process, I added a cleanup step inside the Dockerfile to remove temporary files, __pycache__ folders, and compiled .pyc files.

# Clean up temporary files to reduce image size
RUN find /var/lang/lib/python3.12/site-packages \
-type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true && \
find /var/lang/lib/python3.12/site-packages \
-type f -name "*.pyc" -delete


Although this may look like a small optimization, this type of cleanup helps reduce the final size of images with many Python dependencies.

•••••

Avoiding timeouts and model downloads at runtime

Another important problem appeared during the first executions of the Lambda function.

In the version used in this project, Docling tried to automatically download the models at startup if they were not found locally. This caused timeouts and also exposed another issue: the Lambda filesystem is read-only outside the temporary directory (/tmp), which means models cannot be downloaded or saved anywhere else at runtime.

To solve this, I decided to predownload the models during the Docker build and store them directly inside the image.

In the Dockerfile, I added the following:

# Copy and run model download script
COPY download_models.py /tmp/download_models.py

RUN mkdir -p /opt/docling-models && \
python3.12 /tmp/download_models.py && \
rm /tmp/download_models.py


The script initializes a DocumentConverter, which forces the required Docling models to be downloaded during the image build instead of during Lambda execution.

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from pathlib import Path

def main():

   artifacts_path = Path("/opt/docling-models")

   pipeline_options = PdfPipelineOptions(
       artifacts_path=artifacts_path,
       do_ocr=False
   )

   converter = DocumentConverter(
       format_options={
           InputFormat.PDF: PdfFormatOption(
               pipeline_options=pipeline_options
           )
       }
   )
if __name__ == "__main__":
   main()
   print("✓ Models downloaded successfully")


With this approach, the models are packaged inside the container and the Lambda function can start much faster, avoiding unnecessary downloads and problems related to the restricted filesystem.
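One way to catch a broken image early is to check, at cold start, that the models directory really contains files, and fail with a clear message instead of waiting for a download attempt to time out. This is a small sketch under that assumption; `/opt/docling-models` is the path used in the Dockerfile above, while the `assert_models_present` helper is hypothetical.

```python
from pathlib import Path

MODELS_DIR = Path("/opt/docling-models")  # path populated during the image build


def assert_models_present(models_dir=MODELS_DIR):
    """Return the number of model files, or raise if none are packaged."""
    models_dir = Path(models_dir)
    files = []
    if models_dir.is_dir():
        files = [p for p in models_dir.rglob("*") if p.is_file()]
    if not files:
        raise RuntimeError(
            f"No model files found in {models_dir}; "
            "rebuild the image with the model download step"
        )
    return len(files)
```

Calling this once at module import time, outside the handler, keeps the check out of the per-request path.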


8. 🔁 Orchestrated paper processing

The following function corresponds to the orchestration Lambda. Its goal is to list the papers stored in Amazon S3 and run the processing by invoking the docling-lambda function, which contains the Docker image with Docling.
In this case, the work is split per document: each PDF file is sent individually, through a synchronous invocation, to the Lambda function responsible for converting the document into Markdown.
In the repository, you will find an implementation similar to the following:

import boto3
import json
from botocore.config import Config

s3 = boto3.client("s3")

lambda_client = boto3.client(
   "lambda",
   config=Config(
       read_timeout=900,
       connect_timeout=10,
       retries={"max_attempts": 0}
   )
)

BUCKET = "docling-papers-tutorial"
DOCLING_LAMBDA = "docling-lambda"

def lambda_handler(event, context):

   # List PDFs
   response = s3.list_objects_v2(
       Bucket=BUCKET
   )

   pdfs = [
       obj["Key"]
       for obj in response.get("Contents", [])
       if obj["Key"].endswith(".pdf")
   ]

   print(f"PDFs found: {len(pdfs)}")

   results = []

   for pdf_key in pdfs:

       s3_url = f"s3://{BUCKET}/{pdf_key}"

       print(f"Processing: {s3_url}")

       response = lambda_client.invoke(
           FunctionName=DOCLING_LAMBDA,
           InvocationType="RequestResponse",
           Payload=json.dumps({"s3_url": s3_url})
       )

       result = json.loads(response["Payload"].read())

       results.append({
           "input": s3_url,
           "output": result.get("output"),
           "status": result.get("status")
       })

       print(f"{pdf_key} -> {result.get('output')}")

   return {
       "processed": len(results),
       "results": results
   }


Once the function is deployed, we can execute the Lambda function and analyze the results from CloudWatch.
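
Two details of the orchestrator are worth keeping in mind once the bucket grows: a single list_objects_v2 call returns at most 1,000 keys, and a synchronous invoke reports a failure inside the target handler through a FunctionError field in the response instead of raising an exception. The following is a minimal sketch of how both could be handled. The helper names list_pdf_keys and invoke_docling are hypothetical (they are not taken from the repository), and the clients are passed in as arguments, for example boto3.client("s3") and boto3.client("lambda").

```python
import json


def list_pdf_keys(s3_client, bucket, prefix=""):
    """Return every .pdf key under the prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Case-insensitive match; the original code matches ".pdf" exactly.
            if obj["Key"].lower().endswith(".pdf"):
                keys.append(obj["Key"])
    return keys


def invoke_docling(lambda_client, function_name, s3_url):
    """Invoke the conversion Lambda synchronously and surface handler errors."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps({"s3_url": s3_url}),
    )
    payload = json.loads(response["Payload"].read())
    # When the invoked handler raises, Lambda sets FunctionError on the
    # response and the payload carries the error details.
    if response.get("FunctionError"):
        return {"input": s3_url, "status": "error", "output": None, "error": payload}
    return {
        "input": s3_url,
        "status": payload.get("status"),
        "output": payload.get("output"),
    }
```

With helpers like these, the loop in the handler can record a failed document and continue with the next one instead of silently storing an empty output.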


In my tests with these 10 papers, the average was approximately 3.8 seconds per page. This can vary significantly depending on document complexity.

This confirms something important: the processing time depends much more on the complexity of the content, such as tables, images, multiple columns, or figures, than on the file size or the number of pages.
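
To make the 3.8 seconds per page figure more concrete, here is a rough back-of-the-envelope estimate. It assumes the average holds for a given document, which, as noted above, will often not be the case. It compares the estimate with the 900 second value used as read_timeout in the orchestrator, which is also the maximum timeout of a Lambda function. The function names and the 0.8 safety margin are illustrative choices.

```python
SECONDS_PER_PAGE = 3.8   # average measured in the tests described above
LAMBDA_TIMEOUT_S = 900   # maximum Lambda timeout (15 minutes)


def estimate_seconds(pages, seconds_per_page=SECONDS_PER_PAGE):
    """Estimated conversion time for a document with the given page count."""
    return pages * seconds_per_page


def fits_in_one_invocation(pages, margin=0.8):
    """True if the estimate stays under a fraction (margin) of the timeout."""
    return estimate_seconds(pages) <= LAMBDA_TIMEOUT_S * margin


if __name__ == "__main__":
    for pages in (10, 50, 200):
        print(pages, round(estimate_seconds(pages), 1), fits_in_one_invocation(pages))
```

With these numbers, a 200 page document is estimated at about 760 seconds, which is above the 720 second budget left by the margin, so very long papers may need to be split before conversion.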


9. 🟩 Amazon Bedrock Knowledge Base

Now we are going to build our knowledge base, but first it is important to understand what a Knowledge Base is inside Amazon Bedrock and why it is key in this type of RAG architecture.

•••••

What is a Knowledge Base?

In simple terms, a Knowledge Base is a layer that connects private data with artificial intelligence models, so they can use that information as context to answer questions.

•••••

What is a Knowledge Base in Amazon Bedrock?

In Amazon Bedrock, a Knowledge Base is a fully managed service that allows you to build RAG systems over your own data.
This means that models can query information stored in a knowledge base to generate more accurate and contextualized answers based on private data.
The following image summarizes the key benefits of using Amazon Bedrock Knowledge Bases in this type of architecture.

Also, it includes capabilities such as:

  • 〰️ automatic embedding management
  • 〰️ context management
  • 〰️ source attribution in the answers
  • 〰️ direct integration with private data
•••••

Supported data sources

A Knowledge Base in Amazon Bedrock can connect to different data sources:

  • 〰️ 🪣 Amazon S3
  • 〰️ 🟦 Confluence (depending on availability)
  • 〰️ ☁️ Salesforce (depending on availability)
  • 〰️ 📑 Custom data sources
  • 〰️ 🕸️ Web Crawler

Availability can depend on the AWS Region and account configuration.

In this use case, we are mainly going to work with Amazon S3, where we store the documents processed with Docling.

•••••

Chunking: how information is divided

One of the most important concepts when building a Knowledge Base is chunking, which is the process of dividing documents into smaller parts called chunks.
This is necessary because models have context limitations and cannot process very long documents all at once.

We can understand this from two perspectives:

  • 〰️ Context limit: models can only handle a limited number of tokens
  • 〰️ Efficient search: dividing the content allows the system to retrieve more precise information faster

In this project, chunking is key because we are working with scientific papers, where the context between sections is very important, for example: results, methods, and adverse effects.
The following image describes the main chunking features available in Amazon Bedrock Knowledge Bases.

•••••

Step by step configuration

Creating the Knowledge Base in Amazon Bedrock

In this section, we are going to configure the Knowledge Base in Amazon Bedrock using the papers processed with Docling and stored in Amazon S3 in Markdown format.
The chunking strategy selected for this use case is Hierarchical Chunking, because it allows us to keep the relationship between document sections, for example results, methods, or adverse effects. This is key when working with scientific papers.

Below, I explain why I did not choose the other strategies and what each one implies:

  • 〰️ ❌ Default: uses the default chunking configuration, which may split content without preserving the full document structure.
  • 〰️ ❌ Fixed size: similar to the default strategy, but configurable. It still has the same problem of losing context.
  • 〰️ ❌ Semantic: groups content by semantic similarity. It can be useful, but it may add extra processing time and can be less predictable depending on the documents.
  • 〰️ ❌ No chunking: useful when documents are already small or manually preprocessed into meaningful units.
  • 〰️ ✅ Hierarchical: keeps a parent-child structure, allowing each small chunk to preserve its context inside the larger section of the document it belongs to.
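
To illustrate the parent-child idea, here is a toy sketch in plain Python. It is not how Amazon Bedrock implements hierarchical chunking (that is a managed, token-based process); it only shows the concept on a Markdown document. Headings define the parent sections, words stand in for tokens, and the function name hierarchical_chunks and the child_words parameter are made up for this example.

```python
import re


def hierarchical_chunks(markdown, child_words=50):
    """Split Markdown into parent sections and smaller child chunks.

    Each child records the heading of the parent section it came from.
    Words stand in for tokens here, purely to keep the example simple.
    """
    parents = []
    heading, lines = "(preamble)", []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            # A new heading closes the previous section, if it had any text.
            if lines:
                parents.append((heading, " ".join(lines)))
            heading, lines = match.group(1).strip(), []
        elif line.strip():
            lines.append(line.strip())
    if lines:
        parents.append((heading, " ".join(lines)))

    chunks = []
    for parent_id, (title, text) in enumerate(parents):
        words = text.split()
        for start in range(0, len(words), child_words):
            chunks.append({
                "parent_id": parent_id,
                "parent_heading": title,
                "text": " ".join(words[start:start + child_words]),
            })
    return parents, chunks
```

A search can then match on the small child chunks while the answer is built from the full parent section, which is why a sentence from "Adverse effects" does not lose the section it belongs to.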
•••••

Prerequisites and permissions

To create a Knowledge Base, you need to consider the following permissions:

  • 〰️ IAM: create or select roles with the right permissions
  • 〰️ Bedrock: access to Knowledge Bases and embedding models
  • 〰️ S3: access to the bucket where the processed documents are stored
  • 〰️ KMS: optional, for data encryption
  • 〰️ Lambda: optional, for custom data transformations

⚠️ AWS does not support creating a Knowledge Base using root user credentials; you need an IAM user or role with the right permissions. Permission configuration is usually one of the most delicate parts of this type of architecture.
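
As a rough illustration of what the service role involves, the sketch below builds two policy documents as Python dictionaries: a trust policy that lets the Bedrock service assume the role, and a permissions policy for reading the source bucket and invoking the embedding model. This is a partial sketch under assumptions: the function names are hypothetical, the region is a parameter you must supply, and the permissions required by the vector store (and by KMS, if used) are intentionally left out.

```python
import json


def kb_trust_policy():
    """Trust policy that lets the Bedrock service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def kb_permissions_policy(bucket, region,
                          embedding_model_id="amazon.titan-embed-text-v2:0"):
    """Read access to the source bucket plus access to the embedding model."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListSourceBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "ReadSourceObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "InvokeEmbeddingModel",
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": f"arn:aws:bedrock:{region}::foundation-model/{embedding_model_id}",
            },
        ],
    }


if __name__ == "__main__":
    # "us-east-1" is only an example region.
    print(json.dumps(kb_permissions_policy("docling-papers-tutorial", "us-east-1"), indent=2))
```

When the role is created from the console, equivalent policies are generated for you; writing them out is mostly useful when the setup moves to Infrastructure as Code.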

_______________

Step 1: Create the Knowledge Base

In Amazon Bedrock, go to the Knowledge Bases section and select Create knowledge base with vector store.
Complete the configuration:

  • Name: docling-glp1-papers-kb
  • Description: Knowledge base with GLP-1 papers processed with Docling and Lambda
  • IAM Role: AmazonBedrockExecutionRoleForKnowledgeBase-docling
  • Data source: Amazon S3

_______________

Step 2: Configure the data source

Configure the data source with the following values:

  • 〰️ Source name: docling-glp1-papers-ds
  • 〰️ S3 path: s3://docling-papers-tutorial/output/
  • 〰️ Parsing strategy: Amazon Bedrock default parser
  • 〰️ Chunking strategy: Hierarchical chunking
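
The same data source settings can also be expressed programmatically. The sketch below only builds the request for the create_data_source operation of the boto3 bedrock-agent client. The helper name build_data_source_request is hypothetical, the Knowledge Base ID must come from your own account, and the token sizes (1500 for parent chunks, 300 for child chunks, 60 of overlap) are assumed values to adjust, not settings taken from this tutorial.

```python
def build_data_source_request(kb_id, name, bucket, prefix,
                              parent_tokens=1500, child_tokens=300,
                              overlap_tokens=60):
    """Build keyword arguments for bedrock-agent create_data_source."""
    return {
        "knowledgeBaseId": kb_id,
        "name": name,
        "dataSourceConfiguration": {
            "type": "S3",
            "s3Configuration": {
                "bucketArn": f"arn:aws:s3:::{bucket}",
                "inclusionPrefixes": [prefix],
            },
        },
        "vectorIngestionConfiguration": {
            "chunkingConfiguration": {
                "chunkingStrategy": "HIERARCHICAL",
                "hierarchicalChunkingConfiguration": {
                    # First level is the parent chunk, second level the child.
                    "levelConfigurations": [
                        {"maxTokens": parent_tokens},
                        {"maxTokens": child_tokens},
                    ],
                    "overlapTokens": overlap_tokens,
                },
            }
        },
    }


if __name__ == "__main__":
    request = build_data_source_request(
        kb_id="YOUR_KB_ID",  # placeholder, not a real ID
        name="docling-glp1-papers-ds",
        bucket="docling-papers-tutorial",
        prefix="output/",
    )
    print(request["vectorIngestionConfiguration"])
    # To create the data source for real (requires credentials):
    # import boto3
    # boto3.client("bedrock-agent").create_data_source(**request)
```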

_______________

Step 3: Vector store and embeddings

Here, we are going to select the model that we will use to create the RAG system and the destination where the information will be stored.

  • 〰️ Embeddings model: Titan Text Embeddings v2
  • 〰️ Vector store: Amazon S3 Vectors

In this case, we use on demand mode, although other models are available depending on the use case.
After that, we select the Amazon S3 bucket used by S3 Vectors to store the vector index.
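
The Knowledge Base calls the embeddings model for you during synchronization, so the following is only meant to show what happens under the hood. It is a minimal sketch that requests one embedding from Titan Text Embeddings v2 through the invoke_model operation and compares two vectors with cosine similarity. The helper names are hypothetical and the client is passed in as an argument, for example boto3.client("bedrock-runtime").

```python
import json
import math

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"


def embed_text(runtime_client, text, dimensions=1024):
    """Return the embedding vector for one text using Titan Text Embeddings v2."""
    response = runtime_client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({
            "inputText": text,
            "dimensions": dimensions,
            "normalize": True,
        }),
    )
    return json.loads(response["body"].read())["embedding"]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A question and a chunk that talk about the same topic should produce vectors with a higher cosine similarity than unrelated texts, which is the property the vector store relies on at query time.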

To better understand how this type of storage works, you can check a previous article I wrote:

_______________

Step 4: Data synchronization

Once the Knowledge Base is created, its initial status will be Available.
To load the documents, you need to run a manual synchronization:

  • 〰️ Go to the Knowledge Base
  • 〰️ Select the data source
  • 〰️ Click on Sync

This synchronization processes the documents, generates the required embeddings, and makes the content available for natural language queries.
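
If you prefer to trigger the synchronization from code instead of the console, it corresponds to an ingestion job. Below is a minimal sketch that starts a job and polls until it finishes, using the start_ingestion_job and get_ingestion_job operations of the boto3 bedrock-agent client. The function name sync_data_source, the polling interval, and the maximum number of polls are assumptions for this example.

```python
import time

TERMINAL_STATUSES = ("COMPLETE", "FAILED", "STOPPED")


def sync_data_source(agent_client, kb_id, data_source_id,
                     poll_seconds=10, max_polls=60):
    """Start an ingestion job and return its last observed status."""
    job = agent_client.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]
    job_id = job["ingestionJobId"]
    status = job["status"]

    for _ in range(max_polls):
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
        status = agent_client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=job_id,
        )["ingestionJob"]["status"]
    return status
```

A call such as sync_data_source(boto3.client("bedrock-agent"), kb_id, data_source_id) would return "COMPLETE" when the documents are indexed; any other value means the job failed, was stopped, or was still running when polling ended.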


10. 🟩 BEDROCK: Test the Knowledge Base

Now we return to the point where we left off a few steps ago: testing our Knowledge Base with the processed papers.

The idea in this stage is to validate whether the system can retrieve relevant information from the 10 scientific papers that we previously loaded and processed.

To do this, we are going to ask some questions focused on clinical analysis and study comparison:

  • 〰️ What gastrointestinal adverse effects were reported in semaglutide clinical trials and what were the incidence rates?

  • 〰️ What were the cardiovascular outcomes reported in semaglutide clinical trials and which patient populations benefited most?

  • 〰️ How does semaglutide compare to liraglutide and tirzepatide in terms of weight loss efficacy and adverse effects across the clinical trials?

These queries allow us to evaluate how the system retrieves specific information across different studies, especially in scenarios where the results are distributed across multiple documents.
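
The same questions can be sent outside the console. The sketch below uses the retrieve operation of the boto3 bedrock-agent-runtime client, which returns the matching chunks without generating an answer, so it is a convenient way to inspect what the system actually retrieves. The function name retrieve_passages and the default of 5 results are assumptions; the Knowledge Base ID comes from your own account.

```python
def retrieve_passages(runtime_client, kb_id, question, top_k=5):
    """Return the top matching chunks with their score and source document."""
    response = runtime_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    passages = []
    for result in response.get("retrievalResults", []):
        passages.append({
            "text": result["content"]["text"],
            "score": result.get("score"),
            # For S3 data sources the location holds the object URI.
            "source": result.get("location", {}).get("s3Location", {}).get("uri"),
        })
    return passages
```

Checking the source field for a comparative question, such as the one about semaglutide, liraglutide, and tirzepatide, shows whether the retrieved chunks really come from several papers or from a single one.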


11. 🎯 Conclusions

This MVP shows that it is possible to build a queryable knowledge base over private scientific documents using AWS serverless services together with open source tools like Docling.

What I learned while building this system:

  • 〰️ The chunking strategy matters more than it may seem. In the case of scientific papers, Hierarchical Chunking preserves the context between sections such as Results or Adverse Effects better than fixed-size, token-based strategies.
  • 〰️ Docling can help reduce the cost and complexity of preprocessing when working with complex PDFs, especially those with tables, columns, and non-linear structures. It allows us to convert these documents into structured information ready to be used in AI systems.
  • 〰️ Embeddings are not the same as security. Even though we work with vector representations, research has shown that in some scenarios it is possible to infer or reconstruct sensitive information from embedding vectors. Because of this, treating vector stores as sensitive data and applying access controls and encryption is a good practice in real scenarios.

If we take this to a production environment, three pieces become fundamental:

  • 〰️ CI/CD pipelines are necessary to automate processing and system updates as improvements are added.
  • 〰️ Infrastructure as Code with Terraform, or similar tools, is key to replicate, scale, and maintain the environment consistently across different stages.
  • 〰️ Any solution that is deployed, especially one that uses AI models, should include observability systems to detect and solve problems in production.

In terms of impact, this type of solution opens a very relevant space in industries such as healthcare and research, where controlled access to large volumes of knowledge can significantly accelerate scientific analysis and decision making.

Finally, beyond the tools used, the most interesting part of this architecture is how it combines different cloud services and generative AI capabilities to solve a very concrete problem: converting unstructured information into accessible, private, and queryable knowledge using natural language.



12. 📚 Technical references

  1. Amazon Web Services. (n.d.). AWS Lambda Developer Guide. AWS Documentation. Retrieved May 26, 2026, from https://docs.aws.amazon.com/es_es/lambda/latest/dg/welcome.html

  2. Amazon Web Services. (n.d.). Create a Lambda function using a container image. AWS Documentation. Retrieved May 26, 2026, from https://docs.aws.amazon.com/lambda/latest/dg/images-create.html

  3. Amazon Web Services. (n.d.). Amazon Bedrock Knowledge Bases. Retrieved May 26, 2026, from https://aws.amazon.com/es/bedrock/knowledge-bases/

  4. IBM. (n.d.). Docling. Retrieved May 26, 2026, from https://www.docling.ai/

  5. Docling Project. (n.d.). SmolDocling 256M preview. Hugging Face. Retrieved May 26, 2026, from https://huggingface.co/docling-project/SmolDocling-256M-preview

  6. University of Utah Health. (2026). GLP 1 FAQs answered by weight loss experts. Retrieved from https://healthcare.utah.edu/healthfeed/2026/03/preguntas-frecuentes-sobre-el-glp-1-respondidas-por-expertos-en-perdida-de-peso


13. 📄 Research papers used in the use case

  1. Han, S. H., Safeek, R., Ockerman, K., Trieu, N., Mars, P., Klenke, A., Furnas, H., & Sorice Virk, S. (2023). Public interest in the off label use of glucagon like peptide 1 agonists (Ozempic) for cosmetic weight loss: A Google Trends analysis. Aesthetic Surgery Journal. https://doi.org/10.1093/asj/sjad211

  2. Ryan, N., & Savulescu, J. (2026). The ethics of Ozempic and Wegovy. Journal of Medical Ethics, 52(3), 185–193. https://doi.org/10.1136/jme-2024-110374

  3. Mailhac, A., Pedersen, L., Pottegård, A., Søndergaard, J., Mogensen, T., Sørensen, H. T., & Thomsen, R. W. (2024). Semaglutide (Ozempic®) use in Denmark 2018 through 2023: User trends and off label prescribing for weight loss. Clinical Epidemiology. https://doi.org/10.2147/CLEP.S456170

  4. Manoharan, S. V. R. R., & Madan, R. (2024). GLP 1 agonists can affect mood: A case of worsened depression on Ozempic (Semaglutide). Case Reports in Psychiatry. https://pmc.ncbi.nlm.nih.gov/articles/PMC11208009/

  5. Humphrey, C. D., & Lawrence, A. C. (2023). Implications of Ozempic and other semaglutide medications for facial plastic surgeons. Facial Plastic Surgery. https://doi.org/10.1055/a-2148-6321

  6. Pillarisetti, L., & Agrawal, D. K. (2025). Semaglutide: Double edged sword with risks and benefits. Archives of Internal Medicine Research, 8(1), 1–13. https://doi.org/10.26502/aimr.0189

  7. Fong, S., Carollo, A., Lazuras, L., Corazza, O., & Esposito, G. (2024). Ozempic (Glucagon like peptide 1 receptor agonist) in social media posts: Unveiling user perspectives through Reddit topic modeling. Dialogues in Health. https://www.sciencedirect.com/science/article/pii/S2667118224000163

  8. Carboni, A., Woessner, S., Martini, O., Marroquin, N. A., & Waller, J. (2024). Natural weight loss or “Ozempic Face”: Demystifying a social media phenomenon. Journal of Drugs in Dermatology, 23(1). https://doi.org/10.36849/JDD.7613

  9. Grech, V. S., Lotsaris, K., Grech, I., & Kefala, V. (2024). Semaglutide (Ozempic) and obesity: A comprehensive guide for aestheticians. Review of Clinical Pharmacology and Pharmacokinetics, 38(Suppl. 1), 31–35. https://www.researchgate.net/publication/378300594_Semaglutide_Ozempic_and_obesity_A_comprehensive_guide_for_aestheticians

  10. Vambe, S. D., Zulu, W., Hough, E., & Luvhimbi, M. J. (2024). Semaglutide (Ozempic®): A comprehensive review of its pharmacology, efficacy, and safety profile in type 2 diabetes mellitus and weight management. SA Pharmaceutical Journal, 91(6), 31–34. https://www.researchgate.net/publication/388790459_Semaglutide_Ozempic_R_a_comprehensive_review_of_its_pharmacology_efficacy_and_safety_profile_in_type_2_diabetes_mellitus_and_weight_management