AI Data Pipeline Automation with AIDB
christina.ma · 2026-05-18 · via EDB

AIDB is EDB's Postgres extension that automates the entire AI data preparation pipeline — chunking, embedding, and vector indexing — triggered live inside the database the moment new data arrives. To show it in action, I used an investigation story PDF: a real document full of players, events, and relationships that needed to become a queryable knowledge base without any manual data wrangling.
 

What AIDB automates — and why it matters

Every AI pipeline that works with unstructured documents has the same hidden cost: data preparation. Before a single query can be answered, raw text must be cleaned, split into model-friendly chunks, converted into vector embeddings, and indexed for retrieval. In a conventional setup, each of those steps is a separate job — something you write, schedule, monitor, and re-run every time source data changes.

AIDB takes that work off your hands. You declare a preparer and a knowledge base as live database objects, and from that moment on the database owns data preparation. Every INSERT into a source table automatically triggers chunking, embedding, and indexing, with no external scheduler, no ETL script, and no scheduled job whose delay or failure could leave the knowledge base out of sync.
 

The core principle: data preparation is not a job you run on a schedule — it is a behaviour the database exhibits automatically on every INSERT. AIDB's Live mode makes this real.


To put this to the test, I took an investigation story PDF — a dense, character-rich document full of named players, a timeline of events, and a web of relationships — and used AIDB to turn it into a fully automated, queryable RAG knowledge base. Here is the exact pipeline, in SQL.
 

The automated pipeline at a glance

PDF file → source_documents → target_preparer (auto) → RAG_KB (auto) → Query ready


Only the first step — loading parsed PDF content into source_documents — involves any manual action. Everything beyond that is owned by AIDB. The target_preparer chunks each passage the moment it arrives. The RAG_KB embeds each chunk immediately after. The investigation story is query-ready before you close your SQL client.
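To make the chained behaviour concrete, here is a minimal in-memory sketch of the flow above. It is not how AIDB is implemented: the LiveTable class, the sentence-based chunker, and the two-number "embedding" are all hypothetical stand-ins, chosen only to show that one insert at the source propagates through both stages without any further call.

```python
from typing import Callable, Dict, List


class LiveTable:
    """Hypothetical in-memory stand-in for a table that reacts to every insert."""

    def __init__(self) -> None:
        self.rows: Dict[str, object] = {}
        self._on_insert: List[Callable[[str, object], None]] = []

    def subscribe(self, callback: Callable[[str, object], None]) -> None:
        self._on_insert.append(callback)

    def insert(self, key: str, value: object) -> None:
        self.rows[key] = value
        # Fire every subscribed stage as part of the insert itself.
        for callback in self._on_insert:
            callback(key, value)


source_documents = LiveTable()
prepared_document = LiveTable()
rag_kb_vector = LiveTable()


def chunk_on_insert(key: str, text: object) -> None:
    # Placeholder chunker (assumption): one chunk per sentence.
    sentences = [s.strip() for s in str(text).split(".") if s.strip()]
    for part_id, sentence in enumerate(sentences):
        prepared_document.insert(f"{key}.part.{part_id}", sentence)


def embed_on_insert(key: str, chunk: object) -> None:
    # Placeholder "embedding" (assumption): character count and word count.
    words = str(chunk).split()
    rag_kb_vector.insert(key, [float(len(str(chunk))), float(len(words))])


source_documents.subscribe(chunk_on_insert)
prepared_document.subscribe(embed_on_insert)

# One insert at the source; both downstream stages run on their own.
source_documents.insert("story.part.0", "The ledger went missing. The courier left town.")
print(sorted(rag_kb_vector.rows))
```

The only explicit call is the final insert; everything downstream is a reaction to it, which is the property Live mode provides inside the database.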
 

Step 1 — load the PDF content into source_documents

The investigation story PDF is parsed externally — each passage, section, or paragraph extracted as a discrete piece of text. Those pieces land as rows in the source table. The table is intentionally simple: just an ID, a part number, a generated unique key, and the raw text.

The generated unique_id column — combining document ID and part number — ensures every passage is traceable all the way through chunking and embedding. When you query RAG_KB later and get a result back, you can trace it directly to the original story passage.

DROP TABLE IF EXISTS source_documents;

CREATE TABLE source_documents (
  id        TEXT,
  part_id   INTEGER NOT NULL,
  unique_id TEXT NOT NULL GENERATED ALWAYS AS
            ((id || '.part.') || part_id) STORED,
  result    TEXT,
  CONSTRAINT source_documents_pkey PRIMARY KEY (unique_id)
);

CREATE INDEX source_documents_id_idx
  ON source_documents (id);

-- after inserting the parsed passages, confirm they are loaded
SELECT * FROM source_documents;

In the investigation story use case, each row represents one passage — a player profile, a scene description, a timeline entry, or a relationship note. The richer and more granular the parsing, the more precise the retrieval. Once the rows are inserted, AIDB automation takes over immediately.
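The external parsing step is not shown in this post. As a rough sketch of what it might hand to the database, the snippet below splits already-extracted text into passages and builds one row per passage. Splitting on blank lines, the document ID "story", and the sample text are assumptions made for illustration; the unique_id property simply mirrors the generated column defined above.

```python
from typing import List, NamedTuple


class PassageRow(NamedTuple):
    id: str        # document identifier
    part_id: int   # position of the passage within the document
    result: str    # raw passage text

    @property
    def unique_id(self) -> str:
        # Mirrors the generated column: (id || '.part.') || part_id
        return f"{self.id}.part.{self.part_id}"


def split_into_passages(doc_id: str, parsed_text: str) -> List[PassageRow]:
    """Split already-parsed text on blank lines into one row per passage."""
    passages = [p.strip() for p in parsed_text.split("\n\n") if p.strip()]
    return [PassageRow(doc_id, i, text) for i, text in enumerate(passages)]


# Invented placeholder text, not taken from the actual investigation story.
sample = "Alice Moreno is the lead auditor.\n\nOn 3 March the ledger went missing."
rows = split_into_passages("story", sample)
for row in rows:
    print(row.unique_id, "->", row.result)
```

Each tuple maps onto an INSERT of (id, part_id, result); unique_id is left to the database because the column is GENERATED ALWAYS.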

Step 2 — automate chunking with target_preparer

Story passages vary in length and structure. AIDB's target_preparer applies the ChunkText operation to break each passage into consistently sized, embedding-ready chunks, writing them automatically to the prepared_document table. This is the first stage of automated data preparation: configure it once and it keeps running.

Calling aidb.set_auto_preparer with 'Live' activates the automation. Every subsequent INSERT into source_documents is chunked instantly — no cron job, no Celery worker, no polling loop.

-- idempotent: safe to re-run
SELECT aidb.delete_preparer('target_preparer');

SELECT aidb.create_table_preparer(
  name                    => 'target_preparer',
  operation               => 'ChunkText',
  source_table            => 'source_documents',
  source_data_column      => 'result',
  destination_table       => 'prepared_document',
  destination_data_column => 'chunks',
  source_key_column       => 'unique_id',
  destination_key_column  => 'id',
  options => '{"desired_length": 100}'::JSONB
);

-- Live mode: chunking fires automatically on every INSERT
SELECT aidb.set_auto_preparer('target_preparer', 'Live');

A desired_length of 100 suits narrative content well: specific enough to isolate a player or event, broad enough to carry surrounding context. For the investigation story, this means each chunk typically covers a single character trait, a single scene, or a single relationship link.
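To get a feel for how a target length shapes the output, here is a small greedy chunker. It is not the ChunkText algorithm, which this post does not describe; it is a simplified sketch that packs whole words into chunks and measures length in characters purely as an assumption. The function name and sample passage are hypothetical.

```python
from typing import List


def chunk_text(text: str, desired_length: int) -> List[str]:
    """Greedily pack whole words into chunks of at most desired_length characters.

    A single word longer than desired_length becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= desired_length or not current:
            current = candidate
        else:
            # The next word would overflow: close this chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


passage = "The auditor met the courier twice in March before the ledger went missing."
for i, chunk in enumerate(chunk_text(passage, desired_length=30)):
    print(i, repr(chunk))
```

Lowering desired_length yields more, narrower chunks; raising it keeps more surrounding context in each one. That is the same trade-off the options value controls in the preparer.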

Step 3 — automate embedding with RAG_KB

With chunks flowing automatically from the preparer, RAG_KB handles the second stage: embedding. AIDB converts each chunk into a BERT vector and stores it in the backing vector table. Set to Live mode, this fires the moment a new chunk is written — completing the fully automated data preparation chain.

No manual embedding run. No batch job. The entire investigation story — every player, every event, every relationship — is encoded as semantic vectors inside Postgres, continuously and automatically.

-- idempotent: safe to re-run
SELECT aidb.delete_knowledge_base('RAG_KB');

SELECT aidb.create_table_knowledge_base(
  name               => 'RAG_KB',
  model_name         => 'bert',
  source_table       => 'prepared_document',
  source_data_column => 'chunks',
  source_data_format => 'Text',
  source_key_column  => 'unique_id'
);

-- Live mode: embedding fires automatically on every new chunk
SELECT aidb.set_auto_knowledge_base('RAG_KB', 'Live');

-- verify vectors are populated
SELECT * FROM RAG_KB_vector LIMIT 10;

The pipeline is now fully live. Insert a new passage from the investigation story into source_documents — new testimony, a newly discovered document, an updated player profile — and it flows through target_preparer and into RAG_KB automatically. No further action needed.

The result: querying the story like a database

With the automated pipeline live, RAG_KB can be queried using natural language — passed directly to AIDB's semantic retrieval function. Ask about a player by name, a type of event, a location, or a relationship between parties. The retrieval is similarity-based, not keyword-based: even if the query words differ from the original text, AIDB finds and ranks the most relevant passages.

This is what AIDB's automated data preparation makes possible. The investigation story stops being a static PDF and becomes a structured, searchable intelligence layer — one that stays current automatically as new material is added.
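The retrieval call itself is not shown in this post, so the sketch below illustrates only the idea behind similarity-based ranking: score every stored vector against the query vector and return the closest keys. The three-dimensional vectors, the keys, and the top_k helper are invented for illustration. Real embeddings have hundreds of dimensions, and the distance metric AIDB uses may differ from the cosine similarity assumed here.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float],
          index: Dict[str, List[float]],
          k: int) -> List[Tuple[str, float]]:
    """Return the k keys whose vectors are most similar to query_vec."""
    scored = [(key, cosine_similarity(query_vec, vec)) for key, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy index keyed the way chunk keys might look (assumption).
toy_index = {
    "story.part.0.part.0": [0.9, 0.1, 0.0],
    "story.part.1.part.0": [0.1, 0.8, 0.1],
    "story.part.2.part.0": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.0]
for key, score in top_k(query_vector, toy_index, k=2):
    print(f"{key}  similarity={score:.3f}")
```

Because ranking depends on vector direction rather than shared words, a query phrased differently from the source text can still land on the right passage, and the returned key points back to the row it came from.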
 

Summary: what AIDB automates end to end

The investigation story example demonstrates a general capability. Any document (a PDF, report, case file, research paper, or HTML page) can be parsed into text, loaded into source_documents, and immediately become part of a live, automated RAG pipeline. AIDB handles the rest:

Chunking — target_preparer splits raw text into model-ready chunks automatically on INSERT.

Embedding — RAG_KB converts each chunk to a vector automatically as chunks arrive.

Indexing — vectors are stored and indexed inside Postgres, with no external vector store required.

Freshness — the knowledge base is always in sync with source data. No scheduled re-runs, no drift.

Swap bert for any AIDB-supported model (OpenAI embeddings, a local Ollama model, or a fine-tuned domain model) by changing model_name when you create the knowledge base; the rest of the pipeline stays the same. target_preparer and RAG_KB are model-agnostic by design.