Generating Synthetic Enterprise Datasets for AI Systems
Irvan Gerhana Septiyana · 2026-06-25 · via DEV Community

Part 2 of the Building Enterprise AI Automation Systems Series


Introduction

One of the biggest obstacles in enterprise AI is not choosing a model.

It is finding data.

Most tutorials assume that training data already exists.

Reality is very different.

Large organizations rarely share operational datasets.

Financial transactions contain confidential information.

Contracts contain sensitive agreements.

Invoices reveal commercial relationships.

Bank statements expose customer activity.

For legal, regulatory, and competitive reasons, these datasets almost never become public.

This creates a difficult problem for AI engineers.

How do you build intelligent systems when the data you need cannot be accessed?

The answer is synthetic data.

Unfortunately, most synthetic datasets found online are little more than randomly generated CSV files.

They contain names.

Numbers.

Dates.

But they completely ignore something far more important:

Business relationships.

In this article, we'll explore how to design synthetic enterprise datasets that preserve real business logic and can be used for machine learning, automation, benchmarking, and AI engineering.


Random Data Is Not Synthetic Data

Many developers believe synthetic data simply means generating fake values.

For example:

Customer,Invoice,Amount
John,INV001,500
Alice,INV002,1200
Bob,INV003,900

Technically, this is synthetic.

Practically, it is useless.

Why?

Because enterprise systems are built around relationships.

Invoices belong to contracts.

Contracts belong to customers.

Payments reference invoices.

Purchase orders authorize invoices.

Bank transactions settle invoices.

Without these relationships, there is nothing meaningful to learn.

A machine learning model trained on isolated records learns isolated patterns.

Real enterprise automation requires connected data.


Thinking Like an Enterprise System

Before writing a single line of Python, ask one question:

"How does the business actually operate?"

Imagine a manufacturing company.

A customer signs a contract.

The contract defines:

  • products,
  • payment schedules,
  • milestones,
  • currencies,
  • pricing.

Invoices are generated from the contract.

Purchase orders authorize procurement.

Eventually, a payment appears in a bank statement.

That payment is never independent.

It always belongs to a business process.

Therefore our synthetic dataset must preserve that process.


Designing the Data Model

Rather than generating random tables, begin by designing business entities.

For this project, the core entities were:

Customer
        │
        ▼
Contract
        │
        ▼
Invoice
        │
        ▼
Bank Transaction

This hierarchy reflects real enterprise operations.

Every entity inherits context from its parent.
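
One way to make this hierarchy concrete is to model each entity as a small record that stores its parent's identifier. The sketch below is a minimal illustration, not the project's actual schema: the field names follow the JSON examples in this article, while the `BankTransaction` fields (`transaction_id`, `narrative`) are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: str
    legal_name: str
    country: str
    industry: str


@dataclass(frozen=True)
class Contract:
    contract_id: str
    customer_id: str  # reference to Customer, not a copy of its fields
    billing_schedule: str
    currency: str


@dataclass(frozen=True)
class Invoice:
    invoice_number: str
    contract_id: str  # reference to the parent Contract
    customer_id: str
    amount: float


@dataclass(frozen=True)
class BankTransaction:
    transaction_id: str  # hypothetical field, for illustration only
    invoice_number: str  # link back to the Invoice being paid
    narrative: str
    amount: float


# Build one chain top-down so every child is created from its parent.
customer = Customer("CUS-00002", "ALPHABRIDGE SOLUTIONS", "United States", "Manufacturing")
contract = Contract("CNT-2024-587", customer.customer_id, "Monthly", "EUR")
invoice = Invoice("MFG-INV-000157", contract.contract_id, contract.customer_id, 3979.85)
transaction = BankTransaction(
    "TXN-000001",
    invoice.invoice_number,
    f"PART PMT {customer.legal_name} {invoice.invoice_number}",
    2000.00,
)
```

Because each child is constructed from its parent's identifier, a record cannot exist without the context it inherits.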


Customer Master

The customer master acts as the source of truth.

Example:

{
  "customer_id":"CUS-00002",
  "legal_name":"ALPHABRIDGE SOLUTIONS",
  "country":"United States",
  "industry":"Manufacturing"
}

Customers rarely change.

Everything else references them.


Contract Master

Contracts establish commercial relationships.

Example:

{
  "contract_id":"CNT-2024-587",
  "customer_id":"CUS-00002",
  "billing_schedule":"Monthly",
  "currency":"EUR"
}

Notice that contracts reference customers.

Never duplicate customer information.

Use identifiers.
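
A generator can enforce this rule by copying only the identifier from the customer master. The following is a minimal sketch: the function name `generate_contracts`, the sequential ID format, and the lists of schedules and currencies are assumptions for illustration.

```python
import random


def generate_contracts(customer_ids, per_customer=2, seed=42):
    """Create contracts that point at customers by ID only."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    contracts = []
    for customer_id in customer_ids:
        for _ in range(per_customer):
            contracts.append({
                "contract_id": f"CNT-2024-{len(contracts) + 1:03d}",
                "customer_id": customer_id,  # identifier, no name or country
                "billing_schedule": rng.choice(["Monthly", "Quarterly", "Milestone"]),
                "currency": rng.choice(["EUR", "USD"]),
            })
    return contracts


contracts = generate_contracts(["CUS-00001", "CUS-00002"])
```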


Invoice Master

Invoices inherit context from contracts.

{
  "invoice_number":"MFG-INV-000157",
  "contract_id":"CNT-2024-587",
  "customer_id":"CUS-00002",
  "amount":3979.85
}

Again, relationships matter more than values.
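
Inheritance of context can be sketched as a function that takes a contract and copies its identifiers into each invoice it produces. The name `generate_invoices`, the amount range, and the numbering scheme are assumptions for illustration, not the generator used for this project.

```python
import random


def generate_invoices(contract, count, start_seq=1, seed=7):
    """Create invoices that inherit contract_id and customer_id from a contract."""
    rng = random.Random(seed)
    invoices = []
    for offset in range(count):
        invoices.append({
            "invoice_number": f"MFG-INV-{start_seq + offset:06d}",
            "contract_id": contract["contract_id"],  # inherited from parent
            "customer_id": contract["customer_id"],  # inherited from parent
            "amount": round(rng.uniform(500, 5000), 2),  # assumed range
        })
    return invoices


contract = {
    "contract_id": "CNT-2024-587",
    "customer_id": "CUS-00002",
    "billing_schedule": "Monthly",
    "currency": "EUR",
}
invoices = generate_invoices(contract, count=3, start_seq=157)
```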


Bank Statements

Only after customers, contracts, and invoices exist should transactions be generated.

Example narrative:

PART PMT ALPHABRIDGE SOLUTIONS MFG-INV-000157

Notice that the narrative references existing business entities.

This is the difference between realistic synthetic data and random text generation.
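
A narrative like the one above can be assembled from records that already exist, rather than from random text. This is a minimal sketch: `build_transaction` is a hypothetical helper, and the plain `PMT` prefix for full payments is an assumption (only `PART PMT` appears in the example above).

```python
def build_transaction(invoice, customer, paid_amount):
    """Build a bank transaction whose narrative references existing entities."""
    is_partial = paid_amount < invoice["amount"]
    prefix = "PART PMT" if is_partial else "PMT"  # "PMT" is an assumed convention
    return {
        "narrative": f"{prefix} {customer['legal_name']} {invoice['invoice_number']}",
        "amount": paid_amount,
        "is_partial": is_partial,
    }


customer = {"customer_id": "CUS-00002", "legal_name": "ALPHABRIDGE SOLUTIONS"}
invoice = {
    "invoice_number": "MFG-INV-000157",
    "contract_id": "CNT-2024-587",
    "customer_id": "CUS-00002",
    "amount": 3979.85,
}

partial = build_transaction(invoice, customer, paid_amount=2000.00)
full = build_transaction(invoice, customer, paid_amount=3979.85)
```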


Why Relationships Matter

Suppose a bank transaction narrative references:

MFG-INV-000157

That invoice should always resolve to:

Customer
        │
        ▼
Contract
        │
        ▼
Invoice

Otherwise:

  • Entity Resolution cannot be evaluated.
  • Reconciliation cannot be validated.
  • Ground truth disappears.

Synthetic data must preserve referential integrity.
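
Referential integrity can be checked mechanically after generation. The sketch below assumes records are plain dictionaries with the field names used in this article; `find_broken_references` is a hypothetical helper name.

```python
def find_broken_references(customers, contracts, invoices):
    """Return a list of human-readable referential integrity problems."""
    customer_ids = {c["customer_id"] for c in customers}
    contracts_by_id = {c["contract_id"]: c for c in contracts}
    problems = []

    for contract in contracts:
        if contract["customer_id"] not in customer_ids:
            problems.append(
                f"{contract['contract_id']}: unknown customer {contract['customer_id']}"
            )

    for invoice in invoices:
        parent = contracts_by_id.get(invoice["contract_id"])
        if parent is None:
            problems.append(
                f"{invoice['invoice_number']}: unknown contract {invoice['contract_id']}"
            )
        elif parent["customer_id"] != invoice["customer_id"]:
            problems.append(
                f"{invoice['invoice_number']}: customer does not match contract"
            )
    return problems


customers = [{"customer_id": "CUS-00002"}]
contracts = [{"contract_id": "CNT-2024-587", "customer_id": "CUS-00002"}]
invoices = [
    {"invoice_number": "MFG-INV-000157", "contract_id": "CNT-2024-587", "customer_id": "CUS-00002"},
]
orphan = {"invoice_number": "MFG-INV-000999", "contract_id": "CNT-2024-999", "customer_id": "CUS-00002"}

clean_report = find_broken_references(customers, contracts, invoices)
broken_report = find_broken_references(customers, contracts, invoices + [orphan])
```

Running a check like this as the last step of generation catches orphaned records before they reach a training or evaluation set.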


Building Ground Truth

One advantage of synthetic data is complete control.

Every generated transaction already knows:

  • which customer owns it,
  • which contract created it,
  • which invoice it references,
  • whether it is a partial payment,
  • whether reconciliation should succeed.

This hidden knowledge becomes ground truth.

Ground truth enables benchmarking.

Instead of asking:

"Did the model perform well?"

we can ask:

"Did the model recover the correct business relationship?"

This is a much stronger evaluation.
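
As a minimal sketch of this kind of evaluation, the function below compares a model's predicted transaction-to-invoice links against the links recorded at generation time. The transaction IDs and the `link_accuracy` name are hypothetical; a real benchmark would likely report more than one metric.

```python
def link_accuracy(ground_truth, predictions):
    """Fraction of transactions whose predicted invoice matches the ground truth.

    Both arguments map a transaction ID to an invoice number.
    A transaction missing from predictions counts as wrong.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1
        for txn_id, expected in ground_truth.items()
        if predictions.get(txn_id) == expected
    )
    return correct / len(ground_truth)


ground_truth = {
    "TXN-1": "MFG-INV-000157",
    "TXN-2": "MFG-INV-000158",
    "TXN-3": "MFG-INV-000159",
}
predictions = {
    "TXN-1": "MFG-INV-000157",  # correct
    "TXN-2": "MFG-INV-000157",  # wrong invoice
    "TXN-3": "MFG-INV-000159",  # correct
}
score = link_accuracy(ground_truth, predictions)  # 2 of 3 links recovered
```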


Simulating Real-World Noise

Real enterprise data is messy.

Invoice numbers are not always written consistently.

Examples:

INV-001
INV001
INV 001
INVOICE-001

Customer names vary from one record to the next:

ALPHABRIDGE SOLUTIONS
ALPHABRIDGE LTD
ALPHA BRIDGE
ABS

Synthetic datasets should deliberately include this variability.

Otherwise models learn to handle perfectly formatted inputs and fail on realistic ones.

The goal is not to make the dataset clean.

The goal is to make it believable.
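
One way to inject this variability while keeping ground truth is to store the canonical value next to the noisy one. The sketch below reproduces the four invoice formats listed above and draws customer names from a hand-written alias table; the helper names, the alias table, and the assumption that invoice numbers end in `-<digits>` are all illustrative.

```python
import random


def invoice_variants(canonical):
    """Return formatting variants of an invoice number such as 'INV-001'."""
    prefix, _, digits = canonical.rpartition("-")
    long_prefix = prefix.replace("INV", "INVOICE")
    return [
        canonical,                 # INV-001
        f"{prefix}{digits}",       # INV001
        f"{prefix} {digits}",      # INV 001
        f"{long_prefix}-{digits}", # INVOICE-001
    ]


# Hypothetical alias table; real aliases would come from domain knowledge.
CUSTOMER_ALIASES = {
    "ALPHABRIDGE SOLUTIONS": [
        "ALPHABRIDGE SOLUTIONS",
        "ALPHABRIDGE LTD",
        "ALPHA BRIDGE",
        "ABS",
    ],
}


def add_noise(invoice_number, legal_name, rng):
    """Pick noisy surface forms but keep the canonical values as ground truth."""
    return {
        "invoice_text": rng.choice(invoice_variants(invoice_number)),
        "customer_text": rng.choice(CUSTOMER_ALIASES.get(legal_name, [legal_name])),
        "canonical_invoice": invoice_number,
        "canonical_customer": legal_name,
    }


rng = random.Random(13)  # seeded for reproducibility
sample = add_noise("INV-001", "ALPHABRIDGE SOLUTIONS", rng)
```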


Balancing Entity Distribution

Another common mistake is imbalance.

Imagine a dataset containing:

Invoice Labels : 50,000
Contract Labels : 35
Purchase Orders : 40

A transformer trained on this data will learn to recognize invoices far better than contracts or purchase orders.

The issue is not the model.

It is the dataset.

Balanced entity distribution improves learning quality and produces more reliable evaluation metrics.

Synthetic generation should therefore control not only volume, but also diversity.
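
Because the data is generated, imbalance can be corrected by producing more of the rare entity types instead of discarding common ones. The sketch below computes how many additional examples each label needs to reach a chosen target; the label names and the target of 5,000 are assumptions for illustration.

```python
def generation_plan(current_counts, target_per_label):
    """How many more examples to generate per label to reach the target."""
    return {
        label: max(0, target_per_label - count)
        for label, count in current_counts.items()
    }


def imbalance_ratio(counts):
    """Ratio between the most and least frequent label."""
    return max(counts.values()) / min(counts.values())


counts = {"INVOICE": 50_000, "CONTRACT": 35, "PURCHASE_ORDER": 40}
plan = generation_plan(counts, target_per_label=5_000)
# plan == {"INVOICE": 0, "CONTRACT": 4965, "PURCHASE_ORDER": 4960}
```

Matching counts alone does not guarantee diversity, so the extra contract and purchase order examples should also vary in format and context.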


Why Synthetic Data Enables Better AI

Once relationships exist, a single synthetic dataset can support multiple AI tasks.

For example:

Named Entity Recognition

Extract:

  • Customer
  • Invoice
  • Contract
  • Purchase Order

Entity Resolution

Resolve:

ALPHABRIDGE

↓

CUS-00002
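
A simple baseline for this task is a normalized alias lookup, which the synthetic ground truth can then score. This is a sketch of a baseline only, not the resolution method used in this series; the alias table and helper names are hypothetical.

```python
import re


def normalize(name):
    """Uppercase and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


# Hypothetical alias table keyed by customer ID.
ALIASES = {
    "CUS-00002": [
        "ALPHABRIDGE SOLUTIONS",
        "ALPHABRIDGE LTD",
        "ALPHA BRIDGE",
        "ABS",
    ],
}


def build_alias_index(aliases):
    """Map each normalized alias to its customer ID."""
    index = {}
    for customer_id, names in aliases.items():
        for name in names:
            index[normalize(name)] = customer_id
    return index


def resolve(text, index):
    """Return the customer ID for a mention, or None if it is unknown."""
    return index.get(normalize(text))


index = build_alias_index(ALIASES)
resolved = resolve("ALPHABRIDGE", index)  # matches the "ALPHA BRIDGE" alias
unknown = resolve("OMEGA TRADING", index)
```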


Reconciliation

Determine whether a payment correctly settles an invoice.
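
A minimal sketch of that decision compares the sum of payments with the invoice amount. `Decimal` is used here to avoid floating-point rounding on money; the status labels and the `reconcile` name are assumptions for illustration.

```python
from decimal import Decimal


def reconcile(invoice_amount, payments):
    """Classify an invoice by comparing total payments with the amount due."""
    paid = sum(payments, Decimal("0"))
    if paid == invoice_amount:
        return "settled"
    if paid < invoice_amount:
        return "partial"
    return "overpaid"


amount_due = Decimal("3979.85")
status_partial = reconcile(amount_due, [Decimal("2000.00")])
status_settled = reconcile(amount_due, [Decimal("2000.00"), Decimal("1979.85")])
```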


Agentic Workflows

Trigger downstream actions:

  • approve,
  • escalate,
  • notify,
  • reconcile,
  • update ERP.

The same dataset becomes reusable across multiple machine learning tasks.


Lessons Learned

After generating hundreds of thousands of synthetic enterprise transactions, one lesson became obvious.

Volume alone is meaningless.

Relationships matter.

Business logic matters.

Ground truth matters.

If your synthetic dataset behaves like a real business, your AI system learns to solve real business problems.

If your synthetic dataset behaves like random CSV files, your AI system learns randomness.


Conclusion

Synthetic data is not a shortcut.

It is an engineering discipline.

Well-designed synthetic datasets preserve business logic, entity relationships, referential integrity, and realistic variability.

These characteristics make them valuable not only for machine learning but also for benchmarking, software testing, API validation, and enterprise automation.

In the next article, we'll use this synthetic dataset to build a Financial Named Entity Recognition (NER) pipeline capable of understanding enterprise bank transaction narratives and transforming them into structured business knowledge.


Next Article

Part 3 — Building a Financial Named Entity Recognition Pipeline Using Doccano and IndoBERT

We'll cover:

  • Designing a business taxonomy
  • Automatic pre-labeling
  • Annotation guidelines
  • Doccano workflow
  • BIO tagging
  • Fine-tuning IndoBERT
  • Evaluating precision, recall, and F1-score
  • Preparing data for entity resolution