Messy Enterprise Data Is Not a Blocker Anymore
Iteration La · 2026-05-17 · via DEV Community

The New Rule Is Not "Clean Everything First"

The supplier invoices are in email. The signed forms are scans. The onboarding packet has a PDF, two spreadsheets, and a photo of a handwritten note. The customer record in the ERP has an old address, and the contract folder has three versions with slightly different dates.

Many automation projects turn into data cleanup projects right there. Everyone agrees the workflow would help, but the first proposed milestone is "standardize the inputs" or "centralize the documents" or "clean the source system first."

That advice sounds responsible. It is also why useful document workflows sit untouched for months.

Enterprise AI changes the order of operations. The workflow still has to be honest about bad inputs, missing values, and contradictions. But it can start by reading the files where the work already happens, preserving the evidence, and routing uncertainty before anything reaches a downstream system.

The Stanford Digital Economy Lab's 2026 Enterprise AI Playbook backs this shift with evidence.

"Only 6% of implementations had data that was fully ready for AI deployment."

"Now, 91% of our implementations successfully processed unstructured data, including voice transcripts, scanned documents, images, chat logs, and legacy code."

The finding is not permission to ignore data quality. It is a warning against spending the first year cleaning data that the workflow may not even need.

Access Beats Centralization at the Start

Many teams confuse three different states: stored, centralized, and usable.

A document can be stored, or even centralized, and still not be usable. The workflow needs a representation that matches the next action.

| Current state | Why it is not enough | Useful workflow representation |
| --- | --- | --- |
| Scanned invoice in object storage | The file exists, but values and confidence are not available | Typed fields with citations and review state |
| 200-page PDF in SharePoint | The document is accessible, but sections and tables are hard to route | Markdown with headings, tables, and page context |
| Folder of signed forms | The evidence is present, but business fields are not normalized | Extracted fields tied back to source pages |

Centralization can help later, but it is not always the first useful move.

Stanford's report puts it plainly:

"Success did not require centralization. It required access."

For automation builders, that sentence is practical. A useful first version might connect to the inbox, shared drive, portal, or storage bucket where the work already arrives. The workflow can convert long documents to Markdown for context, extract typed fields for operations, route uncertain values to review, and generate the spreadsheet, PDF, or task the team needs.

The first version is not a perfect source of truth. It is a controlled access layer around the current mess.

Messy Inputs Are Not One Category

"Messy data" is too broad to be useful as an engineering category. A scanned page, a mixed packet, and a stale supplier record fail in different ways.

Common document messes need different handling.

| Mess | Workflow choice |
| --- | --- |
| Scanned pages | Run OCR, but preserve layout, confidence, and source citations |
| Mixed packets | Classify document parts before extraction or generation |
| Tables | Preserve row and column relationships instead of raw text order |
| Version drift | Extract business meaning across old and new templates |
| Handwriting | Route uncertain fields based on risk, not document-level confidence |
| Reference mismatch | Compare extracted values against catalog or ERP records |
| Partial completion | Continue usable fields while missing or uncertain fields become workflow state |

Treating all of this as one cleanup problem leads to brittle automation. A better workflow asks what representation each step needs before it decides what to clean.

If the next step is RAG, the workflow may need clean Markdown with headings, tables, and page context. The document-to-markdown guide for RAG explains why table structure and section context matter before embeddings happen.

If the next step is an approval workflow, the system needs typed fields, confidence scores, citations, and validation rules. If the next step is a generated client report, the system needs approved values, not raw candidates.

Evidence Makes Messy Data Operational

Messy data becomes dangerous when the workflow hides uncertainty.

An AI step extracts EUR 4,283.50 from a scanned invoice. The number looks precise, but the workflow still needs to know whether the decimal separator was clear, whether a similar subtotal appeared nearby, and whether a correction note changed the amount. A human operator knows to ask those questions. A workflow needs signals that represent them.
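The decimal-separator question is one of the few that can be sketched in a handful of lines. The example below is a minimal illustration, not a production parser: `parse_amount` is a hypothetical helper, and the rule that a single separator followed by exactly three digits is ambiguous is an assumption made for this sketch.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple


def parse_amount(text: str) -> Tuple[Optional[Decimal], bool]:
    """Return (amount, needs_review) for a printed money string."""
    digits = re.sub(r"[^\d.,]", "", text)
    if not re.search(r"\d", digits):
        return None, True  # nothing readable, let a person look
    separators = [c for c in digits if c in ".,"]
    if not separators:
        return Decimal(digits), False
    last = max(digits.rfind("."), digits.rfind(","))
    whole = re.sub(r"[.,]", "", digits[:last])
    fraction = digits[last + 1:]
    if len(set(separators)) == 2:
        # Both kinds present: treat the last one as the decimal separator.
        return Decimal(f"{whole or '0'}.{fraction or '0'}"), False
    if len(separators) > 1:
        # One kind repeated, e.g. "1,234,567": grouping only.
        return Decimal(whole + fraction), False
    if len(fraction) == 3:
        # "4,283" could be 4283 or 4.283: do not guess.
        return None, True
    return Decimal(f"{whole or '0'}.{fraction or '0'}"), False


print(parse_amount("EUR 4,283.50"))
print(parse_amount("4,283"))
```

The point is the second return value. A parser that always returns a number hides the question a human operator would have asked.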

Without confidence, the workflow has two bad options: trust everything or review everything. Trusting everything sends bad values into accounting, CRM, compliance, or customer-facing artifacts. Reviewing everything removes the efficiency that made the workflow worth building.

Field-level confidence creates a third path:

| Field condition | Workflow action |
| --- | --- |
| Required field is high confidence | Continue automatically |
| Required field is missing | Stop and request input |
| Money, identity, or consent field is uncertain | Route to review |
| Optional note is uncertain | Store with uncertainty metadata |
| Source contradicts reference data | Escalate as exception |

The confidence score guide for human review covers the review architecture. For messy enterprise data, uncertainty should become workflow state, not a hidden model detail.
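The routing table above can be sketched as a small function. This is one possible shape, not a prescribed design: the `Field` and `Action` names, the single `THRESHOLD` of 0.90, and the handling of cases the table does not cover (a missing optional field, an uncertain required field that is not high risk) are all assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue"
    REQUEST_INPUT = "request_input"
    REVIEW = "review"
    STORE_WITH_UNCERTAINTY = "store_with_uncertainty"
    ESCALATE = "escalate"


@dataclass
class Field:
    name: str
    value: Optional[str]
    confidence: float                      # 0.0 to 1.0, from the extraction step
    required: bool = True
    high_risk: bool = False                # money, identity, or consent
    reference_value: Optional[str] = None  # catalog or ERP value, if one exists


THRESHOLD = 0.90  # hypothetical cutoff; real workflows tune this per field


def route(field: Field) -> Action:
    """Map one extracted field to a workflow action."""
    if field.value is None:
        # Assumption: a missing optional field is stored, not blocking.
        return Action.REQUEST_INPUT if field.required else Action.STORE_WITH_UNCERTAINTY
    if field.reference_value is not None and field.value != field.reference_value:
        return Action.ESCALATE
    if field.confidence >= THRESHOLD:
        return Action.CONTINUE
    if field.high_risk or field.required:
        # Assumption: uncertain required fields are reviewed even when not high risk.
        return Action.REVIEW
    return Action.STORE_WITH_UNCERTAINTY


total = Field("invoice_total", "4283.50", confidence=0.62, high_risk=True)
print(route(total))  # Action.REVIEW
```

The contradiction check runs before the confidence check on purpose: a value that disagrees with reference data is an exception even when the model is sure of what it read.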

Forms Are the Small Version of the Enterprise Problem

Forms make the pattern obvious because a blank form lies.

The template looks structured until real submissions arrive. People write outside the box, attach older versions, cross out checkboxes, leave required fields blank, use local date formats, scan at an angle, and photograph pages on a kitchen table.

The right extraction workflow asks for business meaning, not coordinates. It asks for applicant name, date of birth, consent status, requested amount, policy number, supplier tax ID, or signature date. Then it routes fields based on risk.

This is why messy forms need trustworthy fields. A moved checkbox should not silently return the wrong boolean. It should become an uncertain consent field with a review rule.

Supplier onboarding packets, insurance claims, loan applications, patient referrals, property listing packs, and legal exhibits all follow the same pattern. The source material is not clean, so the workflow has to be honest about it.

Store Source Records, Not Just Clean Results

One hidden danger in data cleanup projects is deleting the evidence too early.

If a workflow only stores the final clean value, it becomes hard to explain decisions later. A reviewer corrects a due date, a customer disputes an address, or a supplier says the bank account was changed. At that point the team needs the source record, not just the cleaned field.

The workflow needs more than the extracted value. It needs enough record structure to explain the value.

For document automation, useful records often include:

  • Source document identifier.
  • Processing timestamp.
  • Schema version.
  • Extracted value.
  • Confidence score.
  • Source citation.
  • Validation result.
  • Review status.
  • Approved value.
  • Generated artifact reference.
  • Delivery status.
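As a rough sketch, the list above maps onto a single record type. The field names, status strings, and the `approve` helper are illustrative assumptions, not a schema from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FieldRecord:
    """One extracted field plus the evidence needed to explain it later."""
    source_document_id: str
    processed_at: datetime
    schema_version: str
    field_name: str
    extracted_value: Optional[str]
    confidence: float
    source_citation: str               # e.g. page number and text snippet
    validation_result: str             # e.g. "passed" or "failed:format"
    review_status: str = "pending"
    approved_value: Optional[str] = None
    artifact_ref: Optional[str] = None
    delivery_status: str = "not_delivered"

    def approve(self, reviewer_value: Optional[str] = None) -> None:
        """Record the approved value without overwriting what was extracted."""
        if reviewer_value is None or reviewer_value == self.extracted_value:
            self.approved_value = self.extracted_value
            self.review_status = "approved"
        else:
            self.approved_value = reviewer_value
            self.review_status = "corrected"


record = FieldRecord(
    source_document_id="doc-001",
    processed_at=datetime.now(timezone.utc),
    schema_version="invoice-v2",
    field_name="due_date",
    extracted_value="2026-06-01",
    confidence=0.71,
    source_citation="page 1: 'Due 01/06/2026'",
    validation_result="passed",
)
record.approve("2026-01-06")  # reviewer reads the date as day/month
```

Keeping `extracted_value` and `approved_value` as separate fields is the design choice that matters here: when a customer disputes the date, the record shows both what the model read and what the reviewer decided.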

This does not mean every processor should retain every file. Retention has to match privacy, security, and client requirements. A zero-retention processing layer can discard files after processing while the application stores the business record needed to explain the decision.

Workflow memory belongs in the workflow, not inside the model response.

Where Cleanup Still Matters

Messy data is not a free pass.

Some inputs are too poor to use. A scan may be unreadable, a document may be the wrong type, a table may be missing required columns, or a form may contain contradictory answers. Some fields are too consequential to accept automatically even at high confidence.

Reference data still matters too. If the supplier catalog contains duplicate IDs or stale payment details, extraction cannot make the downstream decision safe by itself. The workflow can flag the mismatch, but the business still needs an owner for the source of truth.
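Flagging the mismatch can be as simple as comparing the extracted fields with the reference record and returning the disagreements. The sketch below assumes both sides are plain string dictionaries; `find_mismatches` and the whitespace and case normalization are hypothetical choices, and a real workflow would likely need field-specific comparison rules.

```python
from typing import Dict, Tuple


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case so cosmetic differences do not escalate."""
    return " ".join(value.split()).casefold()


def find_mismatches(
    extracted: Dict[str, str], reference: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {field: (extracted, reference)} for shared fields that disagree."""
    return {
        name: (extracted[name], reference[name])
        for name in sorted(extracted.keys() & reference.keys())
        if normalize(extracted[name]) != normalize(reference[name])
    }


from_document = {"supplier_name": "ACME  GmbH", "iban": "DE11"}
from_erp = {"supplier_name": "Acme GmbH", "iban": "DE22"}
print(find_mismatches(from_document, from_erp))
```

The function only reports the disagreement. Deciding whether the document or the ERP record is right is the ownership question the paragraph above describes, and it stays with the business.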

The practical order is different from the old advice:

  1. Start with the documents and systems where the work already happens.
  2. Build access and representation for the workflow.
  3. Extract typed fields or Markdown depending on the next step.
  4. Preserve citations and confidence.
  5. Route uncertainty before downstream action.
  6. Use the exceptions to identify which data cleanup work is actually worth doing.

That sequence lets teams learn from real files instead of spending months cleaning data that may never affect the workflow.

Where Iteration Layer Fits

Iteration Layer helps teams work with messy inputs without rebuilding the same processing layer for every workflow.

Document Extraction turns documents into typed fields with confidence scores and source citations. Document to Markdown turns long documents, tables, PDFs, and scans into readable Markdown for RAG and agent context. Generated document and sheet APIs turn approved data into reports, trackers, and client-ready artifacts.

That matters because messy enterprise data is rarely one operation. The workflow usually needs to read the input, preserve evidence, route exceptions, and produce an output another team can use.

Clean data is still valuable. The order changes: start with the files and systems where work already happens, build the access layer, route uncertainty, and let real exceptions show which cleanup work is worth doing next.