Training Data Provenance: The Manifest Diff That Explains the Hash
AI x Crypto · 2026-05-26 · via DEV Community

AI x Crypto Systems disclosure: this article was prepared with AI assistance as an editorial helper. The ideas, facts, code, sources, and conclusions were reviewed by a human.

AI x Crypto Systems disclosure: this article is a technical explanation, not investment advice. AI x Crypto Systems does not recommend buying, selling, or holding any cryptoasset.

Training Data Provenance

Training Data Provenance can look healthy while the dataset story is wrong. In the postmortem below, a model card points to sha256:9f22..., the training file has not changed, and the team still cannot answer why an opt-out record reached training. The incident is not a cryptography problem. The incident is a missing manifest diff.

incident: support-classifier-v7
symptom: opt-out record appears in training explanation sample
dataset_hash: sha256:9f22...
hash_status: correct
missing: source policy, exclusion report, reviewer status
impact: model build cannot prove why the record was included


Symptom

Training Data Provenance starts with a symptom, not a standard. The symptom in this case is a user asking why a support message covered by their opt-out appears in a model-review sample. The team checks the model card, finds the dataset hash, recomputes the digest, and confirms the final archive matches the recorded value. The team has byte identity, yet the review still cannot explain the record.

The symptom matters because it separates two questions that teams often merge. A hash answers "which bytes did we train on?" but the incident asks "why were these bytes allowed to exist in the training set?" Croissant 1.1 allows file objects to carry sha256 checksums, which is useful for byte identity. Training Data Provenance needs that anchor, but the anchor does not supply collection rights, exclusion history, or reviewer intent.

That is the bug.
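
The gap between the two questions can be seen in a few lines. This is a minimal sketch, assuming a tiny in-memory archive with made-up ticket contents; only `hashlib.sha256` is a real library call, and the archive bytes and variable names are illustrative.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical archive contents; assume ticket-1002 belongs to an opted-out user.
archive = b"ticket-1001,reset my password\nticket-1002,close my account\n"

# The value a model card would record at build time.
recorded = sha256_digest(archive)

# Recomputing the digest confirms byte identity.
hash_matches = sha256_digest(archive) == recorded

# A single extra byte is detected, so the digest does catch byte drift.
drift_detected = sha256_digest(archive + b"\n") != recorded

print(hash_matches, drift_detected)
```

Both checks pass, and neither says anything about whether ticket-1002 was allowed into the archive. That answer has to come from somewhere other than the digest.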

Triage Note

Training Data Provenance triage should write down what is known before it starts inventing explanations. Known: the final file hash is stable. Unknown: whether the source export included an opt-out list, whether the redaction transform ran after the opt-out merge, whether the reviewer accepted a limitation, and whether the model card copied the right manifest. Training Data Provenance fails when those unknowns are hidden behind a confident digest.

The first triage note should therefore read like this:

| Question | Current answer | Risk |
| --- | --- | --- |
| Did the final file change? | No; hash matches | Byte identity is not the issue |
| Was the source allowed? | Unknown | Rights review missing |
| Did exclusion run? | Unknown | Opt-out process may be absent |
| Who approved the manifest? | Unknown | Reviewer status missing |
| Can the model build be explained? | Partially | Hash exists; lineage does not |

Manifest Before

Training Data Provenance becomes concrete when the broken manifest is shown. The bad manifest below looks respectable because it has a dataset name, a final digest, and a transform list. It is still not enough. The fields that would explain source policy, exclusion evidence, and reviewer status are either absent or vague.

dataset: support-classifier-training
created_at: 2026-05-22T18:10:00Z
source: support-chat-export
rights_basis: internal
transforms:
  - normalize
  - dedupe-v1
sha256: 9f2277aa...
model_build: support-classifier-v7


Training Data Provenance rejects this manifest not because the YAML is bad but because the manifest cannot answer the incident. "Internal" is not a rights basis. support-chat-export is not a source record. dedupe-v1 does not say whether opt-outs were removed. The digest says the file is stable, not that the process was reviewable.
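
A reviewer could turn those objections into a mechanical check. The sketch below is one possible shape, not an established schema: the required field list and the set of "vague" values are assumptions chosen to match this incident, and the manifest is represented as a plain dict rather than parsed YAML.

```python
# Assumed review policy: fields a manifest must carry to answer the incident.
REQUIRED_FIELDS = (
    "source_records",
    "rights_basis",
    "transforms",
    "exclusion_report",
    "reviewer_status",
    "unresolved_risks",
    "sha256",
)

# Assumed placeholder values that name no policy and cannot be challenged.
VAGUE_VALUES = {"internal", "unknown", "tbd", ""}


def manifest_gaps(manifest: dict) -> list:
    """List the fields that are missing or too vague to review."""
    gaps = []
    for name in REQUIRED_FIELDS:
        if name not in manifest:
            gaps.append(f"missing: {name}")
            continue
        value = manifest[name]
        if isinstance(value, str) and value.strip().lower() in VAGUE_VALUES:
            gaps.append(f"vague: {name}={value!r}")
    return gaps


# The broken manifest from the incident, written as a dict.
before = {
    "dataset": "support-classifier-training",
    "created_at": "2026-05-22T18:10:00Z",
    "source": "support-chat-export",
    "rights_basis": "internal",
    "transforms": ["normalize", "dedupe-v1"],
    "sha256": "9f2277aa...",
    "model_build": "support-classifier-v7",
}

for gap in manifest_gaps(before):
    print(gap)
```

The check is deliberately shallow. It flags absent and placeholder fields, but it cannot tell whether a transform named dedupe-v1 removed opt-outs; that still needs a human reading the transform log.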

[Figure: Broken training data manifest postmortem]

Diff Evidence

Training Data Provenance improves when the repair is a diff, not an essay. The diff below is the smallest artifact that changes the review. It does not merely add more metadata; it adds the missing causal links: source version, opt-out list, rights policy, redaction transform, exclusion report, reviewer status, and unresolved risks.

 dataset: support-classifier-training
 created_at: 2026-05-22T18:10:00Z
-source: support-chat-export
-rights_basis: internal
-transforms:
-  - normalize
-  - dedupe-v1
-sha256: 9f2277aa...
+source_records:
+  - support-chat-export@2026-05-22
+  - opt-out-list@2026-05-22
+rights_basis: internal-use-policy-2026-04 + customer-exclusion-log
+transforms:
+  - normalize-v1
+  - pii-redaction-v3
+  - opt-out-removal-v2
+  - dedupe-v2
+exclusion_report: removed-records-8841.json
+reviewer_status: accepted_with_limits
+unresolved_risks:
+  - non-English coverage gap
+  - legacy tickets before consent-policy migration
+sha256: 4d81c0ee...
 model_build: support-classifier-v7
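
A diff of this kind does not have to be written by hand. As a rough sketch, Python's standard `difflib.unified_diff` can produce one from two manifest versions; the file names and the abbreviated manifest contents below are illustrative assumptions, not the full manifests from the incident.

```python
import difflib

# Abbreviated, hypothetical copies of the two manifests as lists of lines.
before = [
    "dataset: support-classifier-training\n",
    "source: support-chat-export\n",
    "rights_basis: internal\n",
    "model_build: support-classifier-v7\n",
]
after = [
    "dataset: support-classifier-training\n",
    "source_records:\n",
    "  - support-chat-export@2026-05-22\n",
    "  - opt-out-list@2026-05-22\n",
    "rights_basis: internal-use-policy-2026-04 + customer-exclusion-log\n",
    "reviewer_status: accepted_with_limits\n",
    "model_build: support-classifier-v7\n",
]

diff = list(
    difflib.unified_diff(
        before, after,
        fromfile="manifest.before.yaml",  # illustrative file names
        tofile="manifest.after.yaml",
    )
)
print("".join(diff))

# Top-level keys that were added or changed, skipping headers and list items.
touched_keys = sorted(
    line[1:].split(":", 1)[0]
    for line in diff
    if line.startswith("+")
    and not line.startswith("+++")
    and not line.startswith("+ ")
)
print(touched_keys)
```

Storing the generated diff next to the manifest gives the reviewer one artifact to approve instead of two files to compare by eye.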


Provenance Model

Training Data Provenance needs a model for the words "source", "transform", and "reviewer." W3C PROV describes provenance through entities, activities, and agents, which is a good mental model for this incident. The support export and opt-out list are entities. Redaction and opt-out removal are activities. The pipeline and reviewer are agents. Training Data Provenance gets better when the manifest names those roles instead of compressing them into one source string.

That model does not need a giant platform on day one. A small manifest can point to source files, transform logs, exclusion reports, reviewer decisions, and model builds. The important property is traceability: a reviewer should be able to walk from a model artifact to a dataset digest to a manifest diff to a source record. Training Data Provenance is useful when the walk is possible without asking someone to remember a pipeline meeting.
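
The walk can be sketched with a very small data structure. The code below borrows the entity/activity/agent vocabulary loosely; it is not a W3C PROV serialization, and the `Activity` class, the `sources_of` helper, and the `training-set@4d81c0ee` entity name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One pipeline step: what it used, what it generated, who ran it."""
    name: str
    used: tuple       # names of input entities
    generated: str    # name of the output entity
    agent: str        # pipeline or person responsible


def sources_of(entity: str, activities: list) -> list:
    """Walk back from an entity to the entities no recorded activity produced."""
    produced_by = {a.generated: a for a in activities}
    sources, stack, seen = set(), [entity], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        step = produced_by.get(current)
        if step is None:
            sources.add(current)      # nothing produced it: a source record
        else:
            stack.extend(step.used)   # keep walking upstream
    return sorted(sources)


activities = [
    Activity(
        "opt-out-removal-v2",
        used=("support-chat-export@2026-05-22", "opt-out-list@2026-05-22"),
        generated="training-set@4d81c0ee",
        agent="pipeline",
    ),
    Activity(
        "train",
        used=("training-set@4d81c0ee",),
        generated="support-classifier-v7",
        agent="pipeline",
    ),
]

print(sources_of("support-classifier-v7", activities))
# ['opt-out-list@2026-05-22', 'support-chat-export@2026-05-22']
```

If the opt-out list never shows up in the result, the manifest has answered the incident question without anyone having to recall how the pipeline was wired.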

Rights Layer

Training Data Provenance needs a rights layer because a digest can preserve disallowed data perfectly. MLCommons says Croissant 1.1 adds machine-actionable provenance, vocabulary interoperability, and governance metadata. That direction is important because AI dataset metadata is not only about loading files. The metadata has to carry usage restrictions, consent signals, and policy context where automation can inspect them.

The rights layer should not overpromise. A manifest field saying rights_basis: internal-use-policy-2026-04 does not prove that every source claim is true. It proves the build had a declared rights basis that reviewers can challenge. Training Data Provenance proves a recorded data story, not the moral or legal truth of every line in that story. That humility keeps cryptographic commitments from becoming compliance theater.

Transform Layer

Training Data Provenance usually fails in the transform layer. An opt-out list may exist, but the transform may run before the list is joined. A redaction step may exist, but only for English. Deduplication may exist, but may preserve a record through a near-duplicate. Datasheets for Datasets is still relevant because it asks creators to document motivation, composition, collection, preprocessing, uses, distribution, and maintenance. Training Data Provenance turns those questions into build artifacts.

The transform layer needs exact names. clean-data is not enough; pii-redaction-v3 and opt-out-removal-v2 are reviewable. dedupe is not enough; dedupe-v2 threshold=0.84 is reviewable. A Training Data Provenance manifest should make it possible to answer the question, "Did the exclusion step run after the opt-out list was loaded?"
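
With exact names, the ordering question becomes a lookup. This is a small sketch under the assumption that the pipeline writes an ordered event log; the `load:` and `run:` prefixes and both example logs are hypothetical.

```python
def ran_after(log: list, step: str, prerequisite: str) -> bool:
    """True only if both events are logged and `step` comes after `prerequisite`."""
    if step not in log or prerequisite not in log:
        return False
    return log.index(step) > log.index(prerequisite)


# Hypothetical ordered pipeline logs for the broken and the repaired run.
broken_run = [
    "load:support-chat-export@2026-05-22",
    "run:normalize-v1",
    "run:opt-out-removal-v2",           # ran before any opt-out list was loaded
    "load:opt-out-list@2026-05-22",
    "run:dedupe-v2 threshold=0.84",
]
repaired_run = [
    "load:support-chat-export@2026-05-22",
    "load:opt-out-list@2026-05-22",
    "run:normalize-v1",
    "run:pii-redaction-v3",
    "run:opt-out-removal-v2",
    "run:dedupe-v2 threshold=0.84",
]

step = "run:opt-out-removal-v2"
prerequisite = "load:opt-out-list@2026-05-22"
print(ran_after(broken_run, step, prerequisite))    # False
print(ran_after(repaired_run, step, prerequisite))  # True
```

A log that only said clean-data would make this check impossible, which is the practical argument for versioned transform names.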

Build Link

Training Data Provenance closes the incident only when the model build links back to the repaired manifest. The build record should contain code version, dataset manifest hash, training configuration, evaluation set, model artifact hash, and reviewer status. Without that link, the repaired manifest is just a document near the model, not part of the model's evidence chain.
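
One way to make that link checkable is to store a hash of the manifest itself inside the build record. The sketch below assumes the manifest is hashed as canonical JSON with sorted keys; that canonicalization, the field name `dataset_manifest_sha256`, and the angle-bracket placeholder values are illustrative choices, not something the incident record specifies.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """Hash a manifest via a canonical JSON form (sorted keys, no extra spaces)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_links_to(build: dict, manifest: dict) -> bool:
    """True if the build record points at exactly this manifest."""
    return build.get("dataset_manifest_sha256") == manifest_hash(manifest)


# Abbreviated repaired manifest.
manifest = {
    "dataset": "support-classifier-training",
    "source_records": ["support-chat-export@2026-05-22", "opt-out-list@2026-05-22"],
    "transforms": ["normalize-v1", "pii-redaction-v3", "opt-out-removal-v2", "dedupe-v2"],
    "reviewer_status": "accepted_with_limits",
}

# Hypothetical build record; values in angle brackets are placeholders.
build = {
    "model_build": "support-classifier-v7",
    "code_version": "<git commit>",
    "dataset_manifest_sha256": manifest_hash(manifest),
    "training_config": "<config path>",
    "evaluation_set": "<eval set id>",
    "model_artifact_sha256": "<artifact digest>",
    "reviewer_status": manifest["reviewer_status"],
}

# A later edit to the manifest breaks the link until the build record is updated.
edited = dict(manifest, reviewer_status="accepted")
print(build_links_to(build, manifest), build_links_to(build, edited))
```

The second check failing is the point: a manifest that changes after the build no longer counts as that build's evidence.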

The Data Provenance Initiative paper shows why this boring link matters: popular AI datasets can have missing, inconsistent, or unclear licensing and attribution metadata. Training Data Provenance should assume metadata is imperfect and preserve uncertainty explicitly. A field called unresolved_risks is not a weakness; it is the part of the receipt that tells the next reviewer where not to overclaim.

[Figure: Training data manifest diff]

Reviewer Status

Training Data Provenance needs reviewer status because automation cannot own every judgment. The reviewer should be able to mark a source as accepted, rejected, quarantined, or accepted with limits. The difference matters. Accepted means the source is fit for the declared use. Accepted with limits means the model owner must carry a caveat into the model card. Quarantined means the data should not enter training until a condition is resolved.

The reviewer status is also the first line a future incident responder should read. If the disputed record came from a source marked accepted with limits, the responder knows the risk was known. If the source was never reviewed, the responder knows the process failed. Training Data Provenance should make missing review state visible, not hide it behind a final archive digest.
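
The four states, plus the missing one, fit in a small enum. This is a sketch of how a responder's first reading could be encoded; the wording of each reading is an interpretation, and the reading for a rejected source in particular is an assumption, since the states are only partly defined above.

```python
from enum import Enum
from typing import Optional


class ReviewerStatus(Enum):
    ACCEPTED = "accepted"
    ACCEPTED_WITH_LIMITS = "accepted_with_limits"
    QUARANTINED = "quarantined"
    REJECTED = "rejected"


def status_from(manifest: dict) -> Optional[ReviewerStatus]:
    """Return the recorded status, or None when the field is absent."""
    raw = manifest.get("reviewer_status")
    return ReviewerStatus(raw) if raw is not None else None


def incident_reading(status: Optional[ReviewerStatus]) -> str:
    """What a responder learns from the status before reading anything else."""
    if status is None:
        return "never reviewed: the process failed"
    if status is ReviewerStatus.ACCEPTED_WITH_LIMITS:
        return "risk was known: check the caveat carried into the model card"
    if status is ReviewerStatus.QUARANTINED:
        return "blocked source reached training: find how the condition was bypassed"
    if status is ReviewerStatus.REJECTED:
        # Assumed reading; the text above does not define "rejected" in detail.
        return "rejected source reached training: find how it was included"
    return "declared fit for use: re-examine the review itself"


print(incident_reading(status_from({"sha256": "9f2277aa..."})))
print(incident_reading(status_from({"reviewer_status": "accepted_with_limits"})))
```

Treating an absent field as its own case, rather than defaulting it to accepted, is what keeps a missing review visible.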

Quarantine Step

Training Data Provenance should include a quarantine step when the manifest cannot answer a user-facing incident. Quarantine does not mean the entire model must be deleted immediately; it means the disputed source cannot be used for new training until the missing source, rights, transform, and reviewer fields are resolved. That step changes the operational posture from "we have a hash" to "we know which evidence is missing."

The quarantine record can be small, but it must be separate from the repaired manifest:

quarantine:
  source_record: support-chat-export@2026-05-22
  trigger: opt-out record found in model-review sample
  blocked_use: new training and benchmark publication
  allowed_use: incident reproduction in restricted environment
  exit_condition: opt-out-removal evidence and reviewer status attached


Training Data Provenance uses quarantine to prove restraint, not truth. The record proves the team stopped treating an underexplained dataset as clean. It does not prove the disputed record was maliciously included, and it does not prove the repaired pipeline is perfect. The value is narrower and more useful: the next model build cannot quietly reuse the same weak manifest.

The short quarantine step changes the incentives inside the team. Without quarantine, the easiest path is to keep training and promise to document the dataset later, which is how provenance debt becomes permanent. With quarantine, the missing manifest fields become release blockers instead of housekeeping tasks. Training Data Provenance is partly a technical artifact and partly a forcing function: the model owner must either attach the evidence or admit that the dataset cannot support the next build.
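
A release blocker only works if something checks it. The sketch below shows one possible gate, assuming the quarantine record is loaded as a dict and the exit condition is expressed as a set of evidence labels; the `exit_evidence` key and the `is_blocked` helper are hypothetical names, not fields from the record above.

```python
# The quarantine record, restated with set-valued fields for easy checking.
quarantine = {
    "source_record": "support-chat-export@2026-05-22",
    "blocked_use": {"new training", "benchmark publication"},
    "allowed_use": {"incident reproduction in restricted environment"},
    # Assumed encoding of the exit condition as required evidence labels.
    "exit_evidence": {"opt-out-removal evidence", "reviewer status"},
}


def is_blocked(source_record: str, use: str, quarantines: list, attached: set) -> bool:
    """True if an open quarantine forbids this use of this source."""
    for record in quarantines:
        if record["source_record"] != source_record:
            continue
        if record["exit_evidence"] <= attached:
            continue                  # exit condition met: quarantine is closed
        if use in record["blocked_use"]:
            return True
    return False


source = "support-chat-export@2026-05-22"

# No evidence attached yet: a new training run is refused.
print(is_blocked(source, "new training", [quarantine], set()))

# Incident reproduction stays possible while the quarantine is open.
print(is_blocked(source, "incident reproduction in restricted environment", [quarantine], set()))

# With both pieces of evidence attached, training is unblocked.
evidence = {"opt-out-removal evidence", "reviewer status"}
print(is_blocked(source, "new training", [quarantine], evidence))
```

Run as a pre-training step, a gate like this is what turns the missing fields from housekeeping into a failed build.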

Model Card Patch

Training Data Provenance should patch the model card after the manifest is fixed. A model card that only lists dataset_hash=sha256:9f22... invites the same failure later. The patched model card should include the manifest hash, source set, transform set, reviewer status, unresolved risks, and the quarantine history if any. The model card does not need to dump the full manifest into the card; it needs to point to the evidence that explains the hash.

A useful model-card line is specific: training_manifest=sha256:4d81c0ee; reviewer_status=accepted_with_limits; unresolved_risks=legacy tickets before consent-policy migration. That line is not pretty, but it is searchable and reviewable. Training Data Provenance improves when the model artifact carries enough pointers for a future incident responder to reconstruct the data story without Slack archaeology.
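
Because the line is plain key=value pairs, it can be generated from the manifest and parsed back by tooling. A minimal sketch, assuming values never contain ";" and keys never contain "="; both helper names are illustrative.

```python
def model_card_line(fields: dict) -> str:
    """Join fields into a single searchable 'key=value; key=value' line."""
    return "; ".join(f"{key}={value}" for key, value in fields.items())


def parse_model_card_line(line: str) -> dict:
    """Split the line back into fields; assumes no ';' inside values."""
    pairs = (part.strip().split("=", 1) for part in line.split(";"))
    return {key: value for key, value in pairs}


fields = {
    "training_manifest": "sha256:4d81c0ee",
    "reviewer_status": "accepted_with_limits",
    "unresolved_risks": "legacy tickets before consent-policy migration",
}

line = model_card_line(fields)
print(line)
print(parse_model_card_line(line) == fields)
```

Generating the line from the manifest, rather than typing it into the card, removes one chance for the card to copy the wrong manifest.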

Postmortem Close

Training Data Provenance closes this incident with a sentence that is more useful than "we had a hash." The better sentence is: model support-classifier-v7 trained on manifest 4d81c0ee, built from source export and opt-out list dated 2026-05-22, after pii-redaction-v3, opt-out-removal-v2, and dedupe-v2, with reviewer status accepted_with_limits and two unresolved risks. That sentence is long because the real data story is long.

The final lesson is narrow. Training Data Provenance should keep the hash, but the hash is the last line of the receipt, not the receipt itself. A digest catches byte drift. A manifest diff catches process drift. The model owner needs both before the next user asks why their record was in the training set.