The Idempotency Nightmare in AI Pipelines: Data Loss and Recovery
Mustafa ERBA · 2026-05-15 · via DEV Community

Mustafa ERBAY

In this post, I'll share an "idempotency" issue I recently faced in an AI-powered data processing pipeline, which led to both time and data loss, and how I resolved it. I'll try to convey through my own experiences how critical idempotency can be, especially in error scenarios, as it's one of the subtle details we might overlook when building such systems.

What is Idempotency and Why is it Important?

Idempotency means that executing an operation multiple times leaves the system in the same state as executing it once. As a simple example, incrementing a variable by 10 is not idempotent, because each run produces a different result. Setting a variable to 0 is idempotent: no matter how many times you run it, the result is always 0.
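The two operations described above can be written out directly. This is only a minimal sketch; the function names and the `counter` dictionary are made up for illustration.

```python
def increment_by_ten(state: dict) -> dict:
    """Not idempotent: every call changes the result again."""
    state["value"] += 10
    return state


def reset_to_zero(state: dict) -> dict:
    """Idempotent: one call or many calls leave the same state."""
    state["value"] = 0
    return state


counter = {"value": 5}
increment_by_ten(counter)
increment_by_ten(counter)
print(counter["value"])  # 25: the second call changed the result

counter = {"value": 5}
reset_to_zero(counter)
reset_to_zero(counter)
print(counter["value"])  # 0: repeating the call changed nothing
```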

In software systems, especially distributed systems and pipelines involving components like message queues, idempotency is vital. Unexpected situations such as network interruptions, service crashes, or duplicate messages can cause the same request to be processed multiple times. If the processed operation is not idempotent, this can lead to data inconsistency, duplicate records, or unintended side effects.

ℹ️ Technical Depth: Why is Idempotency Critical?

Especially in messaging systems (like Kafka, RabbitMQ) or API calls, when "at-least-once" delivery is guaranteed, there's a possibility of messages or requests being processed multiple times. In such cases, the application must have idempotency mechanisms correctly implemented to tolerate these duplicate processes. Otherwise, for example, if an order creation request is processed twice, two orders might be created, leading to serious financial problems.
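To make the order example concrete, here is a small sketch of a consumer that tolerates a redelivered message. The message shape, the `orders` dictionary, and the `handle_order_message` function are assumptions for illustration, not part of any real broker API.

```python
orders = {}  # hypothetical order store: order_id -> order payload


def handle_order_message(message: dict) -> bool:
    """Create the order only if this order_id has not been seen yet."""
    order_id = message["order_id"]
    if order_id in orders:
        return False  # redelivered message: nothing new to create
    orders[order_id] = {"item": message["item"], "quantity": message["quantity"]}
    return True


# At-least-once delivery: the same message arrives twice.
deliveries = [
    {"order_id": "o-1", "item": "book", "quantity": 1},
    {"order_id": "o-1", "item": "book", "quantity": 1},
]
created = [handle_order_message(m) for m in deliveries]
print(created, len(orders))  # [True, False] 1
```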

The Problem I Faced: Unexpected Duplicates in an AI Pipeline

In a project I was recently working on, I had set up a pipeline that processed user inputs and passed them through a series of AI models. This pipeline took each incoming input, passed it through preprocessing steps, then sent it to different AI models, and finally saved the results to a database. The system had a structure that checked whether each step was successful and retried the relevant step in case of an error.

The problem arose when a request failed to get a response from an AI model and the system retried that step. There was a brief network instability: the first request reached the model, but no response came back. Since no response was received, the pipeline marked the step as "failed" and triggered the retry mechanism. On the second attempt, the model responded successfully, and the result was saved to the database. Meanwhile, the first request eventually completed asynchronously in the background, after the retry, and saved the same data again.
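The sequence of events can be replayed in a few lines. This is a simplified, single-threaded simulation, not the actual pipeline code: `saved_rows` stands in for the database table, and the "late response" is modeled as a queued item that is saved after the retry.

```python
saved_rows = []  # stands in for the results table


def save_result(input_id: str, output: str) -> None:
    """Naive save: appends a row every time it is called."""
    saved_rows.append({"input_id": input_id, "output": output})


def run_step_with_retry(input_id: str) -> list:
    """Replay the incident: attempt 1 times out, attempt 2 succeeds,
    then the delayed result of attempt 1 arrives and is saved as well."""
    late_responses = []

    # Attempt 1: the model received the request, but the response is
    # delayed, so the pipeline treats the step as failed.
    late_responses.append((input_id, "model output"))

    # Attempt 2 (the retry): the response arrives in time and is saved.
    save_result(input_id, "model output")

    # Later, the delayed result from attempt 1 finally arrives.
    for delayed_id, delayed_output in late_responses:
        save_result(delayed_id, delayed_output)

    return saved_rows


rows = run_step_with_retry("input-42")
print(len(rows))  # 2 rows for a single input
```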

⚠️ Real Scenario: Not Data Loss, but Data Duplication!

In this scenario, there wasn't direct data loss, but we encountered duplicate data being saved. If the saved data had a unique key (e.g., a transaction ID), this situation could have led to data integrity issues. Although it didn't seem like data loss, data duplication also severely compromised the pipeline's reliability. We detected that over a period of about 3 hours, more than 100 duplicate records were created when this mechanism was triggered.

Why Wasn't an Idempotency Mechanism in Place?

The oversight of idempotency in such a pipeline was a disappointment for me as well. I believe there were a few primary reasons for this:

  1. Default Trust: Modern services and messaging systems usually advertise delivery guarantees such as "at-least-once" or "exactly-once" (the latter being harder to achieve). These guarantees can lull developers into putting duplicate-processing scenarios on the back burner, even though at-least-once delivery explicitly allows duplicates.
  2. Complexity: Implementing idempotency mechanisms correctly introduces additional complexity, especially in distributed systems. Labeling each step with a unique ID, checking these IDs, and managing states can extend the development process.
  3. Prioritization: At the project's inception, getting the pipeline deployed quickly and ensuring basic functionality were higher priorities. Issues like idempotency, which are considered "edge cases," were listed among topics to be addressed later. However, these so-called "edge cases" are often among the most frequent problems encountered in production environments.

💡 Trade-off: Speed vs. Reliability

This situation highlights a common trade-off in software development: speed and functionality, or long-term reliability and robustness? Often, a balance needs to be struck between the two. In this project, focusing on speed in the first phase necessitated adding such resilience mechanisms in the second phase.

The Solution Process: Integrating Idempotency into the Pipeline

After identifying the problem, I evaluated several different approaches for the solution.

1. Record-Based Uniqueness Control

The first method that came to mind was using uniqueness constraints at the database level. If each piece of data to be saved has a unique identifier (e.g., a request_id or transaction_id), the database can enforce this uniqueness rule and prevent duplicate records.

However, this approach had some limitations:

  • Not Applicable to All Data Structures: Some steps in the pipeline processed intermediate data that wasn't directly saved to the database with a unique key. It wasn't possible to impose a database-level constraint for these steps.
  • Error Messages: When a database uniqueness error occurred, it was necessary to catch this error and communicate it meaningfully to the user or system. This meant additional coding.
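As a rough illustration of the database-level approach, including the extra error handling mentioned above, the sketch below uses Python's built-in sqlite3 module as a stand-in for the real database. The `results` table and the `save_once` helper are hypothetical names.

```python
import sqlite3

# In-memory SQLite is an assumption made so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (request_id TEXT PRIMARY KEY, output TEXT NOT NULL)"
)


def save_once(request_id: str, output: str) -> bool:
    """Return True if the row was stored, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO results (request_id, output) VALUES (?, ?)",
                (request_id, output),
            )
        return True
    except sqlite3.IntegrityError:
        # The uniqueness constraint rejected a second row for this request.
        return False


first = save_once("req-1", "model output")
second = save_once("req-1", "model output")
row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(first, second, row_count)  # True False 1
```

Turning the constraint violation into a boolean keeps the duplicate case out of the error path, so callers can treat it as a normal outcome.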

2. Application-Level Idempotency Key

A more robust solution was to assign a unique "idempotency key" to each request and track this key through every step of the operation. This key could be a UUID (Universally Unique Identifier) or a custom ID generated by the client.

The intended workflow was:

  1. Request Generation: For each main piece of data entering the pipeline, a unique idempotency_key is generated. This key is passed along with the request into the pipeline.
  2. State Tracking: When each operation step begins, the system stores this idempotency_key and the step it's on in a cache (e.g., Redis) or a dedicated database table.
  3. Duplicate Request Check: If another request arrives with the same idempotency_key, the system first checks if this key has been processed before.
    • If the request was processed successfully before, the operation is not run again, and the previous successful result is returned.
    • If the request was processed but failed, and the retry mechanism was triggered, this situation is managed (perhaps the error is logged, or a different strategy is followed).
  4. Successful Operation: When the operation completes successfully, the idempotency_key's status is updated to "completed."

This approach can be used to prevent duplicate processing at any point in the pipeline.
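The workflow above might look roughly like this in Python. It is a single-process sketch under simplifying assumptions: `state_store` is a plain dictionary rather than Redis or a database table, `run_idempotent` and `call_model` are hypothetical names, and a failed key is simply retried.

```python
import uuid
from typing import Callable

# Hypothetical in-memory store: idempotency_key -> {"status", "result"}
state_store: dict = {}


def run_idempotent(idempotency_key: str, operation: Callable[[], str]) -> str:
    record = state_store.get(idempotency_key)

    # Duplicate request check: reuse the earlier successful result.
    if record is not None and record["status"] == "completed":
        return record["result"]

    # State tracking: note that work on this key has started.
    # A previously failed key lands here too and is simply retried.
    state_store[idempotency_key] = {"status": "processing", "result": None}
    try:
        result = operation()
    except Exception:
        state_store[idempotency_key]["status"] = "failed"
        raise

    # Successful operation: store the result for later duplicates.
    state_store[idempotency_key] = {"status": "completed", "result": result}
    return result


calls = []


def call_model() -> str:
    calls.append(1)  # count how often the model is really invoked
    return "model output"


key = str(uuid.uuid4())  # request generation
first = run_idempotent(key, call_model)
second = run_idempotent(key, call_model)  # duplicate delivery
print(first == second, len(calls))  # True 1
```

A production version would also need an atomic way to claim a key so that two concurrent workers cannot both see it as unclaimed; this sketch does not attempt that.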

💡 State Management with Redis

In-memory data stores like Redis are very effective for this kind of state tracking. You can use the idempotency_key as the key and the operation's status (e.g., "processing," "completed," "failed") and perhaps a timestamp as the value. Redis's TTL (Time To Live) feature allows you to automatically clean up old and no longer needed state records. For example, you can ensure an operation is cleaned up after 24 hours.
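A sketch of this idea follows. With the redis-py client, `set(name, value, nx=True, ex=seconds)` writes the key only if it does not exist yet and attaches a TTL in the same call. So that the example runs without a server, it uses a tiny in-memory `FakeRedis` class that mimics only that subset of the interface; the class, the `idem:` key prefix, and the helper names are assumptions for illustration. A real client also returns bytes from `get()` unless it is created with `decode_responses=True`.

```python
import time


class FakeRedis:
    """Tiny in-memory stand-in that mimics the subset of redis-py's
    set()/get() used below, so the sketch runs without a server."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, ex=None, nx=False):
        self._purge(name)
        if nx and name in self._data:
            return None  # like redis-py: None when NX blocks the write
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[name] = (value, expires_at)
        return True

    def get(self, name):
        self._purge(name)
        entry = self._data.get(name)
        return entry[0] if entry else None

    def _purge(self, name):
        # Drop the entry if its TTL has elapsed.
        entry = self._data.get(name)
        if entry and entry[1] is not None and entry[1] <= time.monotonic():
            del self._data[name]


ONE_DAY_SECONDS = 24 * 60 * 60  # the 24-hour cleanup window from the text


def try_claim(client, idempotency_key: str) -> bool:
    """Atomically mark the key as 'processing' unless it already exists."""
    return bool(
        client.set(f"idem:{idempotency_key}", "processing", nx=True, ex=ONE_DAY_SECONDS)
    )


def mark_completed(client, idempotency_key: str) -> None:
    client.set(f"idem:{idempotency_key}", "completed", ex=ONE_DAY_SECONDS)


client = FakeRedis()
print(try_claim(client, "key-1"))   # True: first request claims the key
print(try_claim(client, "key-1"))   # False: duplicate is rejected
mark_completed(client, "key-1")
print(client.get("idem:key-1"))     # completed
```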

Implementation Details and Challenges Encountered

I decided to implement this "application-level idempotency key" approach. Here are some details of the process and the challenges I faced:

  • Key Generation: I used Python's uuid.uuid4() function to generate unique idempotency_keys. This generates keys with a high probability of being unique.
  • State Storage: Initially, I considered Redis for storing states. However, I realized that managing individual Redis connections for each service running across different parts of the pipeline would be complex. Therefore, I decided to write to a central data store (in this case, a new table in PostgreSQL) at each step. The table structure was as follows:
CREATE TABLE idempotency_log (
    idempotency_key UUID PRIMARY KEY,
    operation_name VARCHAR(255) NOT NULL,
    status VARCHAR(50) NOT NULL CHECK (status IN ('PROCESSING', 'COMPLETED', 'FAILED')),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);


  • Updating Pipeline Steps: Each processing step in the pipeline was updated to use this table. When a step began, a record was first inserted into the idempotency_log table, and the status was set to 'PROCESSING'. When the operation completed, the status was updated to 'COMPLETED', or marked as 'FAILED' in case of an error.
  • Error Handling: The most challenging part was managing error scenarios. If a step was marked as 'FAILED' and the system retried, the status of the corresponding record in the idempotency_log table had to be set back to 'PROCESSING'. However, this also required knowing why the previous attempt failed, which a single status field cannot record. Tracking each attempt separately, rather than only the latest status, was the more logical design. Consequently, I added a unique attempt_id for each operation and tracked the state by combining the idempotency_key and attempt_id.

🔥 Actual Error: Status Update Issue

In my first attempt, I only kept the status as 'PROCESSING' and 'COMPLETED'. When an error occurred, a new attempt was made. However, if the first attempt failed and the second attempt completed successfully, the record in the idempotency_log table remained 'COMPLETED'. This meant that if a third attempt was made, the system would see it as a duplicate operation and prevent it. To fix this, I needed to either keep a separate record for each attempt or add more information to the status field. Ultimately, I added a unique attempt_id for each operation and tracked the state by combining the idempotency_key and attempt_id.
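One possible shape for per-attempt tracking is sketched below, again using the built-in sqlite3 module as a stand-in for PostgreSQL. The `idempotency_attempts` table, the sequential integer attempt_id, and both helper functions are assumptions for illustration; the post does not show the final schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute(
    """
    CREATE TABLE idempotency_attempts (
        idempotency_key TEXT NOT NULL,
        attempt_id INTEGER NOT NULL,
        status TEXT NOT NULL
            CHECK (status IN ('PROCESSING', 'COMPLETED', 'FAILED')),
        PRIMARY KEY (idempotency_key, attempt_id)
    )
    """
)


def start_attempt(idempotency_key: str):
    """Open a new attempt, or return None if the key already completed."""
    with conn:
        done = conn.execute(
            "SELECT 1 FROM idempotency_attempts "
            "WHERE idempotency_key = ? AND status = 'COMPLETED'",
            (idempotency_key,),
        ).fetchone()
        if done:
            return None
        last = conn.execute(
            "SELECT COALESCE(MAX(attempt_id), 0) FROM idempotency_attempts "
            "WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()[0]
        attempt_id = last + 1
        conn.execute(
            "INSERT INTO idempotency_attempts VALUES (?, ?, 'PROCESSING')",
            (idempotency_key, attempt_id),
        )
        return attempt_id


def finish_attempt(idempotency_key: str, attempt_id: int, status: str) -> None:
    """Record the outcome of one specific attempt."""
    with conn:
        conn.execute(
            "UPDATE idempotency_attempts SET status = ? "
            "WHERE idempotency_key = ? AND attempt_id = ?",
            (status, idempotency_key, attempt_id),
        )


a1 = start_attempt("key-1")
finish_attempt("key-1", a1, "FAILED")      # first try fails
a2 = start_attempt("key-1")
finish_attempt("key-1", a2, "COMPLETED")   # retry succeeds
a3 = start_attempt("key-1")                # late duplicate is refused
print(a1, a2, a3)  # 1 2 None
```

Because each attempt keeps its own row, the failed first attempt stays visible for debugging even after the retry completes.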

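  • Performance Impact: Performing database queries at each step slightly affected the overall performance of the pipeline. Especially under heavy traffic, it was necessary to ensure that these additional queries did not cause delays by correctly setting up database indexes and optimizing queries. Making sure lookups by idempotency_key were backed by an index was critical in this regard. Note that in PostgreSQL a PRIMARY KEY constraint already creates a unique index, so an explicit index like the one below is only needed where idempotency_key is not already the leading column of a primary key or unique index.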
CREATE INDEX idx_idempotency_key ON idempotency_log (idempotency_key);


Conclusion and Lessons Learned

This experience once again showed me the value of sharing my own problems and solutions directly rather than writing in a corporate consultant tone. Idempotency in AI pipelines is not just a "nice-to-have" feature but a critical requirement; without it, a pipeline is exposed to serious data duplication, loss, or inconsistency.

The time and effort I spent to resolve this issue showed how costly it can be to overlook idempotency initially. Approximately 8 hours of downtime and over 100 duplicate records taught me that I needed to give this topic more importance.

ℹ️ Future Steps and Improvements

Following this experience, I began developing a more robust idempotency mechanism for each step of the pipeline. This mechanism will record the result of each step and the attempt in which it was completed. I will also consider integrating a faster store such as Redis, especially for frequently accessed states. As I mentioned in my previous post, "Asynchronous Operation Management and Debugging", debugging and resilience in distributed systems should always be a priority.

I hope this experience will be useful for other developers facing similar issues. It's important to remember that no matter how complex a system becomes, paying attention to fundamental principles is the key to preventing major problems in the long run.