惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

The Register - Security
The Register - Security
美团技术团队
Recent Announcements
Recent Announcements
MongoDB | Blog
MongoDB | Blog
Jina AI
Jina AI
C
Check Point Blog
aimingoo的专栏
aimingoo的专栏
I
InfoQ
S
Securelist
T
Tor Project blog
GbyAI
GbyAI
L
LINUX DO - 热门话题
V
Visual Studio Blog
AWS News Blog
AWS News Blog
The Cloudflare Blog
腾讯CDC
K
Kaspersky official blog
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
Recorded Future
Recorded Future
李成银的技术随笔
W
WeLiveSecurity
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
M
Microsoft Research Blog - Microsoft Research
G
Google Developers Blog
酷 壳 – CoolShell
酷 壳 – CoolShell
Schneier on Security
Schneier on Security
B
Blog
IT之家
IT之家
爱范儿
爱范儿
H
Help Net Security
Simon Willison's Weblog
Simon Willison's Weblog
NISL@THU
NISL@THU
J
Java Code Geeks
博客园 - 聂微东
T
The Exploit Database - CXSecurity.com
Cyberwarzone
Cyberwarzone
博客园 - 叶小钗
MyScale Blog
MyScale Blog
Application and Cybersecurity Blog
Application and Cybersecurity Blog
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Project Zero
Project Zero
F
Future of Privacy Forum
D
Darknet – Hacking Tools, Hacker News & Cyber Security
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
Hacker News: Ask HN
Hacker News: Ask HN
D
Docker
Apple Machine Learning Research
Apple Machine Learning Research
B
Blog RSS Feed
V
Vulnerabilities – Threatpost

DEV Community

Breaking the Monorepo Barrier in a Crypto Store for Digital Products Imposter Syndrome Is Something We All Struggle With at Some Point in Our Careers Moving Beyond the Black Box: How I Built a Real-Time Voice Fitness Coach using Next.js 15, Convex, & Vapi.ai From Spec-Driven Development to Attractor-Guided Engineering Githubster free tool to track your GitHub followers and unfollowers Why Bitcoin Core RPC is Too Slow for High-Frequency Trading (And How to Fix It) Why Reading Food Labels Shouldn't Feel Like Decoding a Chemistry Exam I built a "brain" for AI coding agents — it never forgets and never stops How to Build a Local LLM Agent to Automate Work List Generation from Monthly Reports (With Jira Integration) Controlling Employee AI Usage on Managed Devices: Browser Controls, Cloudflare AI Gateway, and AWS Bedrock When Global Payment Gateways Fail, Local Solutions Shine LeetCode Solution: 13. Roman to Integer End-to-End Observability for vLLM and TGI: from DCGM to Tokens LeetCode Solution: 12. Integer to Roman 🚀 A Beginner’s First Look at Project IDX: Secure Coding from Day One Team Topologies for DevOps: A Practical Implementation Guide Seven Contradictions Shaped an Architecture. Telemedicine in Venezuela: A Technical Guide for Clinics in 2026 SSO, SAML, OIDC, and SCIM: What Actually Happens When You Click "Sign in with Google" Mastering Next.js 16 Server Actions & Forms: The Future of Full-Stack React | Muhammad Arslan Enterprise Laravel API Development: Best Practices for Performance, Security, and Scale | Muhammad Arslan How I Turned an Image Into a 3D Model in Minutes With AI Why Pure Rust WASM Is Harder Than It Looks Platform Stores Are a Dead End for Crypto Payments The VLA Testing Pipeline in Mano-AFK: When AI Agents QA Their Own Work LeetCode Solution: 10. Regular Expression Matching IPv4 Geolocation and Leasing: A Practical Guide for Network Operators Reconciling the Inefficiencies of Global Crypto Payments Platforms I Exported HT-Demucs FT to ONNX in 2026 (4 Blockers Everyone Else Gave Up On) 🤖 The Hacker in the Machine: Using AI Agents to Build Interactive Security Games Savings Plan Amortized Cost in AWS Cost Explorer: What It Is and How to Use It How to Tailor Your Resume to a Job Description in 5 Minutes (A Method That Actually Works) Flutter vs React Native in 2026: I Built the Same App in Both JWT vs Session Tokens in Spring Boot: A Senior Dev's Decision Guide How to Choose an AI Gateway in 2026 How to Teach Source Evaluation When Your Students Use ChatGPT Why Passwordless B2C Rollouts Stall at 5% (and How to Reach 60%) Rmux Review: Rust Terminal Multiplexer Built for AI Agents I realized I was only using half of what Claude Code has to offer DevOps & Deployment Essentials: Your Practical CI/CD Guide How next-generation captchas work and why it matters for automation Chat is Dead: How JSON Prompting Cut My AI Costs by 73% What if Everybody Were Suddenly... Better? OCI Web Application Firewall (WAF) Deep Dive: Architecture, Traffic Inspection, Threat Protection, and Enterprise Security Design Selling Digital Products in a Country PayPal Refuses to Touch PostgreSQL backup tool Databasus released backup verification in real database Docker containers We Connected an LLM to a 12-Year-Old Codebase. Here's What Broke. The Fallacy of Digital Platforms: Why Stripe Isn't Always King Sizce Google'ın 26 Mayıs tarihinde arama bölümünü tamamen yapay zekaya devredecek olması açık webin devamı için nasıl sonuçlanır? When Should You Use GraphRAG Instead of RAG? Big Data Is Not Just About “Huge Data” The Prefix Bubble MPP TestKit VSCode Extension - Inline HTTP 402 Payment Flow Hints The README Was a Protocol. The Entrypoint Was Still Optional. After AI Healthcare, Medical World Models May Be the Next Life-Science AI Platform Your AI Agent Doesn't Need an API Key: Entra Agent ID and Anthropic's Workload Identity Federation ECDSA - The Math That Only Goes One Way S3 Files Killed My Least Favorite Lambda Pattern BNB RPC Endpoints for Production Apps and Backend Workloads I Used to Get Excited About New Tools Now I Feel Tired. Google I/O 2026 — What I Hoped to See Beyond the Model Announcements Most 'AI agents' are just scripts with a marketing budget 🚀 Replicating the evasive VoidLink: My Journey Building Cortex C2 # new stuff dropped in duckkit 🦆 Paying the bills in a restricted country with cryptocurrency: the lie that almost killed our digital product Building Global Economies Through Better APIs: Lessons from PayPal vs Crypto for Crypto Payments in Developing Countries Verified or Not? Ep. 2 — Snyk's Own Test App Scanned With 9 Engines 17 SessionAuth Tools in OpenClaw: Integrate Any AI Framework with Wallet Infrastructure WebMCP and the Citation Paradox — What Agent-Ready Websites Actually Mean for GEO What Gemma 4 Doesn't Know About Cameroon — and What That Taught Me About Building AI for the Real World AI Can Generate Code — And Interactive Coding Playgrounds Are Becoming Essential Modern Web Guidance: Teaching AI Agents to Stop Coding Like It's 2019 The Discipline We Forgot We Had I Built a 3-Agent AI Research Crew in 250 Lines of Python (LangGraph + Free Gemini) PostgreSQL MCP: Let Claude query your databases in plain English Building digital products and Android apps under IteraTrail Fuel Price API for Fleet Cost Planning Linux File System Explained Simply Building a shot-detection worker for an upload pipeline with PySceneDetect 0.7 Wiring VMAF (and PSNR) into your encoder CI with FFmpeg 8.1 and ffmpeg-quality-metrics Bikin Chatbot Sendiri yang Bisa Jawab Pertanyaan dari Dokumen kamu Learning Arabic: Where to Start Shipping WebVTT subtitles in HLS that actually stay in sync (a hands-on guide for 2026) Understanding AI Code Fast: A 60-Second Habit for Institutional Memory Building a Real-Time Camera Classifier Chasing Tokens: The Developer Grind Nobody Warned You About A 10th Grader’s Journey: Why Cyber Security Starts with Your Very First Loop Why Most Developer Portfolios Fail to Show Engineering Maturity Agent Loop and Harness: A Practical Engineering View of AI Operations I built Alpha Insights: AI business research with validators, not just prompts Polygon RPC Endpoints: Free, Dedicated, and Production Options BNB Chain RPC Provider Guide for Production Apps What Is a Nonce in Blockchain? Transaction Nonces Explained Testnet RPC Guide: Sepolia, BNB, Solana Devnet, and More Solana Devnet RPC Guide for Builders and QA Teams How to Choose an RPC Provider for Production Web3 Apps Best Hyperliquid RPC Provider for Low-Latency Apps Best Ethereum RPC API for Web3 Apps and Developers Base RPC Provider Guide for Production Web3 Apps New NPM package to add customizable avatar system for react project
How to Recover Kafka DLQ Messages After a Schema Change Broke Your Consumer
Mohammed Sai · 2026-05-21 · via DEV Community

The Problem Nobody Has a Tool For

Here is the scenario that wakes engineers up at 2am.

Your payment service has been running fine for months. The consumer reads events from a Kafka topic, processes them, sends confirmations. The status field is a plain String - "pending", "failed", "done".

A bug is discovered: the status field accepts invalid values like "ok" and "finished" from different producer teams, causing downstream analytics to break. The correct fix is converting status from String to a strict Enum with only four valid values: PENDING, PROCESSING, COMPLETED, FAILED.

The fix gets deployed at 9:45 PM.

Nobody checked the DLQ first.

There were 23,000 messages sitting in payments.dlq from an earlier consumer failure that afternoon. The new V2 consumer starts processing them. Each one crashes immediately:

InvalidFormatException: Cannot deserialize value of type `PaymentStatus` 
from String "pending": not one of the values accepted for Enum class: 
[PENDING, PROCESSING, COMPLETED, FAILED]

Enter fullscreen mode Exit fullscreen mode

All 23,000 messages are now permanently stuck. The V2 consumer cannot read them. They cannot be redriven as-is. PagerDuty fires at 2:00 AM.


Why Schema Registry Doesn't Help Here

The first suggestion you'll get is "just use Schema Registry."

Schema Registry is a seatbelt. It prevents new schema-incompatible events from being produced. It does nothing for messages already sitting in your DLQ from before the schema change.

Those messages are stored as bytes in Kafka. Schema Registry has no knowledge of them. It cannot transform them. Cannot validate them. Cannot redrive them.

The 23,000 stuck messages are entirely your problem.


What Engineers Actually Do (The Manual Reality)

I posted this exact scenario on r/apachekafka and asked how senior teams handle it. Four experienced engineers replied. Not one named a tool.

Their answers:

BroBroMate (Senior Engineer): "Something I've done in the past is just throw a quick Kafka Streams app up to do mass transformations... And easy to unit test before rolling out."

KTCrisis: "You could also spin up a v2 topic for the new consumers and keep a specific v1 consumer around just to drain the DLQ. But it adds a new topic and a dedicated consumer for a one-shot issue."

Every solution described is either:

  • Building a temporary application from scratch, using it once, deleting it
  • Keeping zombie infrastructure alive until the queue drains
  • Writing a throwaway Python script with no tests, no audit trail, no validation

The time cost: 3-8 hours of a senior engineer's time. At 2am. Every single incident.


The Gap No Existing Tool Fills

I looked at every existing solution:

DLQMan (irori-ab) - routes messages based on exception type headers. No transformation engine. If your messages have the wrong data format, DLQMan cannot help.

Confluent Control Center - has basic redrive. No schema transformation. Requires full Confluent Enterprise stack.

kafka-rewind-tools - handles the redrive part. You still need to write the mutation yourself.

Custom Kafka Streams apps - what everyone builds from scratch, uses once, and throws away.

The gap: no tool handles the transformation step between "message has wrong format" and "message is safe to redrive".


Building DLQ Revive

I spent 5 weeks building the tool that should exist.

DLQ Revive is an open-source Kafka Dead Letter Queue mutation and redrive engine. Here is what it does and the architectural decisions behind each feature.

1. Browse DLQ Messages Safely

// WRONG - what most people do
consumer.subscribe(List.of("payments.dlq"));  // joins consumer group!

// CORRECT - what DLQ Revive does
consumer.assign(List.of(new TopicPartition("payments.dlq", 0)));
consumer.seek(new TopicPartition("payments.dlq", 0), fromOffset);

Enter fullscreen mode Exit fullscreen mode

Using subscribe() joins your application to Kafka's consumer group. Kafka can then trigger a rebalance and assign partitions to your read-only viewer - stealing them from your production consumer. Your 2am debugging tool accidentally takes down your production pipeline.

assign() + seek() bypasses group membership entirely. You read exactly what you want. The production consumer is untouched. DLQ Revive never calls commitSync() in view mode.

Reads are paginated at max 100 messages per API call. No full topic loads. No JVM OutOfMemoryError on topics with 500,000 stuck messages.

2. Transform Schema With JSONata

The transformation step is what makes DLQ recovery possible without writing custom code.

DLQ Revive uses JSONata - a declarative JSON-to-JSON mapping language. To fix the String to Enum problem from our scenario:

{
  "orderId": orderId,
  "amount": amount,
  "currency": currency,
  "status": $uppercase(status),
  "processedAt": $now()
}

Enter fullscreen mode Exit fullscreen mode

One expression. Applies to all 23,000 messages. Preview 5 samples before committing. See the before and after side by side.

Why not Groovy? User-submitted Groovy on your backend can call Runtime.getRuntime().exec("rm -rf /"). It is an RCE vulnerability built into the product if you allow it. JSONata is purely declarative - it has no access to the file system, network, or system commands. Zero RCE surface.

3. Validate Before Redriving

BroBroMate's advice from that Reddit thread was exactly right:

"Whatever way you do it, validate every 'fixed' record against an agreed good schema before producing it - there's nothing worse than dumping another X thousand bad messages on the DLQ."

DLQ Revive validates each transformed message against the target schema before producing. Failed validation shows per-message errors. You see exactly which messages will fail before redriving a single one.

4. Idempotency Guard - Pod-Restart Safe

-- Before producing any message:
SELECT 1 FROM redrive_log 
WHERE topic = ? AND partition = ? AND offset = ?

-- If not found, insert THEN produce:
INSERT INTO redrive_log (topic, partition, offset, redriven_at, redriven_by)
VALUES (?, ?, ?, NOW(), ?)

Enter fullscreen mode Exit fullscreen mode

The redrive_log table has a UNIQUE(topic, partition, offset) constraint.

If your application crashes halfway through redriving 10,000 messages and restarts, it will not reprocess the messages it already produced. In a fintech payment pipeline, that is the difference between a successful recovery and double-charging 5,000 customers.

5. Full Audit Trail

Every browse and redrive action is logged to PostgreSQL with timestamp, user, session ID, message count, and source/target topics. For teams in regulated industries (PCI-DSS, SOC 2), this is the compliance evidence that a manual throwaway script can never provide.


Quick Start

git clone https://github.com/Saifulhuq01/dlq-revive.git
cd dlq-revive
docker compose -f docker/docker-compose.yml up -d
# Open http://localhost:4200

Enter fullscreen mode Exit fullscreen mode

That starts Kafka, Zookeeper, PostgreSQL, the Spring Boot backend, and the Angular dashboard together.

Stack: Java 17, Spring Boot 3.x, Apache Kafka 3.x, Angular 13+, PostgreSQL 15, JSONata, Docker Compose, GitHub Actions CI.


The Kafka Consumer Safety Details

A few things worth understanding if you are building Kafka tooling:

Why assign() instead of subscribe():
subscribe() participates in consumer group coordination. Kafka's group coordinator can reassign partitions at any time during a rebalance. For a read-only tool this is dangerous - you don't want Kafka moving partitions between your tool and the production consumer mid-operation.

assign() with seek() gives you direct partition access. No group membership. No rebalance risk. The offset you seek to is exactly where you start reading. You consume exactly limit records and stop.

Why pagination at the Kafka offset level:
Standard REST pagination (page 1, page 2) doesn't work well for Kafka topics because the "next page" needs to know the exact offset where the previous page ended. DLQ Revive uses cursor-based pagination: fromOffset=0&limit=50 returns messages at offsets 0-49. The next call uses fromOffset=50. This is memory-safe regardless of topic size.

Why no commitSync() in view mode:
Committing offsets in view mode would change the consumer group's position - affecting what the production consumer reads next if they share a group ID. DLQ Revive uses a unique, isolated consumer group and never commits in view mode.


What's Next

The open-source core handles:

  • Paginated DLQ browsing
  • JSONata schema transformation
  • Preview before redriving
  • Idempotency-safe redrive
  • Full PostgreSQL audit trail
  • One-command Docker setup

Free tier limit: 100 messages per redrive session. This is enough for testing and small incidents. Bulk redrive (>100 messages) and team features are in the cloud tier.

GitHub: github.com/Saifulhuq01/dlq-revive

MIT licensed. If you've dealt with DLQ schema recovery at work and want to try it against a real scenario, I'd genuinely value your feedback. Especially on the JSONata approach and whether the safety guarantees hold up under your specific Kafka setup.


Mohammed Saifulhuq - Apache Fineract contributor, building DLQ Revive