Retry logic, Kafka consumer lag, and the hidden failure pattern that Kubernetes won’t catch

Retries are one of those features that almost every distributed system eventually gets.

Downstream timeout?

Retry.

Temporary network issue?

Retry.

Intermittent dependency failure?

Retry.

The logic makes sense.

But here’s a question:

What happens when retries start generating more traffic than your users?

That sounds strange at first.

But in cloud-native payment systems, retries can become one of the fastest ways to amplify degradation.

Let’s walk through a realistic scenario.

⸻

The architecture

Consider a representative payment workflow:

API Gateway
↓
Payment Service
↓
Fraud Service
↓
Ledger Service
↓
Kafka
↓
Notification Service

Typical stack:

Spring Boot microservices
Kafka event communication
Kubernetes
Redis
PostgreSQL / Oracle
Resilience4j
HikariCP

Looks straightforward.

⸻

The “safe” configuration change

Suppose intermittent downstream failures appear.

Someone increases retries:

resilience4j:
  retry:
    instances:
      fraudService:
        maxRetryAttempts: 10
        waitDuration: 100ms

Originally:

maxRetryAttempts: 3

No redesign.

No architecture changes.

Just more retries.

Seems harmless.

⸻

Now introduce latency

Fraud Service latency increases:

50ms → 4s

Not failure.

Latency.

Pods remain healthy.

Readiness probes pass:

readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080

CPU remains normal.

The HPA keeps comparing CPU against its target, and blocked threads burn almost none:

averageUtilization: 70

No scaling event.

Everything looks healthy.

⸻

But hidden pressure begins building

Payment Service threads begin waiting:

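// getScore returns a future, but the caller still blocks once it joins on the result
CompletableFuture<ScoreResponse> score =
    fraudClient.getScore(request);
ScoreResponse result = score.join();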

Threads remain occupied longer.
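
A rough way to size that pressure is Little's law: threads in use is roughly request rate times the time each call holds a thread. The sketch below is a back-of-the-envelope estimate, not a measurement. The `ThreadOccupancy` name and the 100 requests per second used in the note are assumptions for illustration, and it assumes every call blocks one thread for its full duration.

```java
// Hypothetical estimate of threads held by in-flight Fraud Service calls (Little's law).
final class ThreadOccupancy {

    /** Average busy threads = arrival rate x time each call holds a thread. */
    static double busyThreads(double requestsPerSecond, double latencyMillis) {
        if (requestsPerSecond < 0 || latencyMillis < 0) {
            throw new IllegalArgumentException("rate and latency must be non-negative");
        }
        return requestsPerSecond * latencyMillis / 1000.0;
    }
}
```

At an assumed 100 requests per second, `busyThreads(100, 50)` works out to 5 threads and `busyThreads(100, 4000)` to 400, before a single retry is counted. That is already more than the 200 worker threads embedded Tomcat allows by default.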

Consumers process records slower.

Committed Kafka offsets fall further behind the end of the log.

Retries kick in.

Traffic multiplies.

What started as:

100 requests

can become:

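100 requests

retries (up to about 10 attempts per request at the setting above)
retries of those retries (when an upstream caller retries too)
downstream calls behind every attempt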

No new customers arrived.

The system generated extra load itself.
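
The arithmetic is worth spelling out, because retries at different layers multiply rather than add. Here is a minimal sketch, assuming each configured value counts the initial call as the first attempt and that every attempt fails. `RetryAmplification` and the gateway retry count in the note are hypothetical, not taken from any real configuration.

```java
// Hypothetical worked example: worst-case call volume when retries stack across layers.
final class RetryAmplification {

    /**
     * Worst-case calls reaching the slow dependency when every attempt fails.
     * attemptsPerLayer holds the maximum attempts (initial call included)
     * configured at each layer on the path to that dependency.
     */
    static long worstCaseCalls(long userRequests, int... attemptsPerLayer) {
        long calls = userRequests;
        for (int attempts : attemptsPerLayer) {
            if (attempts < 1) {
                throw new IllegalArgumentException("attempts must be >= 1");
            }
            // Each layer repeats everything beneath it, so the factors multiply.
            calls = Math.multiplyExact(calls, (long) attempts);
        }
        return calls;
    }
}
```

With the original setting, `worstCaseCalls(100, 3)` is 300. With the "safe" change, `worstCaseCalls(100, 10)` is 1,000. If a gateway in front also allowed 3 attempts, `worstCaseCalls(100, 3, 10)` would reach 3,000 Fraud Service calls for the same 100 customers.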

⸻

The propagation chain

Fraud latency
↓
Retry amplification
↓
Thread saturation
↓
Kafka consumer lag
↓
HikariCP exhaustion
↓
Authorization failures

This is why retries can become traffic generators.

⸻

Kafka consumer lag was probably the first warning

Many teams watch:

CPU
memory
pod count

But Kafka consumer lag often moves first.

Example, from the consumer's own client metrics:

records-lag-max

Prometheus alert:

- alert: HighConsumerLag
  expr: kafka_consumergroup_lag > 1000
  for: 2m

Consumer lag frequently appears before users experience failures.
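
Lag itself is simple arithmetic: for each partition, the log end offset minus the consumer group's committed offset. The sketch below shows only that calculation on plain maps. `ConsumerLag` and its methods are hypothetical helpers, in a real system the offsets would come from the brokers, and treating a partition with no committed offset as unread from offset 0 is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: lag per partition = log end offset - committed offset.
final class ConsumerLag {

    static Map<Integer, Long> lagByPartition(Map<Integer, Long> logEndOffsets,
                                             Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> lag = new HashMap<>();
        for (Map.Entry<Integer, Long> end : logEndOffsets.entrySet()) {
            // Assumption: no committed offset yet means nothing has been consumed.
            long committed = committedOffsets.getOrDefault(end.getKey(), 0L);
            // Clamp at zero in case the two snapshots were taken at different moments.
            lag.put(end.getKey(), Math.max(0L, end.getValue() - committed));
        }
        return lag;
    }

    /** Sum across partitions, comparable to a per-group alert threshold. */
    static long totalLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

Watching how fast this number grows is often more useful than its absolute value: a steady climb means the consumers are falling behind the producers, even while every pod reports healthy.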

⸻

Add timeout boundaries

Retries without timeout boundaries become dangerous.

Resilience4j:

resilience4j:
  timelimiter:
    instances:
      fraudService:
        timeoutDuration: 500ms
  retry:
    instances:
      fraudService:
        maxRetryAttempts: 3

Retries should stop.

Not multiply indefinitely.
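
To make the boundary concrete, here is a JDK-only sketch of the same idea: every attempt gets a deadline, and the number of attempts is capped. It is not how Resilience4j implements it. `BoundedRetry`, its method names, and the 100ms wait used in the note are assumptions for illustration.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: each attempt has a deadline, and attempts are capped.
final class BoundedRetry {

    static <T> T callWithTimeout(Supplier<CompletableFuture<T>> call,
                                 int maxAttempts, Duration timeout, Duration wait) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        CompletionException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // orTimeout (Java 9+) fails the future with TimeoutException at the deadline.
                return call.get()
                        .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
                        .join();
            } catch (CompletionException e) {
                last = e; // timed out, or the future failed
            }
            if (attempt < maxAttempts) {
                sleep(wait);
            }
        }
        throw last; // attempts exhausted: give up instead of retrying forever
    }

    /** Upper bound on how long one request can hold its calling thread. */
    static Duration worstCaseHold(int maxAttempts, Duration timeout, Duration wait) {
        return timeout.multipliedBy(maxAttempts).plus(wait.multipliedBy(maxAttempts - 1L));
    }

    private static void sleep(Duration wait) {
        try {
            Thread.sleep(wait.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new CompletionException(e);
        }
    }
}
```

With 3 attempts, a 500ms timeout, and an assumed 100ms wait, `worstCaseHold` gives 1.7 seconds per request. Without a timeout, 10 attempts against a 4s dependency can hold a thread for roughly 40 seconds. Note that the timeout only releases the caller; it does not cancel work already running downstream.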

⸻

Add bulkheads

Separate downstream resource pools:

resilience4j:
  thread-pool-bulkhead:
    instances:
      fraudService:
        coreThreadPoolSize: 5
        maxThreadPoolSize: 10

Now Fraud Service degradation can exhaust only its own pool, not every thread in Payment Service.
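
The mechanism behind a thread-pool bulkhead can be sketched with the JDK alone: a dedicated pool with a bounded queue that rejects work once it is full, instead of borrowing threads from everything else. This is an illustration of the idea, not Resilience4j's implementation. `FraudBulkhead`, the queue capacity parameter, and the 30 second idle timeout are assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a thread-pool bulkhead built from the JDK only.
final class FraudBulkhead {

    /** Dedicated pool: at most maxThreads running plus queueCapacity waiting. */
    static ThreadPoolExecutor create(int coreThreads, int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                coreThreads,
                maxThreads,
                30, TimeUnit.SECONDS,                     // idle timeout for threads above core
                new ArrayBlockingQueue<>(queueCapacity),  // bounded queue, never unbounded
                new ThreadPoolExecutor.AbortPolicy());    // reject instead of blocking the caller
    }
}
```

Once the pool and queue are full, `execute` throws `RejectedExecutionException` immediately. That fast failure is the point: callers get a quick error they can handle, and the slow dependency never holds more than its fixed share of threads.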

⸻

Add replay-safe idempotency

Retries + Kafka replay can create duplicate transactions.

Redis protection:

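// Claim the transaction id atomically; only the first delivery succeeds.
String key = "txn:" + event.getTransactionId();
Boolean first = redisTemplate
    .opsForValue()
    .setIfAbsent(key, "1", Duration.ofHours(24));

// FALSE means the key already existed, so this is a retry or a Kafka replay.
if (Boolean.FALSE.equals(first)) {
    return;
}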

Without idempotency:

duplicate ledger updates become possible.

In payment systems that becomes expensive.
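
For reasoning about replays without a Redis instance, the same claim-then-apply check can be mimicked in memory. In this minimal sketch `ConcurrentHashMap.putIfAbsent` stands in for the Redis call, there is no 24 hour expiry, and `IdempotentLedger` with its single balance is a hypothetical stand-in for a real ledger.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory stand-in for the Redis check, for reasoning about replays.
final class IdempotentLedger {
    private final ConcurrentMap<String, Boolean> seen = new ConcurrentHashMap<>();
    private final AtomicLong balanceCents = new AtomicLong();

    /** Applies the amount once per transaction id; returns false for a duplicate delivery. */
    boolean apply(String transactionId, long amountCents) {
        // putIfAbsent is atomic: only the first caller for a key gets null back.
        if (seen.putIfAbsent("txn:" + transactionId, Boolean.TRUE) != null) {
            return false;
        }
        balanceCents.addAndGet(amountCents);
        return true;
    }

    long balanceCents() {
        return balanceCents.get();
    }
}
```

Delivering the same transaction twice changes the balance once. One caveat of claiming the key before applying the update: if the process dies between the two steps, the replay is skipped too, so the claim and the ledger write ideally belong to the same unit of work.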

⸻

Final takeaway

Retries still matter.

They’re useful.

But retries are not just recovery mechanisms.

They’re traffic generators.

When systems degrade, retries create additional work.

Additional work creates pressure.

Pressure creates propagation.

And propagation creates transaction failures.

The tricky part?

Kubernetes may never notice.
