We Trusted Auto-Ack. The Queue Agreed. Our Costs Didn't.

Most async bugs announce themselves. This one didn't.

No failed jobs. No customer complaints. No error logs. Just infrastructure costs climbing steadily with no obvious cause. It took correlating message IDs across logs to finally see it: the same message was being processed two, sometimes three times.

The culprit was a race condition hiding inside an acknowledgment pattern.

What Happened

A consumer picked up a message and started doing work. That work took time. Before it finished, the queue's retry timeout fired, assumed failure, and redelivered the message to a second consumer. Now two workers were doing identical work concurrently, both completing successfully, both silently doubling the cost.

The system looked healthy by every normal metric. It just wasn't.
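
To put rough numbers on that, here is a minimal sketch of the redelivery arithmetic. It assumes a simplified model that is not taken from any particular broker: the queue redelivers each time the timeout expires, a free consumer is always available, and redeliveries stop once the first worker finishes. The name `expected_deliveries` is made up for illustration.

```python
import math


def expected_deliveries(processing_seconds: float, redelivery_timeout_seconds: float) -> int:
    """Count deliveries of one message under the simplified model above."""
    if processing_seconds <= 0 or redelivery_timeout_seconds <= 0:
        raise ValueError("durations must be positive")
    # One initial delivery, plus one redelivery for every timeout that
    # expires strictly before the first worker finishes.
    redeliveries = math.ceil(processing_seconds / redelivery_timeout_seconds) - 1
    return 1 + redeliveries
```

Under this model, with a 30-second timeout, a 45-second job is delivered twice and a 70-second job three times. A job that finishes in 20 seconds is delivered once, which is why the problem only appears when workers run slow.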

The Fix

One configuration change, plus an explicit ack once the work is done.

Python

# The problem: auto_ack=True settles the message on delivery, before any work runs
channel.basic_consume(queue='jobs', on_message_callback=process, auto_ack=True)

# The fix: ack explicitly, and only after the work has finished
def process(ch, method, properties, body):
    do_the_work(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='jobs', on_message_callback=process, auto_ack=False)

Java (Spring AMQP)

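import java.io.IOException;
import com.rabbitmq.client.Channel;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.support.AmqpHeaders;
import org.springframework.messaging.handler.annotation.Header;

// The problem: ackMode NONE is Spring AMQP's name for broker-side auto-ack
// (its AUTO mode acks only after the listener returns normally)
@RabbitListener(queues = "jobs", ackMode = "NONE")
public void process(String message) {
    doTheWork(message);
}

// The fix: ack explicitly once the work has finished
@RabbitListener(queues = "jobs", ackMode = "MANUAL")
public void process(String message, Channel channel, @Header(AmqpHeaders.DELIVERY_TAG) long tag)
        throws IOException {
    doTheWork(message);
    channel.basicAck(tag, false); // false = ack this delivery only, not all earlier ones
}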

Acknowledge after the work completes, not when the message arrives.
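
Acking after completion raises the question of what happens when the work fails. The sketch below is one way to handle it, not the only one: it acks on success and rejects explicitly on failure. `make_callback` and `requeue_on_failure` are names invented for this example. `basic_ack` and `basic_nack` are the pika channel methods, called here with keyword arguments.

```python
import logging

logger = logging.getLogger("worker")


def make_callback(do_the_work, requeue_on_failure=False):
    """Build a pika-style on_message_callback that acks only after success."""

    def process(ch, method, properties, body):
        try:
            do_the_work(body)
        except Exception:
            logger.exception("work failed for delivery_tag=%s", method.delivery_tag)
            # Reject explicitly so the message does not sit unacked
            # until the channel closes.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=requeue_on_failure)
            return
        # Reached only when do_the_work returned without raising.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    return process


# Assumed wiring, with `channel` and `do_the_work` defined elsewhere:
# channel.basic_consume(queue='jobs',
#                       on_message_callback=make_callback(do_the_work),
#                       auto_ack=False)
```

Defaulting `requeue_on_failure` to `False` is a design choice: it avoids a hot loop on a message that always fails, but it means a failed message is dropped unless a dead-letter exchange is configured.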

The Real Blindspots

This pattern can show up in any async system that redelivers unacknowledged work. Three things hide it.

Auto-ack tells the queue you are done before you are. With auto-ack enabled, the message counts as handled the moment your consumer receives it, so the acknowledgment says nothing about whether the work finished. Once acknowledgment and completion drift apart, a slow worker and a redelivery can overlap: the work outlasts the timeout, the message is treated as failed, and a second consumer picks it up and starts the same work. Both complete. Both look successful. Neither knows about the other.

Manual acknowledgment closes this gap. The queue does not consider the message done until your code explicitly says so, after the work is genuinely finished.

Timeout values set for ideal conditions. When your worker runs slow due to load, cold start, or external API lag, the queue retries before you finish. Even with manual ack, if your visibility timeout is shorter than your worst-case processing time, you will see the same duplicate behavior. Your timeout needs to reflect worst-case latency, not average.
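
As a rough illustration of the gap between average and worst case, the sketch below checks a configured timeout against observed processing durations. The function name, the sample durations, and the 1.5x safety factor are all assumptions for the example, not recommendations from any broker's documentation.

```python
def check_timeout(observed_seconds, timeout_seconds, safety_factor=1.5):
    """Return (ok, required_seconds) for a timeout, judged against the worst case.

    required_seconds is the slowest observed duration times safety_factor.
    """
    if not observed_seconds:
        raise ValueError("need at least one observed duration")
    worst = max(observed_seconds)
    required = worst * safety_factor
    return timeout_seconds >= required, required


# Hypothetical sample: three fast runs and one slow run behind a lagging API.
durations = [4.0, 5.0, 6.0, 48.0]
ok, required = check_timeout(durations, timeout_seconds=30.0)
```

For these durations the mean is 15.75 seconds, so a 30-second timeout looks generous. The check still fails, because the worst case times the safety factor is 72 seconds.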

Idempotency masking the problem. If duplicate work produces the same result, nothing breaks visibly. No errors, no data corruption, just silent duplicate calls. The cost climbs and nothing alerts you. This is exactly why the bug survived as long as it did.
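
One way to stop idempotency from hiding duplicates is to count deliveries per message ID inside the worker and report anything seen more than once. This is a minimal in-memory sketch. `DeliveryCounter` is a hypothetical helper, it only sees messages handled by a single process, and its memory grows without bound, so a real deployment would need a shared store with expiry.

```python
from collections import Counter


class DeliveryCounter:
    """Counts how many times each message ID has been handled by this process."""

    def __init__(self):
        self._counts = Counter()

    def observe(self, message_id):
        """Record one handling of message_id and return its running count."""
        self._counts[message_id] += 1
        return self._counts[message_id]

    def duplicate_ids(self):
        """Return the sorted message IDs that were handled more than once."""
        return sorted(mid for mid, n in self._counts.items() if n > 1)
```

A worker could call `observe` at the start of each job and emit a warning or a metric whenever the returned count exceeds 1. The work stays idempotent, but the duplicates are no longer silent.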

The Checklist

Before shipping any async worker:

Manual acknowledgment only. Ack after completion, never on receipt.
Timeout values account for worst-case latency, not average.
Every message has a correlation ID traceable across all consumers.
Worker operations are idempotent and safe to run twice.
You are monitoring work volume, not just queue depth.
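
The correlation-ID and work-volume items can be combined into a single check over worker logs. The sketch below assumes a made-up log format, `worker=<name> message_id=<id> status=done`. Both the format and the function names are illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical completion-line format; adjust the pattern to your own logs.
LINE = re.compile(r"worker=(?P<worker>\S+) message_id=(?P<message_id>\S+) status=done")


def workers_by_message(log_lines):
    """Map each message ID to the workers that logged completing it."""
    seen = defaultdict(list)
    for line in log_lines:
        match = LINE.search(line)
        if match:
            seen[match["message_id"]].append(match["worker"])
    return dict(seen)


def amplification(seen):
    """Completions per distinct message; 1.0 means every message ran once."""
    total = sum(len(workers) for workers in seen.values())
    return total / len(seen) if seen else 1.0
```

Queue depth can stay flat while this ratio climbs. Alerting when it rises above 1.0 catches duplicate processing that produces no errors.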

The Learning

The queue delivered the message successfully. That is not the same as the work being done once.
