Long-running workflows look simple when you first build them.
Something happens.
A few systems exchange data.
Everything completes.
Done.
At least that's the expectation.
Reality is very different.
The biggest thing I underestimated was time.
Not execution time.
Elapsed time.
Because once workflows start running for hours, days, or continuously, strange things start happening.
- APIs become temporarily unavailable
- Data changes halfway through the process
- Retries arrive much later than expected
- Someone manually updates a record
- Another system processes things in a different order
Nothing is broken.
But everything is slightly different from when the workflow started.
Early on, I assumed workflows were transactions.
Start.
Execute.
Finish.
Now I think of them as conversations between systems.
And conversations can get interrupted.
Another thing I underestimated:
State changes.
You might start processing an order that is "pending".
Ten minutes later, another system marks it as "cancelled".
An hour later, a retry comes in from an earlier step.
If your workflow only looks at the data it was handed at the start, and never re-checks the current state, weird things happen.
Because the world has changed while the process was still running.
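One way to picture the order example is a step that re-reads the current status before it acts, instead of trusting what the order looked like when the workflow began. This is a minimal sketch, not a real implementation: the `Order` class, the in-memory `ORDERS` store, and the `ship_order` step are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    status: str  # e.g. "pending", "cancelled", "shipped"


# Hypothetical in-memory store standing in for the real system of record.
ORDERS: dict[str, Order] = {}


def ship_order(order_id: str) -> str:
    """Run the shipping step only if the order is still in the state it expects."""
    order = ORDERS[order_id]  # re-read current state; do not trust the original payload
    if order.status != "pending":
        return f"skipped: order is {order.status}"
    order.status = "shipped"
    return "shipped"


ORDERS["o-1"] = Order("o-1", "pending")
ORDERS["o-1"].status = "cancelled"  # another system cancels the order mid-workflow
print(ship_order("o-1"))            # the late retry is skipped instead of shipping
```

The guard also makes the step safe to retry: a second call on an already shipped order is skipped rather than shipped twice.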
Long-running workflows also expose assumptions you didn't know you made.
Like:
- this API will always respond quickly
- data will arrive in order
- users won't modify records manually
- retries will happen immediately
Those assumptions survive in testing.
Production removes them quickly.
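The ordering and retry assumptions can be made explicit in code. The sketch below assumes each event carries a per-entity sequence number starting at 1, which is an assumption for illustration and not something every system provides. `apply_event`, `last_applied`, and `state` are hypothetical names.

```python
# Hypothetical record of the highest sequence number applied per entity.
last_applied: dict[str, int] = {}
state: dict[str, str] = {}


def apply_event(entity_id: str, seq: int, new_status: str) -> bool:
    """Apply an event only if it is newer than what has already been applied."""
    if seq <= last_applied.get(entity_id, 0):
        return False  # stale or duplicate: a late retry or an out-of-order delivery
    state[entity_id] = new_status
    last_applied[entity_id] = seq
    return True


apply_event("o-1", 1, "pending")
apply_event("o-1", 3, "cancelled")
apply_event("o-1", 2, "paid")  # arrives late and is ignored
print(state["o-1"])            # cancelled
```

Dropping stale events is only one possible policy. Some workflows need to buffer and reorder instead, so treat this as one option rather than the answer.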
One thing that changed how I build these systems:
I stopped asking:
"Will this workflow finish?"
And started asking:
"What state will the world be in when it finishes?"
Because those are two very different questions.
Most problems in long-running systems aren't caused by one big failure.
They're caused by lots of small changes happening while the workflow is still alive.
And if you don't account for that, eventually the workflow finishes successfully and still produces the wrong outcome.
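A common guard against "finished successfully, wrong outcome" is to remember which version of a record the workflow started from and refuse to write the result if the record has moved on since. Here is a minimal sketch, assuming every write bumps a version number. The `records` store, `read`, `commit`, and `StaleStateError` are hypothetical names, and a real store would need the check and the write to happen atomically.

```python
class StaleStateError(Exception):
    """Raised when the record changed while the workflow was running."""


# Hypothetical store: record id -> (version, data). Every write bumps the version.
records: dict[str, tuple[int, dict]] = {"o-1": (1, {"status": "pending"})}


def read(record_id: str) -> tuple[int, dict]:
    return records[record_id]


def commit(record_id: str, expected_version: int, new_data: dict) -> None:
    """Write the result only if nobody changed the record since we read it."""
    current_version, _ = records[record_id]
    if current_version != expected_version:
        raise StaleStateError(
            f"{record_id} is at version {current_version}, expected {expected_version}"
        )
    records[record_id] = (current_version + 1, new_data)


version, data = read("o-1")                    # workflow starts here
records["o-1"] = (2, {"status": "cancelled"})  # someone updates the record manually
try:
    commit("o-1", version, {"status": "shipped"})
except StaleStateError as exc:
    print(f"not committing: {exc}")
```

What to do after a rejected commit (re-read and retry, compensate, or hand off to a person) depends on the workflow. The point is that the workflow finds out instead of silently overwriting.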
This is something we think about constantly at BrainPack while operating workflows that span multiple systems and AI layers. Long-running processes are less about moving data and more about managing changing state over time.