Every failed microservices adoption I have seen made the same mistake: treating microservices as an infrastructure pattern instead of an organisational one. The technology is the easy part. The hard part is everything else.
The microservices conversation in most engineering teams goes something
like this. The monolith is getting unwieldy. Deployments are slow.
The codebase is hard to navigate. A senior engineer proposes breaking
things apart into services. The team agrees. They spend six months
doing it. Things get worse.
The services are too small or too large. Nobody agrees on where the
boundaries should be. Simple features now require coordinating changes
across three repositories. A bug that used to take twenty minutes to
debug now takes two hours because it crosses service boundaries.
Deployments are more frequent but individually more fragile. The team
is working harder than before and moving slower than before.
They blame the technology. The technology is not the problem.
Microservices failed for them for the same reason they fail for most
teams that adopt them: the team treated decomposition as a technical
decision and ignored the organisational problem that microservices
were designed to solve. You cannot separate the architecture
from the team structure. Conway's Law is not optional.

What microservices were actually invented for

Amazon is the origin story most people know. The mandate from Jeff
Bezos in the early 2000s: all teams will expose their data and
functionality through service interfaces. All teams will communicate
through these interfaces. No other form of interprocess communication
is allowed. Anyone who doesn't do this will be fired. He was serious.
The problem Bezos was solving was not technical. Amazon's engineering
organisation had grown to a size where teams were deeply entangled with
each other. Team A could not deploy without coordinating with Team B,
which needed sign-off from Team C, which had a dependency on Team D.
Every change required a synchronisation meeting. Every release was a
negotiation. The coupling between teams was strangling the organisation's
ability to move.
Services were the solution because they forced a contract between teams.
If Team A owns Service A, and Team B owns Service B, and they communicate
only through a defined API, then Team A can change anything inside
Service A without asking Team B's permission. Team B can deploy Service B
on its own schedule. The organisational autonomy is enforced by the
technical boundary.
This is the thing that most microservices adoptions miss entirely.
Services are not primarily a way to scale technology. They are a way
to scale teams. The technical properties of services (independent
deployment, technology flexibility, fault isolation) are valuable
side effects of the organisational property (team autonomy).
If you adopt microservices without the organisational changes that
make them valuable, you get all the costs of distributed systems
and none of the benefits.
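
The contract idea above can be shown in a few lines. This is a minimal sketch, not anyone's real API: `InventoryApi`, `units_in_stock`, and both implementations are hypothetical names invented for illustration. Team B's code depends only on the contract, so Team A can replace the internals without asking.

```python
from typing import Protocol


class InventoryApi(Protocol):
    """The contract Team B codes against (hypothetical name and method)."""

    def units_in_stock(self, sku: str) -> int: ...


class DictBackedInventory:
    """Team A's first implementation."""

    def __init__(self, stock: dict) -> None:
        self._stock = stock

    def units_in_stock(self, sku: str) -> int:
        return self._stock.get(sku, 0)


class LedgerBackedInventory:
    """Team A's rewrite: different internals, same contract."""

    def __init__(self, movements: list) -> None:
        self._movements = movements  # list of (sku, quantity_delta) pairs

    def units_in_stock(self, sku: str) -> int:
        return sum(delta for s, delta in self._movements if s == sku)


def can_fulfil(inventory: InventoryApi, sku: str, quantity: int) -> bool:
    """Team B's code. It does not change when Team A swaps implementations."""
    return inventory.units_in_stock(sku) >= quantity
```

In a real system the contract would be an HTTP or RPC API rather than a Python protocol, but the property is the same: the consumer sees the interface and nothing behind it.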

The cost of distribution

A monolith, for all its problems, has properties that distributed
systems do not have and cannot have.
A function call inside a monolith is reliable. Either it returns or
it throws an exception. It is fast, completing in microseconds or less.
It can participate in the same database transaction as the code that
called it, so if the whole operation needs to be rolled back, it is.
A network call between services is unreliable. It might succeed.
It might fail. It might succeed on the server side and fail on the
network before the response reaches the caller. It might time out,
leaving you with no information about whether the remote operation
completed. It is slow relative to a function call. It crosses
a transaction boundary, which means if something fails after the
call succeeded, you have a consistency problem that cannot be
resolved by a rollback.
This is not an implementation detail to be engineered around. It is
a fundamental property of distributed systems, described precisely
in the fallacies of distributed computing that L Peter Deutsch and
his colleagues at Sun Microsystems compiled in the 1990s and that
the industry has been rediscovering ever since.
The network is not reliable. Latency is not zero. Bandwidth is not
infinite. The network is not secure. Topology changes. There is not
one administrator. Transport cost is not zero. The network is not
homogeneous.
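
The "succeeded on the server, failed on the way back" case is worth seeing concretely. The sketch below simulates it in-process, so everything here is an assumption made for illustration: `FakePaymentService`, `LostResponse`, and `charge_with_retry` are hypothetical names, and no real network is involved. It shows one common way to live with the ambiguity, an idempotency key, which makes a retry safe without ever telling the caller what happened to the first attempt.

```python
import uuid


class LostResponse(Exception):
    """The request may or may not have been applied; the caller cannot tell."""


class FakePaymentService:
    """Stand-in for a remote service (hypothetical, for illustration only)."""

    def __init__(self, drop_first_n_responses: int) -> None:
        self.charges = {}  # idempotency key -> amount
        self._drops_left = drop_first_n_responses

    def charge(self, idempotency_key: str, amount: int) -> str:
        # Apply the charge at most once per key, even if the caller retries.
        self.charges.setdefault(idempotency_key, amount)
        if self._drops_left > 0:
            # The work succeeded on the server, but the response is lost.
            self._drops_left -= 1
            raise LostResponse()
        return "charged"


def charge_with_retry(service: FakePaymentService, amount: int, attempts: int = 3) -> str:
    key = str(uuid.uuid4())  # one key shared by every attempt of this operation
    for _ in range(attempts):
        try:
            return service.charge(key, amount)
        except LostResponse:
            continue  # outcome unknown; retrying is safe only because of the key
    raise RuntimeError("gave up; outcome still unknown")


service = FakePaymentService(drop_first_n_responses=2)
result = charge_with_retry(service, amount=500)
total_charged = sum(service.charges.values())
```

Without the key, the same three attempts would have charged the customer three times. Note what the technique does not do: if every attempt fails, the caller still does not know whether the charge went through. The ambiguity is managed, not removed.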
Every service boundary you add to your system is a place where these
fallacies apply. Every service call is an opportunity for latency,
failure, and consistency problems that simply do not exist inside
a monolith. The question is whether the organisational benefits of
the boundary justify the distributed systems cost of maintaining it.
For a team of eight people working on one product, they almost never do.
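
The arithmetic behind that cost is simple enough to run. This sketch assumes independent failures, sequential calls, and illustrative figures (99.9% availability and 5 ms per hop) that are my assumptions, not measurements from any system.

```python
def chain_availability(per_service: float, hops: int) -> float:
    """Availability of a request that needs all `hops` services to be up,
    assuming their failures are independent."""
    return per_service ** hops


def chain_latency_ms(per_hop_ms: float, hops: int) -> float:
    """Network latency added when the hops are made one after another."""
    return per_hop_ms * hops


for hops in (1, 3, 10):
    print(hops, chain_availability(0.999, hops), chain_latency_ms(5.0, hops))
```

Under those assumptions, a request that crosses ten services is available about 99.0% of the time instead of 99.9%, which is the difference between roughly 8.8 hours and roughly 87 hours of unavailability a year, before any service has done anything wrong individually.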

Where the boundaries actually belong

The most common failure mode in microservices adoption is drawing
service boundaries around technical concerns rather than business
ones. Teams create an "auth service," a "notification service,"
a "user service," a "payment service." These feel like natural
decompositions because they map to recognisable technical concepts.
They are terrible service boundaries.
An auth service that every other service must call to validate a
token is not a service. It is a shared library that has been deployed
as infrastructure, adding network latency and a failure mode to every
authenticated request in the system. If the auth service is slow,
everything is slow. If the auth service is down, everything is down.
You have taken a piece of logic that could live as a function call
and made it a distributed systems problem.
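
To make "could live as a function call" concrete, here is a minimal sketch of in-process token validation using only the standard library. The token format, the `SECRET` value, and the function names are assumptions made for this example, and it is deliberately simpler than a real scheme such as JWT with asymmetric keys.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"  # assumption: in practice this comes from configuration


def sign_token(claims: dict, secret: bytes = SECRET) -> str:
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token: str, secret: bytes = SECRET, now: Optional[float] = None) -> dict:
    """Validate a token in-process. No network call, no remote failure mode."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] < (time.time() if now is None else now):
        raise PermissionError("token expired")
    return claims


token = sign_token({"sub": "customer-1", "exp": 2000})
claims = verify_token(token, now=1000)
```

The trade-off is real: a token verified locally cannot be revoked instantly without some shared state, so short expiry times usually carry that weight. That is a design decision to make deliberately, not a reason to put a network hop in front of every request.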
A notification service is not a service. It is a collection of side
effects that have been externalised, so the service that wants to
send an email must make a network call, handle the failure case, and
decide what to do if the notification service is unavailable at the
moment the email needs to be sent.
The boundaries that work are the ones that map to bounded contexts
in the business domain. Not "the thing that handles auth" but "the
thing that owns everything about how customers interact with our
platform." Not "the thing that sends notifications" but "the thing
that owns the customer communication history and all the rules about
when and how to communicate."
These boundaries are harder to identify. They require understanding
the business deeply enough to know where the real seams are. They
require conversations with product managers and domain experts, not
just with engineers. They change as the business evolves. But they
are the boundaries that, when you respect them, produce services
that teams can own autonomously and evolve independently.
Domain-Driven Design's concept of bounded contexts is the clearest
framework for finding these boundaries. The bounded context defines
the scope within which a particular domain model applies. At the
edge of the bounded context, the model changes. That is where the
service boundary belongs.
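
A small sketch of what "the model changes at the edge" looks like in code. The two contexts, their fields, and the `CustomerRegistered` event are hypothetical examples chosen for illustration. The point is that "customer" means something different on each side, and only a narrow published event crosses the boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerRegistered:
    """The published contract between contexts (hypothetical event)."""
    customer_id: str
    email: str


@dataclass(frozen=True)
class BillingCustomer:
    """Billing context: a customer is someone who pays invoices."""
    customer_id: str
    payment_terms_days: int


@dataclass(frozen=True)
class SupportCustomer:
    """Support context: a customer is someone we hold conversations with."""
    customer_id: str
    preferred_channel: str


def billing_view(event: CustomerRegistered) -> BillingCustomer:
    # Billing translates the event into its own model; 30 days is an assumed default.
    return BillingCustomer(event.customer_id, payment_terms_days=30)


def support_view(event: CustomerRegistered) -> SupportCustomer:
    # Support keeps only what it needs and names it in its own terms.
    return SupportCustomer(event.customer_id, preferred_channel="email")
```

Neither context imports the other's model. If billing later adds credit limits or support adds escalation tiers, the other side never notices, which is the property a service boundary is supposed to buy.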
# A service boundary drawn around a technical concern.
# Every other service calls this. Auth is now a distributed dependency.
#
# Bad:
class AuthService:
    async def validate_token(self, token: str) -> User:
        ...

    async def create_token(self, user_id: str) -> str:
        ...

    async def revoke_token(self, token: str) -> None:
        ...

# A service boundary drawn around a business capability.
# This service owns everything about an order, including its auth context.
# Other services don't call into it for auth. They communicate
# through events when they need to know something happened.
#
# Better:
class OrderService:
    async def place_order(self, customer_id: str, items: list) -> Order:
        # Auth context is resolved here, not farmed out to a network call
        customer = await self.customer_repository.get(customer_id)
        if not customer.can_place_orders():
            raise InsufficientPermissionsError()
        ...

    async def cancel_order(self, order_id: str, requesting_customer_id: str) -> None:
        order = await self.order_repository.get(order_id)
        if order.customer_id != requesting_customer_id:
            raise InsufficientPermissionsError()
        ...

Conway's Law is a constraint, not a suggestion

Mel Conway observed in 1967 that organisations produce systems that
mirror their communication structures. A team with three groups will
produce a system with three components. This is not because they
planned to. It is because the system reflects who talks to whom.
The implication that most teams don't fully absorb: if you want a
particular system architecture, you need the corresponding
organisational structure. You cannot have a microservices architecture
with a team structure designed for a monolith. The organisation will
fight the architecture until one of them wins, and the organisation
usually wins because it existed first.
This is why Amazon's microservices worked. The service boundaries
and the team boundaries were the same boundaries. Team A owns Service A.
Not "Team A and Team B both contribute to Service A." Not "Service A
is maintained by whoever has time." One team, one service, full ownership.
The organisational autonomy and the technical autonomy were the same thing.
Most microservices adoptions separate these. The same team that used
to work on the monolith now works on six services. They have all the
coordination overhead of distributed systems and none of the team
autonomy that makes it worth it. They still talk to each other constantly
because they're the same people. The service boundaries don't reflect
team boundaries because there are no team boundaries. There is one team
doing distributed systems for no organisational reason.
The inverse Conway maneuver, a term coined by Thoughtworks, is the
deliberate version: you design the team structure you want, then
let the architecture follow from it. If you want a payments service
that can be developed and deployed independently, you need a payments
team that can make decisions and ship code independently. If you do
not have or cannot create that team, you do not have the prerequisite
for the payments service.
The prerequisite check before splitting a service:

Who will own this service?
    "The backend team" is not an answer.
    A named, stable, small team is an answer.

Can that team deploy the service without coordinating with
other teams?
    If not, the boundary is wrong or the ownership is wrong.

Can that team change the service's internal implementation
without changing any other service?
    If not, the boundary is wrong.

Is there a defined contract (API, event schema) between this
service and its consumers?
    If not, you don't have a service. You have a distributed module.

Does the team have enough context about the business domain
this service represents to make good decisions autonomously?
    If not, the team needs to exist and stabilise before the service
    should be extracted.

If any of these answers is no, the split is premature.
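
If it helps to make the check hard to skip, it can be written down as data. This is a sketch only: the `SplitProposal` fields and the `blockers` function are hypothetical names that restate the questions above, and the honest answers still have to come from people.

```python
from dataclasses import dataclass


@dataclass
class SplitProposal:
    service: str
    owning_team: str  # a named team, e.g. "payments", not "the backend team"
    team_is_named_and_stable: bool
    deploys_without_coordination: bool
    internals_change_in_isolation: bool
    has_defined_contract: bool
    team_knows_the_domain: bool


def blockers(p: SplitProposal) -> list:
    """Return the unmet prerequisites. An empty list means none were found."""
    checks = [
        (p.team_is_named_and_stable, "no named, stable owning team"),
        (p.deploys_without_coordination, "deploys need cross-team coordination"),
        (p.internals_change_in_isolation, "internal changes leak into other services"),
        (p.has_defined_contract, "no defined API or event schema"),
        (p.team_knows_the_domain, "team lacks domain context"),
    ]
    return [reason for ok, reason in checks if not ok]


proposal = SplitProposal(
    service="payments",
    owning_team="payments",
    team_is_named_and_stable=True,
    deploys_without_coordination=False,
    internals_change_in_isolation=True,
    has_defined_contract=False,
    team_knows_the_domain=True,
)
found = blockers(proposal)
```

Any non-empty result means the split is premature, and the list says what to fix first.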

The operational surface nobody accounts for

When a team decides to split their monolith into ten services, they
usually have a plan for the technical decomposition. They rarely have
a plan for what they are about to own operationally.
A monolith has one deployment pipeline. One set of infrastructure
to configure. One place to look at logs. One set of metrics. One
runbook for when things go wrong. The operational complexity is low.
Ten services have ten deployment pipelines. Ten infrastructure
configurations. Log aggregation that spans services. Distributed
tracing to follow a request through multiple services. Ten runbooks,
except the incidents that matter will involve multiple services and
none of the runbooks will cover that. Service discovery. Health
checking at the inter-service level. Circuit breakers for when
downstream services are degraded.
None of this complexity is impossible to manage. It is all solvable.
But it requires a team that has the capacity to manage it, tools
that have been set up before the split happens, and expertise that
takes time to develop.
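
As one example of what that expertise covers, here is a minimal circuit breaker. It is a sketch under simplifying assumptions: single-threaded use, consecutive-failure counting, and a fixed cool-down. The class and parameter names are mine, and a production system would more likely use a maintained library or a service mesh feature.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a downstream that is presumed unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the behaviour can be tested
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast; downstream presumed unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The breaker does not make the downstream service healthier. It stops a degraded dependency from tying up every caller while it recovers, which is exactly the kind of behaviour a monolith never needed.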
Most teams split their services and then build the operational
infrastructure retroactively, while also trying to deliver product
work, while also debugging the new distributed systems problems they
did not have before. This is where the slowdown after a split, which
can last eighteen months, comes from.
The teams that do this well build the operational infrastructure
first. They get distributed tracing working in the monolith before
they split it. They standardise their deployment pipeline before
they have ten of them. They establish logging conventions before
they have ten services emitting logs in subtly different formats.
# The operational baseline that must exist before splitting services.
# This is not optional infrastructure to add later.

# Centralised structured logging
logging:
  format: json
  fields:
    service: ${SERVICE_NAME}
    version: ${SERVICE_VERSION}
    environment: ${ENVIRONMENT}
    trace_id: ${TRACE_ID}  # Must be propagated across service calls
    span_id: ${SPAN_ID}

# Every service exposes these endpoints. No exceptions.
health_endpoints:
  liveness: /healthz   # Is the process running?
  readiness: /ready    # Is it ready to serve traffic?
  metrics: /metrics    # Prometheus metrics

# Every inter-service call propagates these headers
trace_propagation:
  headers:
    - traceparent  # W3C Trace Context
    - tracestate

# Every service has these alerts configured before it handles traffic
minimum_alerts:
  - error_rate_above_1_percent
  - p99_latency_above_1_second
  - service_unavailable
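
The logging and trace propagation entries in that baseline can be prototyped inside the monolith with the standard library alone. The sketch below is an assumption-laden illustration: `start_span` and `JsonFormatter` are hypothetical names, the `traceparent` handling is a simplified reading of the W3C format (version, 32-hex trace id, 16-hex parent id, flags) with minimal validation, and a real deployment would more likely use an OpenTelemetry SDK.

```python
import contextvars
import json
import logging
import secrets
from typing import Optional

_trace_id = contextvars.ContextVar("trace_id", default="")
_span_id = contextvars.ContextVar("span_id", default="")


def start_span(incoming_traceparent: Optional[str] = None) -> str:
    """Continue the caller's trace if a header came in, else start a new one.

    Returns the traceparent value to send on outgoing calls."""
    trace_id = None
    if incoming_traceparent:
        parts = incoming_traceparent.split("-")
        if len(parts) == 4 and len(parts[1]) == 32:
            trace_id = parts[1]
    if trace_id is None:
        trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)  # 16 hex characters
    _trace_id.set(trace_id)
    _span_id.set(span_id)
    return f"00-{trace_id}-{span_id}-01"


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the fields from the baseline."""

    def __init__(self, service: str, version: str, environment: str) -> None:
        super().__init__()
        self._static = {"service": service, "version": version, "environment": environment}

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            **self._static,
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": _trace_id.get(),
            "span_id": _span_id.get(),
        })
```

Attaching `JsonFormatter` to a handler and calling `start_span` at the top of each request handler is enough to get trace ids into every log line. Doing this while there is still only one process means the convention already exists when the second service appears.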

The monolith that should stay a monolith

Not every system should be microservices. This is easy to say and
hard to accept in an industry where microservices became the mark
of a serious engineering organisation.
The monolith that should stay a monolith is the one where:
The team is small enough that coordination overhead is low. Five
to eight engineers can coordinate in a daily standup without the
synchronisation cost becoming significant. For a team this size,
the organisational problem that microservices solve does not exist.
The domain is not yet well understood. Early-stage products have
unstable domain models. The concepts that seem fundamental change
as you learn what you're actually building. Service boundaries drawn
around an unstable domain model have to be redrawn as the domain
stabilises, which is expensive and demoralising. The monolith lets
the domain model evolve cheaply. Split when the domain is understood.
The operational team does not exist. If nobody owns the infrastructure
that a distributed system requires, the system will be operated badly.
A well-operated monolith beats a poorly operated distributed system
every time.
The internal structure can be improved without splitting. A modular
monolith with clear internal boundaries, enforced through package
structure and dependency rules, provides most of the cognitive benefits
of microservices (clear ownership, bounded contexts, interface discipline)
without the distributed systems cost. It is not a compromise. For the
right team and domain, it is the correct architecture.
# A modular monolith with enforced boundaries.
# orders/ cannot import directly from payments/.
# They communicate through defined interfaces.
# This is achievable without distributed systems.

# src/orders/service.py
from orders.repository import OrderRepository
from orders.events import OrderPlaced  # Orders emits events
# from payments.service import PaymentService  # This import is forbidden,
#                                               # enforced by linting rules
# Order, EventBus and PaymentGateway are assumed to be imported from
# modules not shown in this excerpt.


class OrderService:
    def __init__(
        self,
        repository: OrderRepository,
        event_bus: EventBus,
        payment_gateway: PaymentGateway,  # Interface, not concrete payments module
    ):
        self.repository = repository
        self.event_bus = event_bus
        self.payment_gateway = payment_gateway

    async def place_order(self, customer_id: str, items: list) -> Order:
        order = Order.create(customer_id=customer_id, items=items)
        await self.repository.save(order)
        await self.event_bus.publish(OrderPlaced(order_id=order.id))
        return order


# The payments module listens for OrderPlaced events.
# It never gets called directly by orders.
# The boundary is real. It is enforced by design, not by a network.

# src/payments/handlers.py
from orders.events import OrderPlaced  # Reading event schema is allowed
from payments.service import PaymentService


class PaymentEventHandler:
    def __init__(self, payment_service: PaymentService):
        # Injected so the handler does not construct its own dependencies
        self.payment_service = payment_service

    async def on_order_placed(self, event: OrderPlaced) -> None:
        await self.payment_service.initiate_payment(order_id=event.order_id)
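
The `EventBus` in that listing is left undefined, so here is one way it might look. This is a minimal in-process sketch, not the implementation the listing assumes: the `subscribe` method, the dispatch-by-type rule, and the redefined `OrderPlaced` are my assumptions, included so the example runs on its own.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    order_id: str


class EventBus:
    """Dispatches events to handlers registered for the event's exact type."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler) -> None:
        self._handlers[event_type].append(handler)

    async def publish(self, event) -> None:
        # Handlers run one after another, in the publisher's own task.
        for handler in self._handlers[type(event)]:
            await handler(event)


initiated = []  # stands in for the payments module's side effect


async def on_order_placed(event: OrderPlaced) -> None:
    initiated.append(event.order_id)


async def main() -> None:
    bus = EventBus()
    bus.subscribe(OrderPlaced, on_order_placed)
    await bus.publish(OrderPlaced(order_id="order-42"))


asyncio.run(main())
```

Because the handler runs in-process, a failure in it propagates straight back to the publisher as an ordinary exception. That is a stronger guarantee than any message broker offers, and it is one of the things given up when the same boundary becomes a network boundary.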
This is a real architecture that scales further than most teams
think before the overhead of splitting services becomes worth paying.
Shopify ran a version of this for years. Stack Overflow still does.
They are not unsophisticated organisations.
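
The listing's "enforced by linting rules" is the part teams tend to skip, so here is a sketch of what such a rule can look like. It uses the standard `ast` module. The `FORBIDDEN` table and function names are assumptions for illustration, and dedicated tools such as import-linter cover this more thoroughly.

```python
import ast

# Assumption: packages each module must not import from. payments may read
# orders.events (the event schema) but not the orders internals.
FORBIDDEN = {
    "orders": ("payments",),
    "payments": ("orders.service", "orders.repository"),
}


def imported_modules(source: str) -> list:
    """Collect absolute module names imported anywhere in the source text."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.append(node.module)
    return names


def violations(owning_package: str, source: str) -> list:
    """Return imports in `source` that cross a forbidden boundary."""
    banned = FORBIDDEN.get(owning_package, ())
    return [
        name
        for name in imported_modules(source)
        if any(name == b or name.startswith(b + ".") for b in banned)
    ]


bad = violations("orders", "from payments.service import PaymentService\n")
allowed = violations("payments", "from orders.events import OrderPlaced\n")
```

Run over every file in CI, a check like this fails the build when someone reaches across a boundary. That is what makes the boundary a rule rather than a convention, and it costs no network hop.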

What the good teams understand

The teams that have figured out distributed systems share a
perspective that took most of them several years and at least one
failed microservices adoption to arrive at.
Services are not about code organisation. They are about team
organisation. A service boundary that does not correspond to a team
boundary is overhead without benefit.
The overhead of distributed systems is real, permanent, and
compounding. You pay it forever. It needs to buy something worth
having. For a team that is too large to coordinate, team autonomy
is worth having. For a team that is not yet at that size, it is
not.
The correct direction of reasoning is: we have an organisational
problem, what architecture solves it? Not: we have an architecture
trend, what organisation do we need to adopt it?
Microservices adopted as a technical decision produce the costs
of distribution and the politics of boundary negotiation without
the autonomy that makes them valuable. Microservices adopted as
an organisational decision, by teams that have done the work of
defining ownership and building operational foundations, produce
systems that actually deliver what the pattern promises.
The technology has never been the hard part.
The hard part is everything the technology forces you to sort out first.
Most teams skip that part and wonder why the technology failed them.