Building Microservices: Lessons From A Decade of Distributed Systems

Real-world insights from designing, implementing, and operating microservices at scale. The mistakes, the victories, and the fundamental principles that still hold true.

Core: Microservices are deceptively simple in theory but extraordinarily complex in practice. After a decade building distributed systems, I’ve learned that architecture decisions matter far less than the governance model supporting them.

The Monolith-to-Microservices Trap

Detail: When we migrated our first large system from a monolith to microservices in 2016, we celebrated prematurely. We’d broken apart a 2 million-line codebase into 40 independent services. The internal documentation bragged about “true separation of concerns” and “independent deployments.” We’d solved the architectural problem. But the organizational problem—how teams coordinate, deploy, and operate these services—nearly broke the company.

Distributed systems trade one class of problems for another. You now have network latency, eventual consistency, deployment coordination, and cascading failure modes. The monolith had none of these problems, but it had worse ones: one team’s bad deployment broke everyone, deploys happened at 3 AM for all 200 engineers, and adding features required coordinating across six teams.

Application: Most teams jumping to microservices don’t ask the right question: “What’s our pain point—the architecture or the organization?” If your pain is deployment coordination and team scaling, microservices help. If your pain is general code quality or feature velocity, a better monolith serves you better.

Service Communication: The HTTP vs Message Queue Debate

Core: This debate consumed three years of our architectural discussions. HTTP for synchronous calls feels natural. Message queues for asynchronous work feel safer. The answer is: both, used deliberately.

Detail: We started with HTTP everywhere. Every service spoke REST to every other service. This gave us immediate visibility—you could trace a call through 7 services and understand what happened. But it gave us tightly coupled services. When the payment service degraded at 3 AM, checkout service failed immediately. No retries, no buffering, just hard failures cascading upstream.
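
For illustration, the tightly coupled version looked roughly like this (service names, endpoints, and payloads are hypothetical): checkout calls the payment service inline, so a slow or failing payment service immediately becomes a slow or failing checkout.

# Example (illustrative): synchronous HTTP coupling on the critical path

import aiohttp
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.post("/checkout")
async def checkout(cart: dict):
    async with aiohttp.ClientSession() as session:
        # No timeout, no retry, no buffering: if payment-service hangs or errors,
        # this request hangs or errors with it, and the failure cascades upstream.
        async with session.post("http://payment-service:8000/charge", json=cart) as resp:
            if resp.status != 200:
                raise HTTPException(status_code=502, detail="payment failed")
            return await resp.json()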

We migrated critical paths to message queues. Order creation sent a message to the fulfillment queue. If fulfillment was down, orders queued harmlessly. When fulfillment recovered, it processed the backlog. Much safer. But now we lost visibility. Tracing an order through 5 systems required consulting 5 different message queue logs. Debugging why an order sat in queue for 6 hours meant digging through dead-letter queues and understanding retry logic.

The solution: Use HTTP for queries and lightweight operations (< 100ms, idempotent). Use message queues for data mutations and operations that can tolerate delay. This hybrid approach—which we called “command-query responsibility segregation lite”—balanced visibility with resilience.

# Example: Service communication pattern using both HTTP and message queue

from fastapi import FastAPI
from aiokafka import AIOKafkaProducer, AIOKafkaConsumer
import aiohttp
import json
import time

app = FastAPI()
kafka_producer = None

@app.on_event("startup")
async def startup():
    global kafka_producer
    kafka_producer = AIOKafkaProducer(
        bootstrap_servers='kafka:9092',
        # aiokafka sends raw bytes by default, so serialize dict payloads to JSON
        value_serializer=lambda v: json.dumps(v).encode("utf-8")
    )
    await kafka_producer.start()

@app.on_event("shutdown")
async def shutdown():
    await kafka_producer.stop()

# Query: Fast, synchronous HTTP call for read operations
@app.get("/user/{user_id}")
async def get_user(user_id: int):
    async with aiohttp.ClientSession() as session:
        async with session.get(f"http://user-service:8000/users/{user_id}") as resp:
            return await resp.json()

# Mutation: Async message queue for write operations
@app.post("/orders")
async def create_order(order_data: dict):
    # Publish event to queue instead of waiting for response
    await kafka_producer.send_and_wait(
        "order_created",
        value={
            "user_id": order_data["user_id"],
            "items": order_data["items"],
            "timestamp": asyncio.get_event_loop().time()
        }
    )
    return {"status": "queued"}

# Consumer: Listen for events asynchronously (typically run as a separate worker)

async def consume_orders():
    consumer = AIOKafkaConsumer(
        'order_created',
        bootstrap_servers='kafka:9092',
        # Deserialize the JSON bytes published above back into a dict
        value_deserializer=lambda v: json.loads(v.decode("utf-8"))
    )
    await consumer.start()
    try:
        async for message in consumer:
            order_event = message.value
            # Process fulfillment, payment, notifications asynchronously
            print(f"Processing order for user {order_event['user_id']}")
    finally:
        await consumer.stop()

This pattern demonstrates the hybrid approach: HTTP for immediate operations where latency matters, Kafka for mutations that can queue. The consumer processes events independently, allowing the system to handle load spikes gracefully.
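
The consumer above would typically run as its own worker process rather than inside the API service; a minimal launcher (a sketch, assuming consume_orders lives in the same module) might be:

# Running the consumer as a standalone worker (illustrative)
import asyncio

if __name__ == "__main__":
    asyncio.run(consume_orders())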

Application: Your critical path (checkout, payment, order confirmation) should use HTTP for queries and message queues for mutations. Non-critical background work (emails, analytics, inventory updates) should use queues exclusively. This architecture keeps one degraded service from taking down your entire platform and paging you at 3 AM.

Database Per Service: The Consistency Cost

Core: The microservices pattern advocates “database per service”—each service owns its schema, no shared databases. This sounds elegant. In practice, it trades schema coupling for operational complexity.

Detail: We implemented this religiously. The order service owned the order_status table. The payment service owned payment_transactions. The fulfillment service owned shipment_tracking. Each team could change its schema without coordinating. Deployments became independent. This was the dream.

Then we needed a report: “Show me all orders from January with their payment method and current shipment status.” This required pulling data from three separate databases, querying three different database servers, and managing eventual consistency. What took five minutes to write as a single JOIN query now took 50 lines of orchestration code and 6 calls across services.
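
For contrast, the monolith-era version of that report was a single query against the shared database. A sketch of what it looked like (table names from above, column names hypothetical):

# Example (illustrative): the same report as one JOIN in the monolith's shared database

MONTHLY_ORDER_REPORT = """
    SELECT o.order_id,
           o.status,
           p.payment_method,
           s.status AS shipment_status
    FROM order_status o
    JOIN payment_transactions p ON p.order_id = o.order_id
    LEFT JOIN shipment_tracking s ON s.order_id = o.order_id
    WHERE o.created_at >= %(month_start)s
      AND o.created_at < %(month_end)s
"""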

Eventual consistency broke debugging. A customer reported an order stuck in fulfillment but marked as paid. In a monolith, the audit trail is a single transaction log. With separate databases, we investigated three different logs and still couldn’t find where the contradiction originated. The consistency boundary was implicit and distributed across networks and message queues.

# Example: Querying data across three separate services

import aiohttp
import asyncio
from typing import List, Dict

class OrderQuery:
    def __init__(self):
        self.order_service_url = "http://order-service:8000"
        self.payment_service_url = "http://payment-service:8000"
        self.fulfillment_service_url = "http://fulfillment-service:8000"

    async def get_order_summary(self, order_id: int) -> Dict:
        """
        Fetch order data from three separate microservices.
        This demonstrates the complexity of querying across database boundaries.
        """
        async with aiohttp.ClientSession() as session:
            # Fetch order details
            order = await self._fetch(session,
                f"{self.order_service_url}/orders/{order_id}")
            if order is None:
                # Order service timed out or errored; nothing to reconcile against
                return {"order_id": order_id, "error": "order service unavailable"}

            # Fetch payment info (might not exist yet - eventual consistency)
            payment = await self._fetch(session,
                f"{self.payment_service_url}/payments?order_id={order_id}")
            
            # Fetch fulfillment status
            fulfillment = await self._fetch(session,
                f"{self.fulfillment_service_url}/shipments?order_id={order_id}")
            
            # Reconcile potentially inconsistent data
            return self._reconcile(order, payment, fulfillment)

    async def _fetch(self, session, url: str):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=3)) as resp:
                if resp.status == 200:
                    return await resp.json()
                return None
        except asyncio.TimeoutError:
            # Service timed out; degrade gracefully by returning None
            return None

    def _reconcile(self, order: Dict, payment: List[Dict], fulfillment: List[Dict]) -> Dict:
        """
        Reconcile data from three separate databases.
        Must handle eventual consistency: payment might not exist yet,
        fulfillment might be ahead/behind payment status.
        """
        return {
            "order_id": order["id"],
            "status": order["status"],
            "amount": order["total"],
            "payment_status": payment[0]["status"] if payment else "pending",
            "shipment_status": fulfillment[0]["status"] if fulfillment else "not_started",
            # This could be inconsistent! Order marked shipped but payment still pending
            "consistency_warnings": self._check_consistency(order, payment, fulfillment)
        }

    def _check_consistency(self, order: Dict, payment: List[Dict], fulfillment: List[Dict]) -> List[str]:
        """Flag potential consistency violations"""
        warnings = []
        if fulfillment and payment:
            if fulfillment[0]["status"] == "shipped" and payment[0]["status"] != "completed":
                warnings.append("Order shipped but payment not completed")
        if not payment and order["status"] == "completed":
            warnings.append("Order marked complete but no payment record")
        return warnings

This code shows the reality: querying across services means multiple network calls, handling partial failures, and reconciling eventually consistent state. A simple report becomes operational complexity.

Application: Separate databases buy you deployment independence at the cost of transactional consistency. This trade is worth it for services that rarely coordinate (user profiles, product catalogs). It’s terrible for services that must stay consistent (orders, payments, inventory). Consider a pragmatic middle ground: a shared database for tightly coupled domains, separate databases for loosely coupled services.

Operational Lessons: Logging, Tracing, Monitoring

Core: Distributed systems require distributed observability. Without it, you’re flying blind.

Detail: We learned this painfully. An API endpoint started returning 500 errors every 10 minutes. In a monolith, we’d grep the logs. In microservices, the error could originate from any of 15 services. The 500 response came from the API gateway, but the root cause was buried in the database connection pool of a background worker service. Without cross-service tracing, we wasted 2 hours debugging.

We implemented distributed tracing (Jaeger) and never looked back. Every request gets a trace ID. That trace flows through 8 services, and each service logs with that trace ID. When an error occurs, we query traces for that user/order/request and see the entire execution path. It’s visibility that would be impossible to build manually.
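
The pipeline above centers on Jaeger; a common way to feed it is per-service instrumentation with the OpenTelemetry Python SDK exporting to a collector. A minimal sketch (the service name and collector endpoint are assumptions):

# Example (sketch): instrument a FastAPI service so every request carries a trace ID

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Identify this service in every span it emits
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
# Ship spans to the collector (see the collector configuration below)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = FastAPI()
# Auto-instrumentation creates a span per request and propagates the trace
# context on outgoing calls made with instrumented HTTP clients
FastAPIInstrumentor.instrument_app(app)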

Structured logging matters equally. Printf debugging in a distributed system means grepping across 15 servers. Structured JSON logs with trace IDs mean querying Elasticsearch with “trace_id: xyz AND level: error” and seeing the exact sequence of events.
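
One possible sketch (field names are illustrative) of injecting the active trace ID into JSON log lines:

# Example (sketch): structured JSON logs that carry the current trace ID

import json
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # trace_id is an int; render it as the 32-hex-char form Jaeger displays
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else None
        return True

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)

# Every line is now a JSON object you can query by trace_id and level
logging.getLogger("checkout").error("payment call timed out")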

# Example: Distributed tracing configuration with OpenTelemetry

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: default
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        send_batch_size: 1024
        timeout: 10s
      
      # Add resource attributes to all traces
      resource:
        attributes:
          - key: service.namespace
            value: production
            action: upsert
          - key: cluster.name
            value: us-east-1
            action: upsert

    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        # Plain gRPC to the Jaeger collector inside the cluster (no TLS)
        tls:
          insecure: true
      
      # Also send metrics to prometheus
      prometheus:
        endpoint: "0.0.0.0:8889"

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, resource]
          exporters: [jaeger]
        
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]

This configuration collects traces from all services, batches them for efficiency, adds resource context (which service, which cluster), and exports to Jaeger. Every service sends traces with consistent trace IDs. Debugging becomes “query Jaeger for this user’s requests” instead of grepping across servers.

Application: Implement distributed tracing before you have a 3 AM incident. When your payment service mysteriously processes the same order twice, you’ll want to see the exact sequence of events across 8 services. Jaeger or similar (DataDog, New Relic) isn’t optional—it’s the only way to operate microservices reliably.

The Hardest Lesson: Organizational Structure Follows Architecture

Core: Conway’s Law states that software systems mirror the communication structure of the organizations that build them. I learned this wasn’t a clever observation—it was a fundamental constraint.

When we organized teams by microservice, they became territorial. The orders service team defended their schema fiercely because changing it required coordinating with 5 other teams. Shared libraries became bloated as each team added their own requirements. The architecture we chose forced an organizational structure that made adding features slower, not faster.

We reorganized by business capability instead (checkout, fulfillment, payments). Teams still owned services but shared ownership of libraries and infrastructure. Features that spanned services required cross-team planning, which was inconvenient but rare. Most work stayed within capability boundaries. Velocity increased because we optimized for the common case, not edge cases.

Application: If your teams are siloed by service, your architecture will amplify that. Consider whether your organizational structure drives or hinders the communication your architecture requires. Sometimes the best architecture is the one that matches your existing team topology, even if it’s not architecturally perfect on paper.

Final Reflection

Twenty-five years of software engineering taught me that microservices aren’t inherently better than monoliths. They’re different tools for different constraints. The trap isn’t building microservices—it’s building them before understanding why you need them, and then spending a decade paying the operational cost.

Start monolithic. Move to microservices when one of these is true: (1) deployment independence is critical, (2) you have multiple teams working simultaneously, (3) you need to scale different components independently. Until then, a well-structured monolith beats a poorly structured microservices ecosystem every time.


Hero Image Prompt: “Distributed system architecture diagram showing multiple connected microservices (payment, orders, fulfillment, users) with message queues and HTTP connections. Central visualization showing trace flow with different colors for each service. Include monitoring dashboard overlay (Prometheus/Jaeger style graphs). Dark professional theme with navy blue (#1a1a2e) and cyan accent lines. Technical, professional, architectural visualization style.”