Scaling Databases: When Sharding, Replication, and Caching Become Necessary
Practical guide to database scaling decisions. Real benchmarks and trade-offs from scaling systems from 1K to 1B queries per day.
Core: Most scaling discussions happen at the wrong scale. Teams implement sharding at 100GB when they should wait until 10TB. They add a caching layer when the real problem is query inefficiency. Scaling decisions are about understanding your bottleneck.
The Replication Story: When One Database Becomes 10
Detail: We built our first system with a single PostgreSQL database. For three years, it handled all traffic. At 2K queries per second, the database held 200GB and response time averaged 8ms. Then we hit a wall. Each new feature that required a complex JOIN would degrade performance by 15%. We’d added 15 features that year, each slightly slower than the one before.
The moment of truth: we could either keep accepting a 2-3ms latency penalty for every new complex query, or architect for scale we hadn’t reached. We chose replication. We set up a read replica, moved all reads there, and kept writes on the primary. Response time immediately dropped to 3ms.
Then we needed more reads. We added a second replica, then a third. At six replicas, writes started lagging behind reads. Replication lag meant users would see stale data, leading to race conditions. Fixing that required application-level logic to route writes directly to the primary and handle stale reads gracefully.
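A sketch of the application-level routing this required, assuming psycopg2 and hypothetical DSNs; the require_fresh flag stands in for whatever read-your-writes signal your application uses:

```python
import psycopg2

# Hypothetical connection strings; substitute your own.
PRIMARY_DSN = "host=db-primary dbname=app"
REPLICA_DSNS = ["host=db-replica-1 dbname=app", "host=db-replica-2 dbname=app"]

class RoutedDatabase:
    """Send writes to the primary; round-robin reads across replicas."""

    def __init__(self):
        self.primary = psycopg2.connect(PRIMARY_DSN)
        self.replicas = [psycopg2.connect(dsn) for dsn in REPLICA_DSNS]
        self._next = 0

    def execute_write(self, sql, params=None):
        with self.primary.cursor() as cur:
            cur.execute(sql, params)
        self.primary.commit()

    def execute_read(self, sql, params=None, require_fresh=False):
        # Reads that must observe the caller's own writes go to the
        # primary; everything else tolerates replica lag.
        conn = self.primary if require_fresh else self._next_replica()
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()

    def _next_replica(self):
        conn = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return conn
```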
By the time we had 12 read replicas, maintaining replication was consuming real engineering time. Every backup required careful coordination (we couldn’t back up a node while it was replicating), and promoting a replica to primary demanded its own testing. We’d solved the read scaling problem but created an operational complexity problem.
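What that tuning amounts to in practice, as a hedged sketch: a lag check against pg_stat_replication, assuming psycopg2 and a hypothetical DSN, with the usual postgresql.conf knobs noted in the comments:

```python
import psycopg2

# Assumes the primary's postgresql.conf carries standard settings like:
#   wal_level = replica
#   max_wal_senders = <replica count + headroom>
#   max_replication_slots = <one per replica>
PRIMARY_DSN = "host=db-primary dbname=app"  # hypothetical

LAG_ALERT_BYTES = 64 * 1024 * 1024  # flag replicas >64MB of WAL behind

def check_replica_lag():
    """Print how far each replica trails the primary, in bytes of WAL."""
    with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag
            FROM pg_stat_replication
        """)
        for name, lag in cur.fetchall():
            status = "ALERT" if lag and lag > LAG_ALERT_BYTES else "ok"
            print(f"{name}: {lag} bytes behind ({status})")
```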
The configuration shows the complexity: PostgreSQL replication requires careful tuning of WAL settings, monitoring of replication slots, and application awareness of replica lag. At 100 queries per second, this is unnecessary. At 10,000 queries per second, it becomes mandatory.
Application: Replication becomes worth considering when: (1) you have more read load than write load (typical 80/20 split), (2) write latency is acceptable (replicas lag behind primary), (3) you can handle eventual consistency. If your system requires strong consistency (financial transactions), replication doesn’t help. If your read load is growing faster than write optimization can handle, replication buys you time.
The Sharding Conversation: When One Database Becomes Many
Core: Sharding is the most complex scaling decision. It’s permanent—you’ll live with this decision for years. Consider it carefully.
Detail: At 1TB of data with 20K queries per second, a single database became a bottleneck. Disk I/O was maxed, CPU was pegged, and we couldn’t optimize queries further. Replication and caching had exhausted their returns. Sharding became inevitable.
We chose hash-based sharding on user_id. Each shard held 1/8 of users. User 1234567 → shard 3. This allowed us to split traffic across 8 database instances and linearly scale capacity. Eight database instances meant 8x the storage, 8x the write capacity, 8x the operational overhead.
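A minimal sketch of that routing, using a hash that is stable across processes (Python’s built-in hash() is not) and hypothetical shard DSNs; the exact user-to-shard mapping depends on the hash you pick:

```python
import zlib

NUM_SHARDS = 8

# Hypothetical connection strings, indexed by shard number.
SHARD_DSNS = [f"host=db-shard-{i} dbname=app" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> int:
    """Map a user_id to one of the 8 shards deterministically."""
    return zlib.crc32(str(user_id).encode()) % NUM_SHARDS
```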
But the cost was immense. Every query now needed routing logic. A query like “show me the top 100 users by signup date” required querying all 8 shards, sorting the results, and returning the top 100. Application code became complex with shard-awareness logic.
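The scatter-gather step for that query looks roughly like this, assuming each shard has already returned its own top 100 as (signup_date, user_id) rows sorted ascending:

```python
import heapq
from itertools import islice

def global_top_100(per_shard_results):
    """Merge the shards' pre-sorted top-100 lists and keep the first
    100 rows overall; no shard can contribute a global top-100 row
    that isn't already in its own top 100, so this is sufficient."""
    return list(islice(heapq.merge(*per_shard_results), 100))
```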
Resharding was a nightmare. After two years, we needed a ninth shard. Rebalancing data across nine shards meant moving 11% of all user data from existing shards—weeks of background jobs and risky migrations.
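That 11% is roughly 1/9, the minimum that has to move; naive hash % N routing would have remapped about 89% of users when N changed from 8 to 9, so a rebalance this cheap implies something like consistent hashing, sketched below (the Ring class and vnode count are illustrative):

```python
import bisect
import hashlib

def _point(key: str) -> int:
    """Stable position on the hash ring for a key or vnode label."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring: adding a shard only reassigns the keys
    that fall into the new shard's arcs, roughly 1/N of them."""

    def __init__(self, shards, vnodes=100):
        self.points = sorted(
            (_point(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )

    def shard_for(self, key: str) -> str:
        i = bisect.bisect(self.points, (_point(key),)) % len(self.points)
        return self.points[i][1]

old = Ring([f"shard-{i}" for i in range(8)])
new = Ring([f"shard-{i}" for i in range(9)])
users = [str(u) for u in range(100_000)]
moved = sum(old.shard_for(u) != new.shard_for(u) for u in users)
print(f"{moved / len(users):.1%} of keys moved")  # ~11%, vs ~89% with % N
```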
Sharding multiplied operational complexity. Before sharding, a backup was straightforward—backup one database. After sharding, backups had to be coordinated across eight databases, and recovery meant knowing which shard failed. Monitoring became difficult—tracking latency required aggregating metrics from eight data stores.
Application: Sharding is valuable when: (1) you’ve maximized single-database performance, (2) you understand your access patterns (most queries are shard-aware), (3) you can live with operational complexity. If you can solve your problem with replication, caching, or better queries, avoid sharding. Resharding is so expensive that you want to get your shard count right the first time—and you almost never do.
The Caching Layer: When 100ms Queries Become 2ms
Core: Caching is often the most misunderstood scaling tool. It’s not about speed; it’s about load distribution. Good caching takes load off the database, not just time off user requests.
Detail: Our user profile service was returning 400ms responses because every request hit the database with a complex JOIN. Adding Redis caching dropped it to 20ms on cache hits. Perfect, we thought.
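A minimal cache-aside sketch of that read path with redis-py; the host and load_profile_from_db() are hypothetical stand-ins for the real client and the JOIN-heavy query:

```python
import json
import redis

r = redis.Redis(host="cache", port=6379)  # hypothetical cache host
PROFILE_TTL = 300  # seconds; the 5-minute staleness window below

def get_profile(user_id, load_profile_from_db):
    """Cache-aside read: serve hits from Redis, fall back to the
    database on a miss, and populate the cache on the way out."""
    key = f"profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = load_profile_from_db(user_id)  # the expensive JOIN
    r.setex(key, PROFILE_TTL, json.dumps(profile))
    return profile
```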
Six months later, we faced the cache invalidation problem. Users updated their profiles, but the cache had the old version for 5 minutes. Features depending on up-to-date profiles had race conditions. Cache invalidation became harder than the original optimization.
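One common mitigation, sketched with the names from above, is to invalidate on write so the next read repopulates; note this narrows the staleness window rather than eliminating it:

```python
def update_profile(user_id, fields, write_profile_to_db):
    """Write-then-invalidate: after the database write commits, drop
    the cached copy so the next read fetches fresh data. A race
    between this DELETE and a concurrent cache fill is still possible."""
    write_profile_to_db(user_id, fields)  # hypothetical write path
    r.delete(f"profile:{user_id}")
```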
The real lesson came from looking at the database. Adding caching didn’t improve database load—it just delayed it. Cache misses still hit the database just as hard. What we actually needed was better indexing and query optimization. The cache masked the problem instead of solving it.
Real caching value came from computationally expensive operations. We had a “user feed” that involved computing recommendations by comparing the user to 10M items. Computing it on-demand took 3 seconds. Computing it every hour and caching it took 50ms to retrieve. Now caching wasn’t about hiding database load—it was about replacing expensive computation with fast retrieval.
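The shape of that pattern, reusing the Redis client and json import from above; compute_feed() and the hourly scheduler that calls refresh_feed() are hypothetical:

```python
FEED_TTL = 2 * 60 * 60  # two hours, so one missed run doesn't empty the feed

def refresh_feed(user_id, compute_feed):
    """Run on a schedule: do the expensive computation once and store
    the result where reads can fetch it in tens of milliseconds."""
    feed = compute_feed(user_id)  # the 3-second recommendation pass
    r.setex(f"feed:{user_id}", FEED_TTL, json.dumps(feed))

def get_feed(user_id):
    cached = r.get(f"feed:{user_id}")
    return json.loads(cached) if cached else []  # or fall back to a refresh
```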
| Cache target | Policy |
| --- | --- |
| Expensive computations | Cache aggressively, long TTL |
| Simple lookups | Short TTL, or invalidate on write |
| Rarely repeated results | Don’t cache |
The strategy: cache expensive computations aggressively (long TTL), cache simple lookups conservatively (short TTL or invalidate on write), don’t cache things that rarely repeat.
Application: Caching is most valuable for read-heavy workloads with expensive computations. It’s least valuable for write-heavy workloads or simple database queries. If you’re caching to hide slow queries, fix the queries first, then cache to handle the remaining load spikes.
The Real Bottleneck: Query Optimization Before Scaling
Core: Most scaling problems are actually query problems.
Detail: Before we sharded, we found that 40% of traffic came from 5 poorly written queries. Adding the right indexes dropped response time by 60%. That one day of optimization work was worth more than our entire replication and caching strategy.
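On PostgreSQL, the pg_stat_statements extension is one way to surface those queries; a sketch, assuming the extension is enabled and a hypothetical DSN:

```python
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=app"  # hypothetical

def top_queries(limit=5):
    """List the statements consuming the most total execution time.
    The column is total_exec_time on PostgreSQL 13+; older versions
    call it total_time."""
    with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT calls,
                   round(total_exec_time) AS total_ms,
                   left(query, 80) AS query
            FROM pg_stat_statements
            ORDER BY total_exec_time DESC
            LIMIT %s
        """, (limit,))
        return cur.fetchall()
```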
Application: Premature scaling is expensive. Mature scaling is valuable. Understand your bottleneck before you scale.
Hero Image Prompt: “Database architecture evolution showing progression: single database → replicated databases → sharded databases with caching layers. Show query flow with timing metrics (20ms, 3ms, etc). Include capacity graphs showing linear scaling with sharding. Technical, clean visualization with network connections between components. Dark theme with cyan (#16213e) accent lines showing query routing.”