Implementing Distributed Systems Concepts
Distributed systems are hard to reason about because nodes fail independently and messages can be delayed or lost, but a small set of concepts and patterns makes them manageable. This guide covers practical implementations of consensus algorithms, distributed coordination, and fault tolerance.
Core Concepts
Consensus Algorithms
Consensus ensures multiple nodes agree on a single value, even in the presence of failures.
Raft Algorithm Implementation (node state and the start of an election):
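class RaftNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = "follower"   # every node starts as a follower
        self.current_term = 0     # latest term this node has seen
        self.voted_for = None     # candidate voted for in the current term
        self.log = []             # replicated log entries

    def start_election(self):
        # Become a candidate for a new term
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id  # a candidate votes for itself
        # Request votes from other nodes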
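The vote-request step that `start_election` leaves as a comment can be sketched as follows. This is a minimal illustration of Raft's voting rules, not a full implementation: the names `VoteRequest`, `Voter`, `handle_vote_request`, and `wins_election` are assumptions made for this example, log entries are assumed to be `(term, command)` tuples, and networking, timers, and persistence are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VoteRequest:
    term: int             # candidate's current term
    candidate_id: str
    last_log_index: int   # index of the candidate's last log entry (0 if empty)
    last_log_term: int    # term of the candidate's last log entry (0 if empty)


@dataclass
class Voter:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    log: list = field(default_factory=list)  # assumed: list of (term, command)

    def handle_vote_request(self, req: VoteRequest) -> bool:
        # Reject candidates from an older term.
        if req.term < self.current_term:
            return False
        # Seeing a newer term clears any vote cast in the old term.
        if req.term > self.current_term:
            self.current_term = req.term
            self.voted_for = None
        # The candidate's log must be at least as up to date as ours:
        # compare the last entry's term first, then the log length.
        my_last_term = self.log[-1][0] if self.log else 0
        my_last_index = len(self.log)
        log_ok = (req.last_log_term, req.last_log_index) >= (my_last_term, my_last_index)
        # Grant at most one vote per term.
        if log_ok and self.voted_for in (None, req.candidate_id):
            self.voted_for = req.candidate_id
            return True
        return False


def wins_election(votes_received: int, cluster_size: int) -> bool:
    # A candidate needs a strict majority of the whole cluster,
    # counting its own vote.
    return votes_received > cluster_size // 2
```

Granting at most one vote per term is what keeps two candidates from both collecting a majority in the same term.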
Distributed Locks
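Implement distributed locking using Redis. The lock is a key set with NX (only if it does not already exist) and an expiry, so a crashed holder cannot block other clients forever: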
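import time
import uuid

import redis


class DistributedLock:
    def __init__(self, redis_client, key, timeout=10):
        self.redis = redis_client
        self.key = key
        # Used both as the acquire deadline and as the lock expiry (seconds)
        self.timeout = timeout
        # Unique token so that only the holder can release the lock
        self.identifier = str(uuid.uuid4())

    def acquire(self):
        end = time.time() + self.timeout
        while time.time() < end:
            # SET key value NX EX: succeeds only if the key is absent
            if self.redis.set(self.key, self.identifier, nx=True, ex=self.timeout):
                return True
            time.sleep(0.001)
        return False

    def release(self):
        # The context manager resets the pipeline on exit
        with self.redis.pipeline(True) as pipe:
            while True:
                try:
                    pipe.watch(self.key)
                    value = pipe.get(self.key)
                    # redis-py returns bytes unless decode_responses=True
                    if isinstance(value, bytes):
                        value = value.decode()
                    if value == self.identifier:
                        pipe.multi()
                        pipe.delete(self.key)
                        pipe.execute()
                        return True
                    # Lock is held by someone else or has expired
                    pipe.unwatch()
                    break
                except redis.WatchError:
                    # Key changed between WATCH and EXEC; retry
                    continue
        return False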
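As an alternative to the WATCH/MULTI loop in `release`, the compare-and-delete step can run as a single Lua script on the server, which removes the retry loop. The sketch below is a different technique from the one shown above, not a correction of it. The function name `release_lock` and constant `RELEASE_SCRIPT` are assumptions for this example, and `redis_client` is assumed to be a redis-py client, whose `eval(script, numkeys, *keys_and_args)` method is used here.

```python
# Lua script: delete the key only if it still holds our token.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def release_lock(redis_client, key, identifier):
    """Release the lock at `key` if `identifier` still owns it.

    Returns True if the key was deleted, False if the lock had expired
    or is now held by another client.
    """
    # One key (KEYS[1]) followed by one argument (ARGV[1]).
    return redis_client.eval(RELEASE_SCRIPT, 1, key, identifier) == 1
```

Because Redis executes the script atomically, no other client can change the key between the comparison and the delete.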
Fault Tolerance Patterns
Circuit Breaker
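A circuit breaker prevents cascading failures: after repeated errors it stops calling the failing dependency and fails fast, then allows a trial call once a timeout has passed: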
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # normal operation, calls pass through
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # trial calls allowed after the timeout


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold  # failures before opening
        self.timeout = timeout                      # seconds to stay open
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            # After the timeout, let a trial call through
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise  # re-raise with the original traceback

    def on_success(self):
        # Any success closes the circuit and clears the failure count
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
Best Practices
- Design for Partial Failures: Always assume some components will be unavailable
- Implement Idempotency: Operations should be safe to retry
- Use Timeouts: Every network call should have a timeout
- Monitor and Alert: Track system health and performance metrics
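The idempotency and timeout practices above can be combined in a retry helper. The following is a minimal sketch under stated assumptions: `call_with_retries` is a hypothetical name, and `operation` is assumed to be a callable that accepts an idempotency key and a timeout in seconds and raises `TimeoutError` or `ConnectionError` on transient failures. How the key reaches the server (a header, a request field) depends on the API being called.

```python
import random
import time
import uuid


def call_with_retries(operation, max_attempts=3, base_delay=0.1,
                      timeout=2.0, sleep=time.sleep):
    """Call operation(idempotency_key, timeout) until it succeeds
    or max_attempts is reached."""
    # One key for all attempts, so the receiver can recognise a retry
    # of the same logical request and avoid applying it twice.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(idempotency_key, timeout)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff with jitter to avoid retry storms.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter exists only so that tests can replace the real delay; in normal use the default `time.sleep` applies.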
Distributed systems require careful design and implementation, but these patterns provide a solid foundation for building resilient, scalable systems.