Raft is a consensus algorithm designed for understandability. It's equivalent to Paxos in fault-tolerance and performance but much easier to understand and implement.
Developed by: Diego Ongaro & John Ousterhout at Stanford (2013)
Key Goal: "Understandability" - structured to be easy to learn and implement correctly
Core Principle: Strong leader - all log entries flow from leader to followers
| State | Role | Responsibilities | Transitions To |
|---|---|---|---|
| Follower | Passive | • Respond to RPCs from leaders and candidates • Reset election timer on heartbeat • Vote in elections | Candidate (timeout) |
| Candidate | Election participant | • Start election • Request votes from other nodes • Vote for self | Leader (majority) or Follower (lose) |
| Leader | Strong leader | • Accept client requests • Replicate log to followers • Send periodic heartbeats • Manage commit index | Follower (higher term discovered) |
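The transitions in the table form a small state machine. A minimal sketch using plain strings (the event names and `transition` helper are illustrative, not from any particular implementation):

```python
# Legal Raft state transitions, taken from the table above
ALLOWED_TRANSITIONS = {
    ("follower", "candidate"),   # election timeout
    ("candidate", "leader"),     # won a majority of votes
    ("candidate", "follower"),   # lost election or saw a higher term
    ("candidate", "candidate"),  # split vote: start a new election
    ("leader", "follower"),      # discovered a higher term
}

def transition(state: str, event: str) -> str:
    """Apply an event to a state; raise if the move is not legal."""
    targets = {
        "timeout": "candidate",
        "won_majority": "leader",
        "lost_election": "follower",
        "higher_term": "follower",
        "split_vote": "candidate",
    }
    new_state = targets[event]
    if (state, new_state) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Note there is no path from Follower or Leader directly to Leader: the only way to lead is to win an election as a Candidate.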
Node grants vote if ALL conditions met:
1. Candidate's term is at least as large as the node's current term
2. Node has not already voted for a different candidate in this term
3. Candidate's log is at least as up-to-date as the node's (higher last log term wins; if last terms are equal, the longer log wins)
Problem: Multiple candidates split the vote, no one gets majority
Example: With 4 nodes, Candidate A gets 2 votes, Candidate B gets 2 votes (need 3 for majority)
Solution: Randomized election timeouts (150-300ms) - each node draws a fresh random timeout, so one node usually becomes a candidate and wins before the others time out
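Why randomization breaks split votes can be seen with a quick simulation: each node draws its own timeout, and the node with the smallest draw starts its election (and usually collects votes) before anyone else times out. A sketch with illustrative numbers:

```python
import random

random.seed(7)  # seeded only to make the illustration repeatable

# Each node draws an election timeout uniformly from 150-300 ms
timeouts = {node: random.uniform(150, 300) for node in ["A", "B", "C", "D", "E"]}

# The node with the smallest timeout becomes a candidate first
first = min(timeouts, key=timeouts.get)

# Its head start over the second-fastest node is the window in which
# it can gather votes before a competing candidate appears
head_start = sorted(timeouts.values())[1] - timeouts[first]

print(f"{first} times out first, with a {head_start:.1f} ms head start")
```

With a 150 ms spread, two nodes drawing nearly identical timeouts is rare, so repeated split votes quickly become improbable.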
Maintained by: AppendEntries consistency check - the leader includes (prevLogIndex, prevLogTerm) in every AppendEntries RPC, and the follower rejects the request unless its own log contains an entry at prevLogIndex with term prevLogTerm
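The follower's side of that check is just two comparisons. A minimal sketch (the function name and list-of-terms representation are illustrative):

```python
def passes_consistency_check(log_terms, prev_log_index, prev_log_term):
    """Follower-side AppendEntries consistency check.

    log_terms: list where log_terms[i] is the term of the entry at index i
               (index 0 is the dummy entry with term 0).
    """
    if prev_log_index >= len(log_terms):
        return False  # follower's log is too short to contain prev_log_index
    return log_terms[prev_log_index] == prev_log_term

# Follower log terms by index: [dummy, 1, 1, 2]
passes_consistency_check([0, 1, 1, 2], 3, 2)  # accepted: entry 3 has term 2
passes_consistency_check([0, 1, 1, 2], 3, 3)  # rejected: term mismatch
passes_consistency_check([0, 1, 1, 2], 5, 2)  # rejected: log too short
```

By induction, a match at (prevLogIndex, prevLogTerm) implies the two logs are identical up to that index, which is exactly the Log Matching property.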
```python
# Leader's log replication algorithm
def replicate_to_follower(follower_id):
    while True:
        next_index = leader.next_index[follower_id]
        prev_index = next_index - 1
        prev_term = leader.log[prev_index].term

        # Send entries from next_index onwards
        entries = leader.log[next_index:]
        response = follower.append_entries(
            prev_log_index=prev_index,
            prev_log_term=prev_term,
            entries=entries,
        )

        if response.success:
            # Follower accepted - update tracking
            leader.match_index[follower_id] = prev_index + len(entries)
            leader.next_index[follower_id] = leader.match_index[follower_id] + 1
            break
        else:
            # Follower rejected - decrement next_index and retry.
            # This will eventually find the point where the logs match
            # (next_index never drops below 1, since the dummy entry at
            # index 0 always matches)
            leader.next_index[follower_id] -= 1
```
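The pseudocode above updates match_index but does not show how the leader turns those values into a new commit index. One common bookkeeping sketch (function and parameter names are hypothetical): commit the highest index replicated on a majority, and only for entries from the leader's current term (the rule from Figure 8 of the Raft paper):

```python
def advance_commit_index(match_index, own_last_index, log_terms,
                         current_term, commit_index):
    """Return the leader's new commit index.

    match_index: dict follower_id -> highest log index known replicated there
    own_last_index: the leader's own last log index (it always matches itself)
    log_terms: dict index -> term of the log entry at that index
    """
    # Replication progress for every cluster member, leader included
    indices = sorted(list(match_index.values()) + [own_last_index])

    # The (n-1)//2-th smallest value is the highest index present
    # on a strict majority of the cluster
    majority_index = indices[(len(indices) - 1) // 2]

    # Only commit entries from the current term directly; older entries
    # become committed implicitly once a current-term entry commits
    if majority_index > commit_index and log_terms[majority_index] == current_term:
        return majority_index
    return commit_index
```

For a 5-node cluster with match_index values {2, 2, 0, 0} and the leader's own last index 2, index 2 is on three of five nodes, so it commits once it belongs to the current term.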
| Property | Guarantee | How Achieved |
|---|---|---|
| Election Safety | At most one leader per term | Each node votes for at most one candidate per term |
| Leader Append-Only | Leader never overwrites/deletes log entries | Leader only appends to its log |
| Log Matching | Identical entries at same index/term | AppendEntries consistency check |
| Leader Completeness | Committed entries present in future leaders | Vote restriction - candidate must have all committed entries |
| State Machine Safety | All nodes apply same commands in same order | Log matching + leader completeness |
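The reference implementation in this section uses several supporting types (NodeState, LogEntry, and the RPC request/response messages) that are not shown. A minimal sketch of what they might look like, using dataclasses (field names are chosen to match the code, but the exact shapes are an assumption):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    term: int      # term in which the entry was created
    command: Any   # state-machine command (None for the dummy entry)
    index: int     # position in the log

@dataclass
class VoteRequest:
    term: int
    candidate_id: str
    last_log_index: int
    last_log_term: int

@dataclass
class VoteResponse:
    term: int
    vote_granted: bool

@dataclass
class AppendEntriesRequest:
    term: int
    leader_id: str
    prev_log_index: int
    prev_log_term: int
    entries: List[LogEntry] = field(default_factory=list)  # empty = heartbeat
    leader_commit: int = 0

@dataclass
class AppendEntriesResponse:
    term: int
    success: bool
    match_index: int = 0
```

An AppendEntriesRequest with an empty entries list doubles as the heartbeat, which is why followers reset their election timer on every accepted AppendEntries.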
```python
import random
import time
from typing import Dict, List, Optional


class RaftNode:
    """A node in the Raft cluster"""

    def __init__(self, node_id: str, cluster_nodes: List[str]):
        self.node_id = node_id
        self.cluster_nodes = cluster_nodes
        self.state = NodeState.FOLLOWER

        # Persistent state (survives crashes)
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = [LogEntry(0, None, 0)]  # Dummy at index 0

        # Volatile state
        self.commit_index = 0
        self.last_applied = 0

        # Leader state
        self.next_index: Dict[str, int] = {}   # For each server, next log index to send
        self.match_index: Dict[str, int] = {}  # For each server, highest replicated index

        # Election timing
        self.election_timeout = random.uniform(150, 300)  # milliseconds
        self.last_heartbeat = time.time()

    def become_follower(self, term: int):
        """Step down to follower after discovering a higher term"""
        self.state = NodeState.FOLLOWER
        self.current_term = term
        self.voted_for = None

    def become_candidate(self):
        """Transition to candidate state and start election"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id  # Vote for self
        self.votes_received = {self.node_id}
        self.last_heartbeat = time.time()

    def become_leader(self):
        """Transition to leader state"""
        self.state = NodeState.LEADER
        # Initialize leader state
        last_log_index = len(self.log) - 1
        for node in self.cluster_nodes:
            if node != self.node_id:
                self.next_index[node] = last_log_index + 1
                self.match_index[node] = 0

    def request_vote(self, request: VoteRequest) -> VoteResponse:
        """Handle vote request from candidate

        Grant vote if:
        1. Haven't voted in this term (or already voted for this candidate)
        2. Candidate's log is at least as up-to-date as receiver's log
        """
        # Update term if request has higher term
        if request.term > self.current_term:
            self.become_follower(request.term)

        # Deny vote if we're in a later term
        if request.term < self.current_term:
            return VoteResponse(self.current_term, False)

        # Check if we've already voted for someone else
        if self.voted_for is not None and self.voted_for != request.candidate_id:
            return VoteResponse(self.current_term, False)

        # Check if candidate's log is at least as up-to-date
        last_log_index = len(self.log) - 1
        last_log_term = self.log[last_log_index].term
        log_is_up_to_date = (
            request.last_log_term > last_log_term or
            (request.last_log_term == last_log_term and
             request.last_log_index >= last_log_index)
        )

        if log_is_up_to_date:
            self.voted_for = request.candidate_id
            self.last_heartbeat = time.time()  # granting a vote resets the timer
            return VoteResponse(self.current_term, True)

        return VoteResponse(self.current_term, False)

    def append_entries(self, request: AppendEntriesRequest) -> AppendEntriesResponse:
        """Handle append entries request from leader

        Also serves as heartbeat to prevent elections
        """
        # Update term if request has higher term
        if request.term > self.current_term:
            self.become_follower(request.term)

        # Reject if request is from an old term
        if request.term < self.current_term:
            return AppendEntriesResponse(self.current_term, False)

        # Reset election timer (received heartbeat)
        self.last_heartbeat = time.time()

        # Consistency check: verify log contains entry at prev_log_index
        if request.prev_log_index > 0:
            if len(self.log) <= request.prev_log_index:
                return AppendEntriesResponse(self.current_term, False)
            if self.log[request.prev_log_index].term != request.prev_log_term:
                # Remove conflicting entries
                self.log = self.log[:request.prev_log_index]
                return AppendEntriesResponse(self.current_term, False)

        # Append new entries, truncating the log at the first term conflict
        for i, entry in enumerate(request.entries):
            index = request.prev_log_index + 1 + i
            if index < len(self.log):
                if self.log[index].term != entry.term:
                    # Conflicting entry: delete it and everything after it
                    self.log = self.log[:index]
                    self.log.append(entry)
                # Same term at same index: entry is already present, keep it
            else:
                self.log.append(entry)  # Append new entry

        # Update commit index
        if request.leader_commit > self.commit_index:
            self.commit_index = min(request.leader_commit, len(self.log) - 1)

        match_index = request.prev_log_index + len(request.entries)
        return AppendEntriesResponse(self.current_term, True, match_index)
```
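The vote restriction that yields Leader Completeness compares logs by last term first and last index second. As a standalone predicate (a hypothetical helper, equivalent to the inline check in request_vote above):

```python
def at_least_as_up_to_date(cand_last_term, cand_last_index,
                           voter_last_term, voter_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's.

    A log with a higher last term is more up-to-date regardless of length;
    with equal last terms, the longer log is more up-to-date.
    """
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index

# A shorter log can still win a vote if its last term is newer
at_least_as_up_to_date(3, 5, 2, 9)  # candidate wins on term
at_least_as_up_to_date(2, 4, 2, 6)  # voter's log is longer: vote denied
```

Because a committed entry exists on a majority, and a candidate needs votes from a majority, at least one voter holds the entry and will deny any candidate whose log lacks it.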
| Feature | Raft | Paxos |
|---|---|---|
| Design Goal | Understandability | Theoretical completeness |
| Leader | Strong leader (required) | No leader (Multi-Paxos adds one) |
| Log Structure | Leader's log is source of truth | No distinguished log |
| Membership Changes | Built-in (joint consensus) | Not specified |
| Learning Curve | Easier to understand | Notoriously difficult |
| Implementation | Many open-source implementations | Fewer, mostly proprietary |
etcd
Purpose: Distributed key-value store (Kubernetes' backing store)
Consensus: Raft for leader election and log replication
Scale: Millions of deployments worldwide
Consul
Purpose: Service mesh and service discovery
Consensus: Raft for maintaining cluster state
Features: Health checking, KV store, multi-datacenter
CockroachDB
Purpose: Distributed SQL database
Consensus: Raft for replication of data ranges
Innovation: Raft used per-range, not cluster-wide