Raft Consensus Algorithm

What is Raft?

Raft is a consensus algorithm designed for understandability. It's equivalent to Paxos in fault-tolerance and performance but much easier to understand and implement.

Developed by: Diego Ongaro & John Ousterhout at Stanford (2013)

Key Goal: "Understandability" - structured to be easy to learn and implement correctly

Core Principle: Strong leader - all log entries flow from leader to followers

Node States & Transitions

```mermaid
stateDiagram-v2
    [*] --> Follower: Start
    Follower --> Candidate: Election timeout<br/>No heartbeat from leader
    Candidate --> Leader: Receives votes<br/>from majority
    Candidate --> Follower: Discovers current leader<br/>or new term
    Candidate --> Candidate: Split vote<br/>Timeout & retry
    Leader --> Follower: Discovers server<br/>with higher term
    note right of Follower
        • Passive state
        • Responds to RPCs
        • Resets timer on heartbeat
    end note
    note right of Candidate
        • Requests votes
        • Increments term
        • Votes for self
    end note
    note right of Leader
        • Handles client requests
        • Sends heartbeats
        • Replicates log entries
    end note
```

Three States

| State | Role | Responsibilities | Transitions To |
|---|---|---|---|
| Follower | Passive | • Respond to RPCs from leaders and candidates<br/>• Reset election timer on heartbeat<br/>• Vote in elections | Candidate (timeout) |
| Candidate | Election participant | • Start election<br/>• Request votes from other nodes<br/>• Vote for self | Leader (majority) or Follower (lose) |
| Leader | Strong leader | • Accept client requests<br/>• Replicate log to followers<br/>• Send periodic heartbeats<br/>• Manage commit index | Follower (higher term discovered) |
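
The table reduces to a small transition relation. The sketch below (purely illustrative, not from any Raft library; states are plain strings here) makes the key restriction explicit: a follower can never jump straight to leader.

```python
# Legal state transitions from the table above, encoded as a lookup
TRANSITIONS = {
    "FOLLOWER": {"CANDIDATE"},
    "CANDIDATE": {"LEADER", "FOLLOWER", "CANDIDATE"},
    "LEADER": {"FOLLOWER"},
}

def can_transition(src: str, dst: str) -> bool:
    return dst in TRANSITIONS[src]

can_transition("FOLLOWER", "CANDIDATE")  # a timeout makes a follower a candidate
can_transition("FOLLOWER", "LEADER")     # never directly - must win an election first
```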

Terms - Logical Time

```mermaid
graph LR
    T1["Term 1<br/>Leader: Node A<br/>Normal operation"]
    E1["Election<br/>Split vote"]
    T2["Term 2<br/>No leader<br/>(election failed)"]
    E2["Election<br/>Node B wins"]
    T3["Term 3<br/>Leader: Node B<br/>Normal operation"]
    E3["Election<br/>Node C wins"]
    T4["Term 4<br/>Leader: Node C<br/>Normal operation"]
    T1 --> E1
    E1 --> T2
    T2 --> E2
    E2 --> T3
    T3 --> E3
    E3 --> T4
    style T1 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style T2 fill:#ffcdd2,stroke:#c62828,stroke-width:2px,color:#2e3440
    style T3 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style T4 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style E1 fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#2e3440
    style E2 fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#2e3440
    style E3 fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#2e3440
```

Term Properties

- Terms are numbered with consecutive, monotonically increasing integers
- Each term begins with an election; a term has at most one leader, or none if the election fails (Term 2 above)
- Terms act as a logical clock: every RPC carries the sender's current term
- A node that sees a higher term in any RPC adopts it and reverts to follower
- A node rejects any RPC carrying a stale (lower) term
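
The term rules boil down to a single comparison on every RPC. A minimal sketch (the helper name and string states are illustrative, not from any real implementation):

```python
def observe_term(my_term: int, my_state: str, rpc_term: int):
    """Apply Raft's term rule when any RPC arrives."""
    if rpc_term > my_term:
        # Higher term seen: adopt it and step down to follower
        return rpc_term, "FOLLOWER"
    # Same or lower term: keep current term and state
    # (a stale sender gets rejected by the RPC handler itself)
    return my_term, my_state
```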

Leader Election

```mermaid
sequenceDiagram
    participant F1 as Node 1<br/>(Follower)
    participant F2 as Node 2<br/>(Follower)
    participant C as Node 3<br/>(Candidate)
    participant F4 as Node 4<br/>(Follower)
    participant F5 as Node 5<br/>(Follower)

    Note over C: Election timeout!<br/>No heartbeat received
    rect rgb(235, 203, 139, 0.2)
        Note over C: Become Candidate<br/>Increment term to 5<br/>Vote for self
    end
    C->>F1: RequestVote(term=5, candidate=3)
    Note over F1: term 5 > my term 4<br/>Haven't voted yet ✓
    F1-->>C: VoteGranted=True
    C->>F2: RequestVote(term=5, candidate=3)
    Note over F2: term 5 > my term 4<br/>Haven't voted yet ✓
    F2-->>C: VoteGranted=True
    C->>F4: RequestVote(term=5, candidate=3)
    Note over F4: Already voted for Node 5 ✗
    F4-->>C: VoteGranted=False
    C->>F5: RequestVote(term=5, candidate=3)
    Note over F5: term 5 > my term 4<br/>Haven't voted yet ✓
    F5-->>C: VoteGranted=True
    Note over C: Votes: 4/5 (self + 3 others)<br/>Majority: 3/5 ✓
    rect rgb(163, 190, 140, 0.2)
        Note over C: Become Leader for term 5 👑<br/>Start sending heartbeats
    end
    C->>F1: AppendEntries(term=5, heartbeat)
    C->>F2: AppendEntries(term=5, heartbeat)
    C->>F4: AppendEntries(term=5, heartbeat)
    C->>F5: AppendEntries(term=5, heartbeat)
    Note over F1,F5: All nodes recognize Node 3 as leader
```

Election Rules

When to Start Election

A follower starts an election when its election timeout expires without it having received an AppendEntries heartbeat from the current leader or granted a vote to a candidate. It then becomes a candidate: it increments its term, votes for itself, and sends RequestVote RPCs to all other nodes.

Vote Granting Rules

Node grants vote if ALL conditions met:

- The candidate's term is at least the node's current term
- The node has not yet voted in this term, or has already voted for this same candidate
- The candidate's log is at least as up-to-date as the node's: its last entry has a higher term, or the same term and an index greater than or equal to the node's last index
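
The up-to-date comparison reduces to two fields from each log's last entry. A minimal sketch (parameter names are illustrative):

```python
def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    # Compare by last entry's term first; break ties by log length (last index)
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```

Note that a higher last term beats a longer log: term dominates length because entries from later terms are the more recent history.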

Split Vote Scenario

Problem: Multiple candidates split the vote, no one gets majority

Example: With 4 nodes, Candidate A gets 2 votes, Candidate B gets 2 votes (need 3 for majority)

Solution: Randomized election timeouts (150-300ms)
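
A sketch of how such timeouts are typically drawn (the 150-300 ms range matches the figure above; the helper name is illustrative):

```python
import random

def random_election_timeout() -> float:
    # Each node draws independently, so two nodes rarely time out together;
    # the node with the shortest timeout usually starts (and wins) the next election
    return random.uniform(0.150, 0.300)  # seconds

timeout = random_election_timeout()
assert 0.150 <= timeout <= 0.300
```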

Log Replication

```mermaid
sequenceDiagram
    participant Client
    participant Leader as Leader<br/>(Node 1)
    participant F2 as Follower 2
    participant F3 as Follower 3
    participant F4 as Follower 4
    participant F5 as Follower 5

    Client->>Leader: Request: SET x=5
    rect rgb(235, 203, 139, 0.1)
        Note over Leader: 1. Append to local log<br/>Entry(term=3, index=7, cmd="SET x=5")
    end
    Leader->>F2: AppendEntries(term=3, entries=[Entry7])
    Leader->>F3: AppendEntries(term=3, entries=[Entry7])
    Leader->>F4: AppendEntries(term=3, entries=[Entry7])
    Leader->>F5: AppendEntries(term=3, entries=[Entry7])
    F2->>F2: Append Entry7
    F2-->>Leader: Success (match_index=7)
    F3->>F3: Append Entry7
    F3-->>Leader: Success (match_index=7)
    F4->>F4: Append Entry7
    F4-->>Leader: Success (match_index=7)
    Note over F5: Network partition ✗
    F5-->>Leader: (no response)
    rect rgb(163, 190, 140, 0.2)
        Note over Leader: Majority replicated! (3/5)<br/>✓ Commit index = 7
    end
    Leader->>Leader: Apply "SET x=5" to state machine
    Leader-->>Client: Success: x=5
    Note over Leader,F5: Next heartbeat will update commit index on followers
    Leader->>F2: AppendEntries(commit_index=7)
    Leader->>F3: AppendEntries(commit_index=7)
    Leader->>F4: AppendEntries(commit_index=7)
    Note over F2,F4: Followers apply committed entries
```
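
The leader's commit decision - advance commit_index once a majority of nodes stores the entry - can be sketched as below. This is a hedged illustration; real Raft additionally requires that the entry being committed belongs to the leader's current term before it may be marked committed.

```python
def majority_commit_index(follower_match_indexes: list, leader_last_index: int) -> int:
    """Highest log index replicated on a majority of the cluster."""
    # The leader always holds its own entries, so count it too
    indexes = sorted(follower_match_indexes + [leader_last_index], reverse=True)
    # The element at position n//2 (0-based) is held by at least n//2 + 1 nodes
    return indexes[len(indexes) // 2]

# 5-node cluster from the diagram: three followers at index 7, one partitioned at 6
majority_commit_index([7, 7, 7, 6], 7)  # → 7
```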

Log Structure

```mermaid
graph LR
    subgraph Leader["Leader Log (committed ✓)"]
        L1["1<br/>term:1<br/>x=1"]
        L2["2<br/>term:1<br/>y=2"]
        L3["3<br/>term:2<br/>z=3"]
        L4["4<br/>term:3<br/>x=5"]
        L5["5<br/>term:3<br/>w=7"]
    end
    subgraph Follower1["Follower 1 (up-to-date)"]
        F1_1["1<br/>term:1<br/>x=1"]
        F1_2["2<br/>term:1<br/>y=2"]
        F1_3["3<br/>term:2<br/>z=3"]
        F1_4["4<br/>term:3<br/>x=5"]
        F1_5["5<br/>term:3<br/>w=7"]
    end
    subgraph Follower2["Follower 2 (behind)"]
        F2_1["1<br/>term:1<br/>x=1"]
        F2_2["2<br/>term:1<br/>y=2"]
        F2_3["3<br/>term:2<br/>z=3"]
    end
    subgraph Follower3["Follower 3 (conflicting)"]
        F3_1["1<br/>term:1<br/>x=1"]
        F3_2["2<br/>term:1<br/>y=2"]
        F3_3["3<br/>term:2<br/>a=9"]
        F3_4["4<br/>term:2<br/>b=8"]
    end
    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> L5
    F1_1 --> F1_2
    F1_2 --> F1_3
    F1_3 --> F1_4
    F1_4 --> F1_5
    F2_1 --> F2_2
    F2_2 --> F2_3
    F3_1 --> F3_2
    F3_2 --> F3_3
    F3_3 --> F3_4
    style L1 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style L2 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style L3 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style L4 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style L5 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#2e3440
    style F3_3 fill:#ffcdd2,stroke:#c62828,stroke-width:2px,color:#2e3440
    style F3_4 fill:#ffcdd2,stroke:#c62828,stroke-width:2px,color:#2e3440
```

Log Matching Property

Raft Guarantees

The Log Matching Property states that if two logs contain an entry with the same index and term, then:

- The entries store the same command
- The logs are identical in all preceding entries

Maintained by: the AppendEntries consistency check - the leader includes (prevLogIndex, prevLogTerm) with every request, and the follower rejects the request unless its own log has a matching entry at that position.
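
A follower-side sketch of that consistency check, using a minimal `Entry` record (names are illustrative; index 0 holds a dummy entry, matching the convention in the implementation below):

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["term", "command"])

def consistency_check(log, prev_log_index: int, prev_log_term: int) -> bool:
    # Accept only if the log holds an entry at prev_log_index with prev_log_term
    if prev_log_index >= len(log):
        return False  # follower's log is too short
    return log[prev_log_index].term == prev_log_term

log = [Entry(0, None), Entry(1, "x=1"), Entry(2, "z=3")]
consistency_check(log, 2, 2)  # terms match → accept
consistency_check(log, 2, 3)  # term conflict → reject, leader backs up
```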

Handling Inconsistencies

```python
# Leader's log replication algorithm (sketch: `leader` is the node's state,
# `follower` stands in for the follower's RPC endpoint)
def replicate_to_follower(follower_id):
    while True:
        next_index = leader.next_index[follower_id]
        prev_index = next_index - 1
        prev_term = leader.log[prev_index].term

        # Send entries from next_index onwards
        entries = leader.log[next_index:]

        response = follower.append_entries(
            prev_log_index=prev_index,
            prev_log_term=prev_term,
            entries=entries
        )

        if response.success:
            # Follower accepted - update tracking
            leader.match_index[follower_id] = prev_index + len(entries)
            leader.next_index[follower_id] = leader.match_index[follower_id] + 1
            break
        else:
            # Follower rejected - back up one entry and retry; this eventually
            # reaches the point where the logs match. next_index never drops
            # below 1 because index 0 holds a dummy entry both sides share.
            leader.next_index[follower_id] = max(1, leader.next_index[follower_id] - 1)
```

Safety Properties

| Property | Guarantee | How Achieved |
|---|---|---|
| Election Safety | At most one leader per term | Each node votes for at most one candidate per term |
| Leader Append-Only | Leader never overwrites/deletes log entries | Leader only appends to its log |
| Log Matching | Identical entries at same index/term | AppendEntries consistency check |
| Leader Completeness | Committed entries present in future leaders | Vote restriction - candidate must have all committed entries |
| State Machine Safety | All nodes apply same commands in same order | Log matching + leader completeness |
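
State Machine Safety shows up in code as a single apply loop that every node runs: entries are applied strictly in log order, and only up to commit_index. A minimal sketch (the dict-based key-value state machine is illustrative):

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["term", "command"])  # command = (key, value), None for dummy

def apply_committed(log, commit_index, last_applied, state):
    """Apply newly committed entries, in order, to a key-value state machine."""
    while last_applied < commit_index:
        last_applied += 1
        key, value = log[last_applied].command
        state[key] = value
    return last_applied

log = [Entry(0, None), Entry(1, ("x", 1)), Entry(1, ("y", 2)), Entry(3, ("x", 5))]
state = {}
last_applied = apply_committed(log, commit_index=3, last_applied=0, state=state)
# state == {"x": 5, "y": 2}, last_applied == 3
```

Because every node holds the same committed prefix (Log Matching + Leader Completeness) and applies it in the same order, all state machines converge to the same state.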

Python Implementation

RaftNode Class

```python
import random
import time
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, List, Optional


class NodeState(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()


@dataclass
class LogEntry:
    term: int
    command: Any
    index: int


class RaftNode:
    """A node in the Raft cluster"""

    def __init__(self, node_id: str, cluster_nodes: List[str]):
        self.node_id = node_id
        self.cluster_nodes = cluster_nodes
        self.state = NodeState.FOLLOWER

        # Persistent state (survives crashes)
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = [LogEntry(0, None, 0)]  # Dummy at index 0

        # Volatile state
        self.commit_index = 0
        self.last_applied = 0

        # Leader state
        self.next_index: Dict[str, int] = {}   # For each server, next log index to send
        self.match_index: Dict[str, int] = {}  # For each server, highest replicated index

        # Candidate state
        self.votes_received: set = set()

        # Election timing (seconds, since time.time() returns seconds)
        self.election_timeout = random.uniform(0.150, 0.300)
        self.last_heartbeat = time.time()

    def become_candidate(self):
        """Transition to candidate state and start election"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id  # Vote for self
        self.votes_received = {self.node_id}
        self.last_heartbeat = time.time()

    def become_leader(self):
        """Transition to leader state"""
        self.state = NodeState.LEADER

        # Initialize leader state: optimistically assume followers are up to date
        last_log_index = len(self.log) - 1
        for node in self.cluster_nodes:
            if node != self.node_id:
                self.next_index[node] = last_log_index + 1
                self.match_index[node] = 0
```
RequestVote RPC

```python
def request_vote(self, request: VoteRequest) -> VoteResponse:
    """Handle vote request from candidate

    Grant vote if:
    1. Haven't voted in this term (or already voted for this candidate)
    2. Candidate's log is at least as up-to-date as receiver's log
    """
    # Update term if request has higher term
    if request.term > self.current_term:
        self.become_follower(request.term)

    # Deny vote if we're in a later term
    if request.term < self.current_term:
        return VoteResponse(self.current_term, False)

    # Check if we've already voted
    if self.voted_for is not None and self.voted_for != request.candidate_id:
        return VoteResponse(self.current_term, False)

    # Check if candidate's log is at least as up-to-date
    last_log_index = len(self.log) - 1
    last_log_term = self.log[last_log_index].term

    log_is_up_to_date = (
        request.last_log_term > last_log_term or
        (request.last_log_term == last_log_term and
         request.last_log_index >= last_log_index)
    )

    if log_is_up_to_date:
        self.voted_for = request.candidate_id
        self.last_heartbeat = time.time()  # Granting a vote also resets the timer
        return VoteResponse(self.current_term, True)

    return VoteResponse(self.current_term, False)
```

AppendEntries RPC

```python
def append_entries(self, request: AppendEntriesRequest) -> AppendEntriesResponse:
    """Handle append entries request from leader

    Also serves as heartbeat to prevent elections
    """
    # Update term if request has higher term
    if request.term > self.current_term:
        self.become_follower(request.term)

    # Reject if request is from an old term
    if request.term < self.current_term:
        return AppendEntriesResponse(self.current_term, False)

    # Reset election timer (received heartbeat)
    self.last_heartbeat = time.time()

    # Consistency check: verify log contains entry at prev_log_index
    if request.prev_log_index > 0:
        if len(self.log) <= request.prev_log_index:
            return AppendEntriesResponse(self.current_term, False)

        if self.log[request.prev_log_index].term != request.prev_log_term:
            # Remove conflicting entries
            self.log = self.log[:request.prev_log_index]
            return AppendEntriesResponse(self.current_term, False)

    # Append new entries, resolving conflicts
    for i, entry in enumerate(request.entries):
        index = request.prev_log_index + 1 + i
        if index < len(self.log):
            if self.log[index].term != entry.term:
                # Conflict: delete the existing entry and everything after it
                self.log = self.log[:index]
                self.log.append(entry)
            # else: entry already present - leave it in place
        else:
            self.log.append(entry)  # Append new entry

    # Update commit index
    if request.leader_commit > self.commit_index:
        self.commit_index = min(request.leader_commit, len(self.log) - 1)

    match_index = request.prev_log_index + len(request.entries)
    return AppendEntriesResponse(self.current_term, True, match_index)
```

Raft vs Paxos Comparison

| Feature | Raft | Paxos |
|---|---|---|
| Design Goal | Understandability | Theoretical completeness |
| Leader | Strong leader (required) | No leader (Multi-Paxos adds one) |
| Log Structure | Leader's log is source of truth | No distinguished log |
| Membership Changes | Built-in (joint consensus) | Not specified |
| Learning Curve | Easier to understand | Notoriously difficult |
| Implementation | Many open-source implementations | Fewer, mostly proprietary |

Real-World Usage

etcd

Purpose: Distributed key-value store (Kubernetes' backing store)

Consensus: Raft for leader election and log replication

Scale: Millions of deployments worldwide

Consul

Purpose: Service mesh and service discovery

Consensus: Raft for maintaining cluster state

Features: Health checking, KV store, multi-datacenter

CockroachDB

Purpose: Distributed SQL database

Consensus: Raft for replication of data ranges

Innovation: Raft used per-range, not cluster-wide

Interview Tips

Common Questions

Key Insights