Raft is a consensus algorithm designed for understandability. It's equivalent to Paxos in fault-tolerance and performance but much easier to understand and implement.
Developed by: Diego Ongaro & John Ousterhout at Stanford (2013)
Key Goal: "Understandability" - structured to be easy to learn and implement correctly
Core Principle: Strong leader - all log entries flow from leader to followers
| State | Role | Responsibilities | Transitions To |
|---|---|---|---|
| Follower | Passive | • Respond to RPCs from leaders and candidates • Reset election timer on heartbeat • Vote in elections | Candidate (timeout) |
| Candidate | Election participant | • Start election • Request votes from other nodes • Vote for self | Leader (majority) or Follower (lose) |
| Leader | Strong leader | • Accept client requests • Replicate log to followers • Send periodic heartbeats • Manage commit index | Follower (higher term discovered) |
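The transitions in the table form a small state machine. A minimal sketch using plain strings (the event names and `transition` helper are illustrative, not from any particular implementation):

```python
# Legal Raft state transitions, taken from the table above
ALLOWED_TRANSITIONS = {
    ("follower", "candidate"),   # election timeout
    ("candidate", "leader"),     # won a majority of votes
    ("candidate", "follower"),   # lost election or saw a higher term
    ("candidate", "candidate"),  # split vote: start a new election
    ("leader", "follower"),      # discovered a higher term
}

def transition(state: str, event: str) -> str:
    """Apply an event to a state; raise if the move is not legal."""
    targets = {
        "timeout": "candidate",
        "won_majority": "leader",
        "lost_election": "follower",
        "higher_term": "follower",
        "split_vote": "candidate",
    }
    new_state = targets[event]
    if (state, new_state) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Note there is no path from Follower or Leader directly to Leader: the only way to lead is to win an election as a Candidate.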
Node grants vote if ALL conditions met:
1. Candidate's term is at least as large as the node's current term
2. Node has not already voted for a different candidate in this term
3. Candidate's log is at least as up-to-date as the node's (higher last log term wins; if last terms are equal, the longer log wins)
Problem: Multiple candidates split the vote, no one gets majority
Example: With 4 nodes, Candidate A gets 2 votes, Candidate B gets 2 votes (need 3 for majority)
Solution: Randomized election timeouts (150-300ms) - each node draws a fresh random timeout, so one node usually becomes a candidate and wins before the others time out
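Why randomization breaks split votes can be seen with a quick simulation: each node draws its own timeout, and the node with the smallest draw starts its election (and usually collects votes) before anyone else times out. A sketch with illustrative numbers:

```python
import random

random.seed(7)  # seeded only to make the illustration repeatable

# Each node draws an election timeout uniformly from 150-300 ms
timeouts = {node: random.uniform(150, 300) for node in ["A", "B", "C", "D", "E"]}

# The node with the smallest timeout becomes a candidate first
first = min(timeouts, key=timeouts.get)

# Its head start over the second-fastest node is the window in which
# it can gather votes before a competing candidate appears
head_start = sorted(timeouts.values())[1] - timeouts[first]

print(f"{first} times out first, with a {head_start:.1f} ms head start")
```

With a 150 ms spread, two nodes drawing nearly identical timeouts is rare, so repeated split votes quickly become improbable.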
Maintained by: AppendEntries consistency check - the leader includes (prevLogIndex, prevLogTerm) in every AppendEntries RPC, and the follower rejects the request unless its own log contains an entry at prevLogIndex with term prevLogTerm
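The follower's side of that check is just two comparisons. A minimal sketch (the function name and list-of-terms representation are illustrative):

```python
def passes_consistency_check(log_terms, prev_log_index, prev_log_term):
    """Follower-side AppendEntries consistency check.

    log_terms: list where log_terms[i] is the term of the entry at index i
               (index 0 is the dummy entry with term 0).
    """
    if prev_log_index >= len(log_terms):
        return False  # follower's log is too short to contain prev_log_index
    return log_terms[prev_log_index] == prev_log_term

# Follower log terms by index: [dummy, 1, 1, 2]
passes_consistency_check([0, 1, 1, 2], 3, 2)  # accepted: entry 3 has term 2
passes_consistency_check([0, 1, 1, 2], 3, 3)  # rejected: term mismatch
passes_consistency_check([0, 1, 1, 2], 5, 2)  # rejected: log too short
```

By induction, a match at (prevLogIndex, prevLogTerm) implies the two logs are identical up to that index, which is exactly the Log Matching property.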
```python
# Leader's log replication algorithm
def replicate_to_follower(follower_id):
    while True:
        next_index = leader.next_index[follower_id]
        prev_index = next_index - 1
        prev_term = leader.log[prev_index].term

        # Send entries from next_index onwards
        entries = leader.log[next_index:]
        response = follower.append_entries(
            prev_log_index=prev_index,
            prev_log_term=prev_term,
            entries=entries,
        )

        if response.success:
            # Follower accepted - update tracking
            leader.match_index[follower_id] = prev_index + len(entries)
            leader.next_index[follower_id] = leader.match_index[follower_id] + 1
            break
        else:
            # Follower rejected - decrement next_index and retry.
            # This will eventually find the point where the logs match
            # (next_index never drops below 1, since the dummy entry at
            # index 0 always matches)
            leader.next_index[follower_id] -= 1
```
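The pseudocode above updates match_index but does not show how the leader turns those values into a new commit index. One common bookkeeping sketch (function and parameter names are hypothetical): commit the highest index replicated on a majority, and only for entries from the leader's current term (the rule from Figure 8 of the Raft paper):

```python
def advance_commit_index(match_index, own_last_index, log_terms,
                         current_term, commit_index):
    """Return the leader's new commit index.

    match_index: dict follower_id -> highest log index known replicated there
    own_last_index: the leader's own last log index (it always matches itself)
    log_terms: dict index -> term of the log entry at that index
    """
    # Replication progress for every cluster member, leader included
    indices = sorted(list(match_index.values()) + [own_last_index])

    # The (n-1)//2-th smallest value is the highest index present
    # on a strict majority of the cluster
    majority_index = indices[(len(indices) - 1) // 2]

    # Only commit entries from the current term directly; older entries
    # become committed implicitly once a current-term entry commits
    if majority_index > commit_index and log_terms[majority_index] == current_term:
        return majority_index
    return commit_index
```

For a 5-node cluster with match_index values {2, 2, 0, 0} and the leader's own last index 2, index 2 is on three of five nodes, so it commits once it belongs to the current term.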
| Property | Guarantee | How Achieved |
|---|---|---|
| Election Safety | At most one leader per term | Each node votes for at most one candidate per term |
| Leader Append-Only | Leader never overwrites/deletes log entries | Leader only appends to its log |
| Log Matching | Identical entries at same index/term | AppendEntries consistency check |
| Leader Completeness | Committed entries present in future leaders | Vote restriction - candidate must have all committed entries |
| State Machine Safety | All nodes apply same commands in same order | Log matching + leader completeness |
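The reference implementation in this section uses several supporting types (NodeState, LogEntry, and the RPC request/response messages) that are not shown. A minimal sketch of what they might look like, using dataclasses (field names are chosen to match the code, but the exact shapes are an assumption):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    term: int      # term in which the entry was created
    command: Any   # state-machine command (None for the dummy entry)
    index: int     # position in the log

@dataclass
class VoteRequest:
    term: int
    candidate_id: str
    last_log_index: int
    last_log_term: int

@dataclass
class VoteResponse:
    term: int
    vote_granted: bool

@dataclass
class AppendEntriesRequest:
    term: int
    leader_id: str
    prev_log_index: int
    prev_log_term: int
    entries: List[LogEntry] = field(default_factory=list)  # empty = heartbeat
    leader_commit: int = 0

@dataclass
class AppendEntriesResponse:
    term: int
    success: bool
    match_index: int = 0
```

An AppendEntriesRequest with an empty entries list doubles as the heartbeat, which is why followers reset their election timer on every accepted AppendEntries.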
```python
import random
import time
from typing import Dict, List, Optional


class RaftNode:
    """A node in the Raft cluster"""

    def __init__(self, node_id: str, cluster_nodes: List[str]):
        self.node_id = node_id
        self.cluster_nodes = cluster_nodes
        self.state = NodeState.FOLLOWER

        # Persistent state (survives crashes)
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = [LogEntry(0, None, 0)]  # Dummy at index 0

        # Volatile state
        self.commit_index = 0
        self.last_applied = 0

        # Leader state
        self.next_index: Dict[str, int] = {}   # For each server, next log index to send
        self.match_index: Dict[str, int] = {}  # For each server, highest replicated index

        # Election timing
        self.election_timeout = random.uniform(150, 300)  # milliseconds
        self.last_heartbeat = time.time()

    def become_follower(self, term: int):
        """Step down to follower after discovering a higher term"""
        self.state = NodeState.FOLLOWER
        self.current_term = term
        self.voted_for = None

    def become_candidate(self):
        """Transition to candidate state and start election"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id  # Vote for self
        self.votes_received = {self.node_id}
        self.last_heartbeat = time.time()

    def become_leader(self):
        """Transition to leader state"""
        self.state = NodeState.LEADER
        # Initialize leader state
        last_log_index = len(self.log) - 1
        for node in self.cluster_nodes:
            if node != self.node_id:
                self.next_index[node] = last_log_index + 1
                self.match_index[node] = 0

    def request_vote(self, request: VoteRequest) -> VoteResponse:
        """Handle vote request from candidate

        Grant vote if:
        1. Haven't voted in this term (or already voted for this candidate)
        2. Candidate's log is at least as up-to-date as receiver's log
        """
        # Update term if request has higher term
        if request.term > self.current_term:
            self.become_follower(request.term)

        # Deny vote if we're in a later term
        if request.term < self.current_term:
            return VoteResponse(self.current_term, False)

        # Check if we've already voted for someone else
        if self.voted_for is not None and self.voted_for != request.candidate_id:
            return VoteResponse(self.current_term, False)

        # Check if candidate's log is at least as up-to-date
        last_log_index = len(self.log) - 1
        last_log_term = self.log[last_log_index].term
        log_is_up_to_date = (
            request.last_log_term > last_log_term or
            (request.last_log_term == last_log_term and
             request.last_log_index >= last_log_index)
        )

        if log_is_up_to_date:
            self.voted_for = request.candidate_id
            self.last_heartbeat = time.time()  # granting a vote resets the timer
            return VoteResponse(self.current_term, True)

        return VoteResponse(self.current_term, False)

    def append_entries(self, request: AppendEntriesRequest) -> AppendEntriesResponse:
        """Handle append entries request from leader

        Also serves as heartbeat to prevent elections
        """
        # Update term if request has higher term
        if request.term > self.current_term:
            self.become_follower(request.term)

        # Reject if request is from an old term
        if request.term < self.current_term:
            return AppendEntriesResponse(self.current_term, False)

        # Reset election timer (received heartbeat)
        self.last_heartbeat = time.time()

        # Consistency check: verify log contains entry at prev_log_index
        if request.prev_log_index > 0:
            if len(self.log) <= request.prev_log_index:
                return AppendEntriesResponse(self.current_term, False)
            if self.log[request.prev_log_index].term != request.prev_log_term:
                # Remove conflicting entries
                self.log = self.log[:request.prev_log_index]
                return AppendEntriesResponse(self.current_term, False)

        # Append new entries, truncating the log at the first term conflict
        for i, entry in enumerate(request.entries):
            index = request.prev_log_index + 1 + i
            if index < len(self.log):
                if self.log[index].term != entry.term:
                    # Conflicting entry: delete it and everything after it
                    self.log = self.log[:index]
                    self.log.append(entry)
                # Same term at same index: entry is already present, keep it
            else:
                self.log.append(entry)  # Append new entry

        # Update commit index
        if request.leader_commit > self.commit_index:
            self.commit_index = min(request.leader_commit, len(self.log) - 1)

        match_index = request.prev_log_index + len(request.entries)
        return AppendEntriesResponse(self.current_term, True, match_index)
```
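The vote restriction that yields Leader Completeness compares logs by last term first and last index second. As a standalone predicate (a hypothetical helper, equivalent to the inline check in request_vote above):

```python
def at_least_as_up_to_date(cand_last_term, cand_last_index,
                           voter_last_term, voter_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's.

    A log with a higher last term is more up-to-date regardless of length;
    with equal last terms, the longer log is more up-to-date.
    """
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index

# A shorter log can still win a vote if its last term is newer
at_least_as_up_to_date(3, 5, 2, 9)  # candidate wins on term
at_least_as_up_to_date(2, 4, 2, 6)  # voter's log is longer: vote denied
```

Because a committed entry exists on a majority, and a candidate needs votes from a majority, at least one voter holds the entry and will deny any candidate whose log lacks it.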
| Feature | Raft | Paxos |
|---|---|---|
| Design Goal | Understandability | Theoretical completeness |
| Leader | Strong leader (required) | No leader (Multi-Paxos adds one) |
| Log Structure | Leader's log is source of truth | No distinguished log |
| Membership Changes | Built-in (joint consensus) | Not specified |
| Learning Curve | Easier to understand | Notoriously difficult |
| Implementation | Many open-source implementations | Fewer, mostly proprietary |
etcd
Purpose: Distributed key-value store (Kubernetes' backing store)
Consensus: Raft for leader election and log replication
Scale: Millions of deployments worldwide
Consul
Purpose: Service mesh and service discovery
Consensus: Raft for maintaining cluster state
Features: Health checking, KV store, multi-datacenter
CockroachDB
Purpose: Distributed SQL database
Consensus: Raft for replication of data ranges
Innovation: Raft used per-range, not cluster-wide