Making sense of the RAFT Distributed Consensus Algorithm — Part 2

Kousik Nath · Published in codeburst · Dec 12, 2020 · 19 min read


In part 1 of this series, we covered the basics of Raft. Please go through that part first if you haven’t already.

In this part, we’ll concentrate on the Raft replication technique in detail. The concepts explained here are core to Raft.

Raft Replication

The core idea behind any consensus algorithm is that for a given piece of data ( a key ), at any point in time, all or a majority of the cluster members should return the same value. Raft uses replication to achieve this. Replication has been used to build fault tolerance & redundancy in distributed systems for ages. Raft maintains identical replicated state machines across all nodes by making sure that the client commands saved in the logs are identical & appear in the same order.

Once a leader is elected, the high level replication process looks like this:

Figure 10: High level replication, Courtesy: The Raft extended paper

The consensus module shown above is a logical layer that exists in every node. At a high level, it accepts a client command & takes care of replicating & committing it as per the algorithm:

Step 1: The client ( e.g., a distributed database system ) sends a command ( e.g., something like an INSERT command in SQL ) to the server.
Step 2: The consensus module at the leader handles the command: it puts it into the leader’s log & sends it to all other nodes in parallel.
Step 3: If a majority of the nodes including the leader replicate the command successfully to their local logs & acknowledge it to the leader, the leader then commits the command & applies it to its own state machine.
Step 4: The leader acknowledges the status of the commit to the client.

Q. How is a log entry represented?
A. A log entry typically contains the following information ( a minimal Go sketch follows the list ):

index: An increasing sequence number for each entry in the log
term: The leader’s current term at the time it received the entry from the client
command: The actual client data that we want to store in the system
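As a concrete illustration, here is what such an entry could look like in Go. This is a minimal sketch: the field names are assumptions, and many implementations ( including the one referenced at the end of this article ) keep the index implicit as the entry’s position in the log slice instead of storing it explicitly.

```go
// LogEntry is a minimal sketch of a single replicated log entry.
type LogEntry struct {
	Index   int         // position of the entry in the log
	Term    int         // term in which the leader received the entry from the client
	Command interface{} // the opaque client command, e.g. a serialized INSERT statement
}
```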

Q. What is the importance of majority / quorum in the context of Raft replication?
A.
Writing to a majority ( e.g., 3 out of 5 nodes in a 5-node cluster ) makes sure that as long as any 3 nodes are alive & connected to the cluster, we won’t lose data, since any majority of live nodes contains at least one node that holds the data. To guarantee no data loss in the event of a failure, Raft uses quorum.
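The quorum arithmetic itself is tiny; here is a sketch for illustration ( isQuorum is a hypothetical helper, not part of Raft’s specification ):

```go
// isQuorum reports whether acks form a strict majority of the cluster.
// With clusterSize = 5, at least (5/2)+1 = 3 acknowledgements are required.
func isQuorum(acks, clusterSize int) bool {
	return acks >= clusterSize/2+1
}
```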

Q. Do participating nodes manage any states / variables?
A.
The following states are managed by all of the nodes. The indices into the log or the index arrays are 1-based in the description below; however, you could use 0-based indices as well. Read the following diagram carefully, as these states / variables are going to be used heavily in further discussion.

Figure 11: Node variables
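For reference while reading the rest of this part, here is a hedged Go sketch of the per-node state from Figure 11 ( using 0-based log indices, as the code snippets below do; the field names follow the Raft paper, but the exact layout is an assumption ):

```go
// NodeState sketches the variables every Raft node manages.
type NodeState struct {
	// Persistent state on all servers
	currentTerm int        // latest term this node has seen
	votedFor    int        // candidate id voted for in currentTerm, -1 if none
	log         []LogEntry // the replicated log

	// Volatile state on all servers
	commitIndex int // index of the highest log entry known to be committed
	lastApplied int // index of the highest log entry applied to the state machine

	// Volatile state on the leader, reinitialized after every election
	nextIndex  map[int]int // per peer: index of the next log entry to send
	matchIndex map[int]int // per peer: highest log index known to be replicated on that peer
}
```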

Q. What is the significance of lastApplied?
A.
lastApplied keeps track of the index in the local log up to which entries have been applied to the node’s local state machine. Remember that committing an entry to the log does not mean it is applied immediately. Typically the component that applies entries to the state machine is different from the one that handles committing to the log; assume separate threads are used for performance. Hence the lastApplied state is managed separately. commitIndex is usually a few milliseconds ahead of lastApplied, although lastApplied eventually catches up with commitIndex.

Q. What is the significance of matchIndex[]?
A.
Let’s look into this after we discuss Algorithm 2 in a later section.

Let’s take a look at what the RPC request & response structures look like. Their structure is going to be very important in the upcoming discussion. They are pretty much self-explanatory:

AppendEntries (AE) RPC Request & Response:

Figure 12: Append Entries RPC request response
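Here is a sketch of what the AppendEntries request & response in Figure 12 could look like as Go structs ( the names mirror the fields used in the discussion below; the exact shapes are assumptions ):

```go
// AppendEntriesArgs is the request the leader sends to each follower.
type AppendEntriesArgs struct {
	Term         int        // leader's current term
	LeaderId     int        // so followers can redirect clients to the leader
	PrevLogIndex int        // index of the log entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // entries to replicate; empty for a heartbeat
	LeaderCommit int        // leader's commitIndex
}

// AppendEntriesReply is the follower's response.
type AppendEntriesReply struct {
	Term    int  // follower's current term, so the leader can update itself
	Success bool // true if the follower matched PrevLogIndex / PrevLogTerm and appended
}
```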

RequestVote (RV) RPC request & response

Figure 13: Request Vote RPC request response
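And similarly for the RequestVote RPC in Figure 13 ( again, a sketch under the same assumptions ):

```go
// RequestVoteArgs is sent by a candidate asking for a vote.
type RequestVoteArgs struct {
	Term         int // candidate's current term
	CandidateId  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// RequestVoteReply is the peer's response.
type RequestVoteReply struct {
	Term        int  // peer's current term, so the candidate can update itself
	VoteGranted bool // true means this peer voted for the candidate
}
```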

Important Observation:
When a node receives an AppendEntries or RequestVote request from a sender, it returns its current term in the response. This causes the sender to update its own term in case the receiving node has a higher term, so that eventually all the nodes can agree on the appropriate term. The term update happens the other way as well, i.e., if the sender has a higher term, the receiver updates its own term when it receives the request. As stated earlier, each node manages its own term, so this mechanism is crucial to ensure that terms across nodes eventually converge to a single monotonically increasing value.

As you can see in Figure 11, the leader keeps track of the highest committed log entry index in a volatile-state variable called commitIndex. It sends this value in the AppendEntries RPC in the leaderCommit field so that the followers get to know the leader’s latest commit index & can commit those entries as well. We’ll see how this happens when we discuss the related algorithms in detail later.

As stated earlier, Raft ensures that logs across all the nodes are exactly the same. This keeps the log abstraction simple, but Raft has to employ certain techniques to give proper safety guarantees around this design decision.

Raft Guarantees

Raft is built around certain properties & it guarantees that these properties always hold true:

Leader Election Safety

For a given term, at most one leader can be elected. Since Raft uses quorum, a candidate cannot become the leader unless it gets (N/2) + 1 votes in the election process. This means at most one leader can be chosen in a term.

Append-Only Leader

A leader never overwrites or deletes an entry in its log. It only appends.

Log Matching Property

  • Property 1: If two entries in different logs have the same index & term, they store the same command. Since the leader stores a given command only once at a certain index with a certain term, this property always holds. It also holds trivially when comparing empty logs.
  • Property 2: If two entries in different logs have the same index & term, then the logs are identical in all preceding entries.

Leader Completeness

Once a log entry is committed in a given term, it will be present in the logs of the leaders of all higher terms. So a committed entry never gets lost, even when the leader changes.

State Machine Safety

If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

Remember, whatever algorithms & success or failure cases we discuss later, they all revolve around these properties.

Important Algorithms

Since we are diving deep into replication, let’s look at the following sample code snippets to understand the algorithm on both the leader & the follower side. The code will help you visualize the different variables & states managed by the servers ( it’s written in Go, but you don’t need to know Go to understand it ):

Note: We’ll focus only on the core parts of the algorithms; concurrency constructs like locking, sending messages through Go channels & other such details are out of the scope of this article.

Leader Side Log Replication Algorithm

Let’s assume that we already have an acting leader. The algorithm describes what happens afterwards:

Algorithm 1: Leader Append Entries

Steps

  1. Line 6: Once a node is chosen as the leader, the consensus module assigns the Leader state to the node.
  2. Line 8–11: nextIndex[peerId] for each peer node is initialized to the length of the leader’s log. Remember, in the code the log index is 0-based. When the leader starts up, it can only assume that all other peers are as up-to-date as it is. If some peer is lagging behind, the nextIndex[peerId] for that peer gets adjusted accordingly. Also, matchIndex[peerId] for each peer is initialized to -1; it keeps track of the maximum log index up to which the peer exactly matches the leader.
  3. Line 15: The leader starts a timer of 50 ms ( you can choose any appropriate timeout value ) to periodically send AppendEntries RPCs to the followers. If there is no log entry to send, an AppendEntries with an empty entries[] is sent, which is considered a heartbeat. The heartbeat is necessary to prevent an unnecessary leader election when the current leader is working fine but has no entries to replicate yet. Typically, when the leader starts, it immediately sends a heartbeat to the followers.
  4. Line 19–29: The leader keeps on sending AppendEntries RPC in the background.
  5. Line 35: The method name leaderSendHeartbeats() is a bit misleading here since it actually sends AppendEntries RPCs. As stated earlier, if there are no logs to send, that RPC can be called a heartbeat. However, even when there are logs to send, it can still be considered a heartbeat, since any AppendEntries RPC logically means the leader is healthy; that’s why it keeps sending these messages.
  6. Line 43–58: This is the AppendEntries request preparation phase. For the log consistency check, we need to send the previous log entry’s index & term. prevLogIndex is initialized to nextIndex[peerId] - 1; if the log is empty, nextIndex[peerId] = 0, prevLogIndex = -1 & prevLogTerm = -1. Followers are able to handle these negative values. If log entries exist from nextIndex[peerId] onwards, the RPC sends all of them in the entries[] array. If no suitable log entries exist, entries[] is empty.
    Note that line 57 sends the current leader commit index to the followers. It helps the followers identify up to what index the leader has committed its log; based on it, followers can also commit entries to their individual state machines.
  7. Line 62: AppendEntries RPC is sent to all the followers in parallel.
  8. Line 65–69: The AppendEntries response from a follower contains its current term. If the follower is a more suitable leader than the current leader, the response contains a higher term than the current leader’s term. In that case, the current leader steps down, becomes a follower & resets its election timer & other states as necessary. This step ensures that Raft has a single leader & that the most suitable leader is given preference. We’ll see the leader election algorithm in later sections to clarify this part.
  9. Line 71–75: If the current leader is the most suitable one & the follower successfully replicates entries[] to its log, it returns success = true in the AppendEntries response. At this point, the leader advances nextIndex[peerId] to its previous value plus the number of log entries sent in the request for that follower, & matchIndex[peerId] is set to nextIndex[peerId] - 1 since the follower successfully replicated all the entries sent.
  10. Line 77–90: In this step, the leader finds the maximum log index beyond the last commit index up to which a majority of peers ( including the leader itself ) have successfully replicated the leader’s log; this becomes the leader’s new commitIndex. As we have seen, matchIndex[] keeps track, per replica, of the log index up to which the leader’s log has been successfully replicated, & the leader uses this information here to figure out the commit index for the current term ( see the sketch after this list ).
  11. In the next AppendEntries RPC call, the leader passes the new commit index in the leaderCommit field. The followers apply some logic using this field to identify up to what index they can commit their individual logs. If there is no change in the commit index since the last one, no issue arises as there is nothing new to commit.
  12. Line 95–98: In line 71–75, the follower successfully replicates the leader’s log & returns success because the prevLogIndex & prevLogTerm sent by the leader match the follower’s log ( we’ll see this logic in the following section ). If the follower could not keep up with the leader ( probably because it crashed mid-way or a network partition happened ), these values won’t match. So the follower returns false in the AppendEntries response & the leader adjusts nextIndex[peerId] for that follower by decrementing it. This process is retried until the follower catches up with the leader. There are possible optimizations to reduce the number of such calls, but they are out of scope here.
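To make steps 9 & 10 concrete, here is a minimal sketch ( not the exact code of Algorithm 1 ) of how the leader could advance its commitIndex from matchIndex[]. It assumes the NodeState & LogEntry sketches shown earlier, ignores locking & persistence, and follows the Raft rule of only directly committing entries from the current term:

```go
// advanceCommitIndex sketches step 10: find the highest index i beyond the
// current commitIndex that belongs to the current term and is replicated on
// a majority of the cluster, then advance commitIndex to it.
func (n *NodeState) advanceCommitIndex(clusterSize int) {
	for i := n.commitIndex + 1; i < len(n.log); i++ {
		if n.log[i].Term != n.currentTerm {
			continue // Raft never directly commits entries from previous terms
		}
		count := 1 // the leader itself already has the entry
		for _, m := range n.matchIndex {
			if m >= i {
				count++
			}
		}
		if count*2 > clusterSize { // a strict majority has replicated entry i
			n.commitIndex = i
		}
	}
}
```

The new commitIndex is then shipped to the followers in the leaderCommit field of the next AppendEntries RPC, as described in step 11.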

Follower Side Replication Algorithm

The algorithm to append entries on the follower side is relatively simpler & goes like this:

Algorithm 2: Follower Append Entries

Steps

Line 19: The method parameter args AppendEntriesArgs represents the AppendEntries RPC request passed by the leader to the followers.

  1. Line 27: If the leader has a higher term, the receiving node updates its term & resets its election timer; if it’s not already in the Follower state, it becomes a follower.
  2. Line 33: If the leader’s current term matches with the current term of the follower, the follower attempts to replicate the log.
  3. Line 42–43: The follower checks whether its log entry at index prevLogIndex has term prevLogTerm ( both sent in the RPC request by the leader ). If the terms don’t match, or the follower has a shorter log than the leader, the request fails & false is returned in the RPC response; step 12 of Algorithm 1 then retries until it finds an index in the follower’s log where the leader’s & follower’s terms match. This step completes the log consistency check mentioned in step 6 of Algorithm 1.
  4. Line 49–71: AppendEntries RPC is idempotent. Consider a scenario where the leader sends a valid AppendEntries request to a follower & the follower replicates the entries properly but fails to acknowledge success, probably due to a temporary network glitch. Since the leader has not received the response after some time, it retries the same request. So the follower should be able to identify that the log entries in the request have already been applied; it should not re-replicate them, which saves some disk IO on a duplicate request. This is what the code segment here does. It tries to find all log entries that match the request by index & term. The moment either the end of the log is reached or a mismatch is found, the remaining entries in the entries[] of the AppendEntries request get replicated. So if log entries exist after the mismatch index in the follower’s log, they get overwritten.
  5. Look at step 11 of the leader-side algorithm: we mentioned that the leader sends its commit index to the followers in the AppendEntries RPC call.
    Line 74–77: This code segment identifies the commit index on the follower side. The minimum of the leader’s commit index & the follower log’s size - 1 ( the log is 0-indexed, as we mentioned earlier ) is taken as the commit index. The log up to that index is then applied to the follower’s state machine ( see the sketch after this list ).
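Here is a hedged sketch of the follower-side core from steps 3 to 5: the consistency check on prevLogIndex / prevLogTerm, the idempotent append that only overwrites from the first mismatching entry, and the commit index update. It reuses the structs sketched earlier & leaves out locking, persistence & the term / state transitions of lines 27–33:

```go
// handleAppendEntries sketches the core of Algorithm 2, assuming a 0-based log.
func (n *NodeState) handleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	reply := AppendEntriesReply{Term: n.currentTerm}
	if args.Term < n.currentTerm {
		return reply // stale leader: Success stays false
	}
	// Consistency check (lines 42–43): the follower must have an entry at
	// PrevLogIndex whose term is PrevLogTerm; PrevLogIndex == -1 means the
	// leader is sending from the very beginning of the log.
	if args.PrevLogIndex >= 0 &&
		(args.PrevLogIndex >= len(n.log) || n.log[args.PrevLogIndex].Term != args.PrevLogTerm) {
		return reply // the leader will decrement nextIndex and retry
	}
	// Idempotent append (lines 49–71): skip entries that already match,
	// then truncate at the first mismatch and append the rest.
	insertAt, newAt := args.PrevLogIndex+1, 0
	for insertAt < len(n.log) && newAt < len(args.Entries) &&
		n.log[insertAt].Term == args.Entries[newAt].Term {
		insertAt++
		newAt++
	}
	if newAt < len(args.Entries) {
		n.log = append(n.log[:insertAt], args.Entries[newAt:]...)
	}
	// Commit index update (lines 74–77): min(leaderCommit, last index in the log).
	if args.LeaderCommit > n.commitIndex {
		n.commitIndex = args.LeaderCommit
		if last := len(n.log) - 1; n.commitIndex > last {
			n.commitIndex = last
		}
	}
	reply.Success = true
	return reply
}
```

Entries up to the new commitIndex are later applied to the state machine by the component that advances lastApplied, as discussed earlier.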

Q. What is the significance of matchIndex[] in the leader, how is it different from nextIndex[]?
A.
matchIndex[peerId] in the leader keeps track of the index up to which the entries in that particular follower’s log exactly match the leader’s. The follower can still have more log entries beyond that index, probably not committed yet, because in Algorithm 2 line 75 we observe that followers commit logs only when they receive an appropriate leader commit index in an AppendEntries RPC from the leader, i.e., the leader commits some log entry & transmits the commit information in the next AppendEntries RPC, so followers wait for the next RPC before committing any log entry. Hence there is a possibility that nextIndex[peerId] points to an index in the follower’s log which is not yet committed on the follower. So logically the leader can replicate logs from matchIndex[peerId] + 1 to nextIndex[peerId] to the follower. Eventually, in an ideal scenario, matchIndex[peerId] catches up to nextIndex[peerId] - 1 ( as set in step 9 of Algorithm 1 ).

In other words, nextIndex[peerId] is the leader’s best guess about the index from which to replicate that follower’s log next; if the follower is not up-to-date, the leader adjusts this value, as observed in Algorithm 1, line 96. matchIndex[peerId] is the exact index up to which the two logs currently match at a given point in time.

In the “Leader Election” section earlier, we saw the basic leader election scenario when the cluster starts up, i.e., when there is no log in the system. Since we have already seen the replication algorithm on both the leader & follower side, it’s a good time to examine how the presence of logs in the system affects leader election & voting.

Leader Election Algorithm in the Presence or Absence of Logs

In the absence of a leader, any node whose election timeout expires can initiate the vote request process in the hope of becoming the leader.

Algorithm 3: Leader Election

Steps

  1. Line 6: The current node assigns the Candidate state to itself since only candidates can participate in voting.
  2. Line 7: The candidate increments its current term since the previous term expired with the previous leader.
  3. Line 10: The candidate votes for itself.
  4. Line 19: The candidate retrieves the last index of its log & the corresponding term.
  5. Line 22–26: With above information, the candidate initiates the RequestVote RPC process.
  6. Line 31: Other nodes / peers are asynchronously called with RequestVote RPC.
  7. Line 41–44: If any peer responds to the RPC with a higher term number, that peer is possibly ahead in the race to become the leader. So, to make sure that the leader election is smooth & only one leader exists at a point in time, the candidate steps down, becomes a follower & resets its election timer.
  8. Line 45–53: If a majority of the peers respond with the same term as the candidate & grant their votes ( including the candidate’s vote for itself, as we saw in step 3 ), the candidate wins the election & becomes the new leader. Hurrah!

Q. What happens if the process fails to elect a leader?
A.
The process continues until some candidate is elected as the leader. Typically, a leader gets elected within one or two tries.

Q. Can there be multiple nodes in Candidate state at the same time?
A.
Yes, that’s quite possible. There can also be competition among multiple candidates; however, the randomized timeout discussed earlier is designed to minimize such competition.

The leader election process looks quite simple; however, not every candidate can win the election. There are certain rules which peers follow when voting for a candidate to make sure that the elected leader is the most suitable one.

Vote Request Algorithm

Algorithm 4: Request Vote

Steps

  1. Line 17: A candidate invokes its peer’s request vote RPC with request parameters represented as args RequestVoteArgs.
  2. Line 23: The peer retrieves the last log index & corresponding term from its own log.
  3. Line 26–29: If the requesting candidate has a higher term than the peer, the peer can step down to become a Follower, as it does not make sense to continue as a Candidate: it would never gather enough votes to win any election with its lower term. The peer also updates its term to match the requester’s.
  4. Line 31–37: This code block is very crucial as it determines who can vote for a candidate & who can’t. It conforms to the “Leader Election Safety” guarantee of Raft.
    Following checks are performed while voting:
    - If the peer’s term does not match with the candidate’s term, no vote is granted.
    - votedFor indicates which candidate the peer has already voted for in the current term. A peer can vote for only one candidate in a given term. RequestVote is idempotent, i.e., if the peer has already voted for a candidate C & the same request is retried, the peer can still grant the vote to C. So the peer grants a vote to a candidate only when it has not yet voted for any candidate or it has already voted for the same candidate earlier. Otherwise the vote is rejected.
    - The most important condition: a vote is granted only if the requesting candidate’s log is at least as up-to-date as the peer’s. It means that if the candidate’s lastLogTerm in the RequestVote RPC is greater than the peer’s last log term, the vote can be granted. If both last log terms are the same, the candidate should have an equal or longer log than the peer. Otherwise, the vote can’t be granted. This check ensures that we don’t lose any data by mistake while choosing the leader ( see the sketch after this list ).
  5. Once a vote is granted to a candidate, the peer updates its votedFor state with the candidate’s id.
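As a hedged sketch of the voting rule above ( reusing the earlier NodeState & RequestVoteArgs sketches; the helper names are illustrative ):

```go
// logUpToDate sketches the "at least as up-to-date" rule: the candidate wins
// the comparison if its last log term is higher, or the terms are equal and
// its log is at least as long as the peer's.
func (n *NodeState) logUpToDate(args RequestVoteArgs) bool {
	lastIndex, lastTerm := len(n.log)-1, -1
	if lastIndex >= 0 {
		lastTerm = n.log[lastIndex].Term
	}
	return args.LastLogTerm > lastTerm ||
		(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)
}

// grantVote combines the checks from lines 31-37: matching term, at most one
// vote per term (idempotent for the same candidate), and an up-to-date log.
func (n *NodeState) grantVote(args RequestVoteArgs) bool {
	return args.Term == n.currentTerm &&
		(n.votedFor == -1 || n.votedFor == args.CandidateId) &&
		n.logUpToDate(args)
}
```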

Let’s take an example to understand voting & replication in a little more detail:

Figure 14, Courtesy: Raft Thesis

Step a: S1 is elected as the leader. The current term is 2. Let’s say X = the log entry at index 2 of S1 with term 2. Before replicating X to the majority, S1 crashes.

Step b: S5 becomes the new leader with votes from S3, S4 & itself, since its term & log are at least as up-to-date as those of both S3 & S4. After being elected, S5 accepts a new entry at index 2 with term 3; let’s call it Y.
Q. Why is S5 chosen with term 3 not 2?
A.
When S1 is elected as the leader for term 2, the example assumes that S5 also votes for S1 along with the others. So at least a majority of the nodes, including S5, know that term 2 is the highest term seen so far in the cluster; hence they update their local term as described in Algorithm 4, line 26. While initiating the voting process for the next leader election, S5 naturally asks for leadership in term 3.

Step c: Before replicating Y to the majority, S5 crashes. A new election happens. Both S1 & S2 can win the election since, among S1, S2, S3 & S4, they have the most up-to-date logs & there is a clear majority. S1 wins the election for term 4. It continues replication & replicates X to S3. Note that X is now replicated to the majority but still uncommitted: since S1’s current term 4 is greater than X’s term 2, S1 can’t commit the entry, as Raft does not allow directly committing logs from previous terms ( see lines 78 to 85 in Algorithm 1 ). Even though an entry from a previous term gets replicated to some followers, Raft does not keep track of how many replicas have replicated that entry. Raft’s design is simple: it keeps track of the replica count only for entries belonging to the current term. Once an entry from the current term is committed, older entries get committed indirectly anyway.

Q. What if the current leader does not get any entry / client request in the current term, in that case, won’t the uncommitted entries belonging to any previous term get committed?
A. To tackle this scenario, once a new leader is elected, it can add an empty marker ( no-op ) log entry with the current term to its log & immediately send an AppendEntries RPC to the peers, which forces the followers to replicate all the entries currently in the leader’s log. Replication then happens as described earlier in Algorithms 1 & 2.
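A minimal sketch of this idea, assuming the earlier NodeState & LogEntry sketches ( the no-op handling shown is an illustration, not the exact code of the algorithms above ):

```go
// becomeLeader sketches the no-op trick: appending an empty marker entry
// stamped with the new term lets the leader commit it via the normal majority
// rule, and thereby indirectly commit all earlier uncommitted entries, even
// if no client request arrives in this term.
func (n *NodeState) becomeLeader() {
	n.log = append(n.log, LogEntry{
		Index:   len(n.log),
		Term:    n.currentTerm,
		Command: nil, // no-op marker; the state machine applies it as "do nothing"
	})
	// ...followed immediately by AppendEntries RPCs to all peers, as in Algorithm 1.
}
```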

Coming back to step c: S1 accepts a new entry at index 3 with term 4; let’s call it Z. S1 now crashes.

Depending on the replication status of Z, there are now two choices:

Step d1: If Z is not yet replicated to the majority by S1, S5 can win the election again with votes from S2, S3 & S4, since the last log entry of S5, i.e., Y, has term 3, whereas S2 & S3 have term 2 in their last log entries. If S5 is elected, it can overwrite the entry at index 2 on all other nodes, so X now gets lost from index 2. Since X was replicated to the majority but not committed yet, the client is still waiting for a success response from the leader, so this risk is acceptable & the overwrite can happen. Hence all nodes end up with Y at index 2.

Step d2: But if, before S5 is elected as the new leader, Z gets replicated to S2 & S3, the replication reaches a majority. In that case, even though Z is not committed yet, S5 can’t win the election as its log is not as up-to-date as the majority’s ( S5’s last log entry has term 3, whereas the majority’s last log entries have term 4 ). Hence, if S1 wins the election again, it commits Z & X also gets committed indirectly.

Pull vs Push Model

Raft is based on a push model, where the leader takes the responsibility of keeping track of the next index & match index of all the followers & drives the replication process in the cluster. However, this is not mandatory. Depending on your system, you could choose a pull model as well, where a follower takes care of its own replication; you would need to make the necessary changes in the code to achieve that.

Quick Summary

  • Raft works on the principle of distributed log replication.
  • Logs are replicated to the followers by the leader, which ensures that the logs across the nodes in the cluster are in the same order as the leader’s.
  • The leader sends log replication request through AppendEntries RPC to the peers in parallel.
  • AppendEntries is treated as a heartbeat when there are no entries present in the request.
  • Raft replication revolves around five guarantees — Leader Election Safety, Append Only Leader, Log Matching Property, Leader Completeness, State Machine Safety.
  • The leader’s log never gets overwritten.
  • Both AppendEntries & RequestVote RPC requests carry current term of the requester to the peers so that peers can update their own term if they are lagging behind.
  • Similarly peers also pass on their current term to the requester through AppendEntries & RequestVote RPC response so that the requester can update its term in case it’s lagging behind.
  • The leader sends the previous log entry’s index & term to the follower in the AppendEntries RPC. The follower checks whether its log contains an entry at that index with a matching term. If no match is found, the logs are not identical; the leader adjusts these values for the follower & sends the RPC again. The process goes on until the leader & the follower agree on a point where their logs match. This step is used as the log consistency check. Once the logs match, the follower can overwrite non-matching log entries with new entries.
  • The leader only commits log entries from the current term by counting whether majority of the followers have replicated the entry.
  • When there is no leader in the cluster, a random node becomes a candidate once its election timeout expires. It votes for itself & sends RequestVote RPCs to all other nodes.
  • While voting, in order to ensure no data gets lost in the system, it’s very important to guarantee leader election safety. A candidate is granted a peer’s vote only if its last log term & index are at least as up-to-date as the peer’s.
  • If a log entry is replicated to the majority, irrespective of whether it got committed by the previous leader or not, the new leader that gets elected will contain that log entry in its log.
  • Any node can vote only once per term.
  • If a node has already voted for a candidate & the same candidate requests the same peer again for vote in the same term, the vote can be granted.
  • Raft never directly commits an uncommitted entry from a previous term. When an entry from the current term is committed, entries belonging to previous terms get committed indirectly.

Take some time to understand these algorithms; re-read if required. In the next part of this series, we’ll discuss & simulate some cases with a Raft simulator, which will hopefully make these algorithms very clear to you.

Reference

  1. https://raft.github.io/raft.pdf
  2. https://eli.thegreenplace.net/2020/implementing-raft-part-1-elections/
  3. https://eli.thegreenplace.net/2020/implementing-raft-part-2-commands-and-log-replication/


Deep discussions on problem solving, distributed systems, computing concepts & real-life system design. Developer @Uber. https://in.linkedin.com/in/kousikn