Pre-Vote: stale grants can promote to a real election after a same-leader heartbeat refreshes the lease #4

Closed
opened 2026-05-23 16:16:36 +02:00 by buildagent · 1 comment
Member

Summary

The current Pre-Vote implementation keeps Engine::pre_candidate alive across same-vote leader heartbeats. That avoids starving a pre-vote round, but it creates a stale-grant window: an old pre-vote quorum can still promote into Engine::elect() after a live leader has refreshed this node's leader lease with AppendEntries.

This weakens the main Pre-Vote guarantee from Raft dissertation §9.6: a node that has recently heard from a valid leader should not start a disruptive real election.

Code pointers

  • openraft/src/engine/handler/vote_handler/mod.rs: update_vote() clears pre_candidate only when vote > self.state.vote_ref(). Same-vote committed leader heartbeats call touch() and leave the pre-vote round intact.
  • openraft/src/engine/engine_impl.rs: handle_pre_vote_resp() promotes on quorum by clearing pre_candidate and calling self.elect() without re-checking current leader lease / election timer freshness.
  • openraft/src/core/raft_core.rs: handle_notify() routes resp.pre_vote to handle_pre_vote_resp() purely by matching the stored prospective vote.

Plausible failure sequence

  1. Follower election timer fires and starts a Pre-Vote round.
  2. One or more peers grant.
  3. Before the last grant arrives, the current leader's AppendEntries/heartbeat arrives with the same committed vote and refreshes vote_last_modified / leader lease.
  4. Because this is a same-vote touch, pre_candidate survives.
  5. A delayed Pre-Vote response arrives and completes quorum.
  6. handle_pre_vote_resp() calls elect() immediately, incrementing/persisting a real vote despite the fresh leader lease.
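
To make the interleaving concrete, the sequence above can be replayed against a toy model. This is a minimal sketch under stated assumptions: `ToyFollower`, its fields, its methods, and the logical millisecond clock are all hypothetical and are not openraft's `Engine` API. Only the shape of the bug (promotion on quorum without a lease check) is taken from the description above.

```rust
use std::collections::BTreeSet;

/// Hypothetical toy follower used only to replay the interleaving.
/// It is not openraft's `Engine`; time is a plain logical counter.
#[derive(Debug)]
pub struct ToyFollower {
    pub id: u64,
    pub now: u64,
    pub lease_ms: u64,
    pub lease_expires_at: u64,
    pub quorum: usize,
    /// `Some(granters)` while a pre-vote round is in flight.
    pub pre_candidate: Option<BTreeSet<u64>>,
    /// Counts promotions into a real election.
    pub real_elections: u32,
}

impl ToyFollower {
    /// Starts with a fresh lease, as if a heartbeat arrived at time 0.
    pub fn new(id: u64, quorum: usize, lease_ms: u64) -> Self {
        ToyFollower {
            id,
            now: 0,
            lease_ms,
            lease_expires_at: lease_ms,
            quorum,
            pre_candidate: None,
            real_elections: 0,
        }
    }

    pub fn advance(&mut self, ms: u64) {
        self.now += ms;
    }

    pub fn lease_is_unexpired(&self) -> bool {
        self.now < self.lease_expires_at
    }

    /// Step 1: the election timer fired; the node counts its own grant.
    pub fn pre_elect(&mut self) {
        let mut granters = BTreeSet::new();
        granters.insert(self.id);
        self.pre_candidate = Some(granters);
    }

    /// Steps 3-4: a same-vote heartbeat refreshes the lease but leaves
    /// the in-flight pre-vote round untouched.
    pub fn same_vote_heartbeat(&mut self) {
        self.lease_expires_at = self.now + self.lease_ms;
    }

    /// Steps 5-6 (buggy variant): promote on quorum with no lease check.
    pub fn on_pre_vote_grant(&mut self, from: u64) {
        let quorum = self.quorum;
        let reached = match self.pre_candidate.as_mut() {
            Some(granters) => {
                granters.insert(from);
                granters.len() >= quorum
            }
            None => false,
        };
        if reached {
            self.pre_candidate = None;
            self.real_elections += 1;
        }
    }
}

/// Replays steps 1-6 for a 5-node cluster (quorum 3). Returns whether the
/// lease is fresh at the end and how many real elections were started.
pub fn replay_failure_sequence() -> (bool, u32) {
    let mut f = ToyFollower::new(1, 3, 100);
    f.advance(150); // lease expired, election timer fires
    f.pre_elect(); // step 1
    f.on_pre_vote_grant(2); // step 2: one grant short of quorum
    f.advance(10);
    f.same_vote_heartbeat(); // steps 3-4
    f.advance(10);
    f.on_pre_vote_grant(3); // steps 5-6: delayed grant completes quorum
    (f.lease_is_unexpired(), f.real_elections)
}
```

In this model the final state has both a fresh lease and one real election started, which is the combination the issue argues should be impossible.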

The current unit/integration tests cover lease rejection when answering Pre-Vote and isolated-follower non-disruption, but they do not cover this stale-response-after-heartbeat interleaving.

Expected behavior

Once a node accepts an AppendEntries from the current leader after a Pre-Vote round has started, grants collected for that round must not trigger a real election unless the election timer/lease has expired again.

Suggested fixes

  • Add a pre_vote_epoch / election timer generation and carry it through Notify::VoteResponse, rejecting responses from old rounds.
  • Or clear pre_candidate on accepted AppendEntries/heartbeat from the current committed leader, but avoid the starvation noted in docs/design/pre-vote-spike.md by restarting only after the next real timeout.
  • At minimum, re-check the local leader lease/election timeout immediately before promotion in handle_pre_vote_resp().
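
The first option can be sketched as follows. This is a hypothetical illustration only: `EpochFollower`, `PreVoteGrant`, and the rule that an accepted heartbeat bumps the epoch are assumptions made for the example. The actual shape of `Notify::VoteResponse` and how an epoch would be threaded through it are not shown in this issue.

```rust
/// Hypothetical pre-vote grant carrying the epoch of the round it answers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PreVoteGrant {
    pub epoch: u64,
}

/// Hypothetical follower that tags every pre-vote round with an epoch.
/// For brevity the sketch does not de-duplicate grants per peer.
#[derive(Debug)]
pub struct EpochFollower {
    epoch: u64,
    granted: usize,
    quorum: usize,
    round_active: bool,
}

impl EpochFollower {
    pub fn new(quorum: usize) -> Self {
        EpochFollower { epoch: 0, granted: 0, quorum, round_active: false }
    }

    /// Election timeout: start a new round and return its epoch, which
    /// outgoing pre-vote requests (and their responses) would carry.
    pub fn start_round(&mut self) -> u64 {
        self.epoch += 1;
        self.granted = 1; // the node's own grant
        self.round_active = true;
        self.epoch
    }

    /// Accepted heartbeat from the current leader: bump the epoch so that
    /// every outstanding grant becomes stale, and drop the round.
    pub fn on_leader_heartbeat(&mut self) {
        self.epoch += 1;
        self.round_active = false;
    }

    /// Returns true only when a grant for the *current* round completes
    /// quorum; grants from older epochs are rejected.
    pub fn on_grant(&mut self, grant: PreVoteGrant) -> bool {
        if !self.round_active || grant.epoch != self.epoch {
            return false;
        }
        self.granted += 1;
        if self.granted >= self.quorum {
            self.round_active = false;
            true
        } else {
            false
        }
    }
}
```

With this shape, a grant issued before the heartbeat can never count toward a round started after it, at the cost of carrying one extra field in the response path.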

Suggested regression test

Engine-level deterministic test:

  1. Create follower with committed leader vote and expired lease.
  2. Start pre_elect() and receive one grant short of quorum.
  3. Apply same committed leader AppendEntries to refresh lease without changing vote.
  4. Deliver delayed pre-vote grant that would otherwise complete quorum.
  5. Assert no SaveVote, no real SendVote{pre_vote:false}, and state remains follower until the next election timeout.

This is consensus-critical because it can turn Pre-Vote back into a disruptive election under message reordering.

Author
Member

Fixed on ixt-stable (CI green, run #2179, 14/14)

4d4d715f — re-check leader lease before promoting a Pre-Vote round

Implemented the "at minimum, re-check the local leader lease/election timeout immediately before promotion" suggestion (the lowest-risk of the three options, no wire/protocol change):

  • Extracted Engine::leader_lease_is_unexpired() from would_grant_vote (step 1, the same lease check used when answering a vote).
  • handle_pre_vote_resp() now calls it immediately before promotion: if a leader lease was refreshed after the pre-vote round started, the stale round is abandoned (pre_candidate = None; return) instead of calling elect(). This closes the stale-grant window from the failure sequence in the issue — a delayed grant can no longer ratchet a real vote after a same-vote leader heartbeat refreshed the lease.
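
The guard described in the two bullets above might look roughly like the sketch below. It is a simplified stand-in, not the code from the commit: `ToyEngine`, `PreVoteOutcome`, the grant counter, and the logical clock are hypothetical. Only the method names `leader_lease_is_unexpired` and `handle_pre_vote_resp` and the abandon-instead-of-elect behaviour come from the description above.

```rust
/// Hypothetical result type, used here to make the branches observable.
#[derive(Debug, PartialEq, Eq)]
pub enum PreVoteOutcome {
    /// No pre-vote round is in flight; the response is dropped.
    Ignored,
    /// Quorum not reached yet.
    Waiting,
    /// Quorum reached, but the leader lease is fresh: round abandoned.
    Abandoned,
    /// Quorum reached with an expired lease: real election started.
    Promoted,
}

/// Hypothetical engine state; time is a logical counter.
#[derive(Debug)]
pub struct ToyEngine {
    pub now: u64,
    pub lease_expires_at: u64,
    pub quorum: usize,
    /// `Some(grants collected)` while a pre-vote round is in flight.
    pub pre_candidate: Option<usize>,
    pub term: u64,
}

impl ToyEngine {
    /// Same check a node would apply when answering a vote request.
    pub fn leader_lease_is_unexpired(&self) -> bool {
        self.now < self.lease_expires_at
    }

    pub fn handle_pre_vote_resp(&mut self, granted: bool) -> PreVoteOutcome {
        let grants = match self.pre_candidate.as_mut() {
            Some(g) => g,
            None => return PreVoteOutcome::Ignored,
        };
        if granted {
            *grants += 1;
        }
        if *grants < self.quorum {
            return PreVoteOutcome::Waiting;
        }

        // Quorum reached: the round ends either way.
        self.pre_candidate = None;

        // Re-check the lease immediately before promotion. A heartbeat
        // that arrived mid-round makes the collected grants stale.
        if self.leader_lease_is_unexpired() {
            return PreVoteOutcome::Abandoned;
        }

        // Stand-in for `elect()`: bump the term for a real election.
        self.term += 1;
        PreVoteOutcome::Promoted
    }
}
```

For example, an engine with `now: 170`, `lease_expires_at: 260`, `quorum: 3`, and `pre_candidate: Some(2)` returns `Abandoned` on the next grant and leaves `term` unchanged, while the same state with `now: 300` returns `Promoted`. Note that the sketch does not touch any election timer when abandoning, matching the point that only the round is dropped.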

This restores the §9.6 guarantee (a node that recently heard from a valid leader does not start a disruptive real election) without the starvation risk noted in pre-vote-spike.md, because only the stale round is abandoned; the election timer is not reset.

Regression tests (openraft/src/engine/tests/pre_vote_test.rs)

  • test_handle_pre_vote_resp_stale_grant_blocked_by_refreshed_lease — the exact 5-step interleaving from the issue (grant short of quorum → same-leader heartbeat refreshes lease → delayed grant completes quorum); asserts no promotion / no real vote.
  • test_handle_pre_vote_resp_promotes_when_lease_genuinely_expired — positive liveness: when the lease has expired, quorum still promotes normally.

Closing as fixed.

Reference
h-dv/openraft#4