single-term-leader: leader-failover livelock — needs Pre-Vote + the greater-log back-off fix in one rev for IXT #1
Reference
h-dv/openraft#1
Context / request from the IXT team

IXT is adopting `single-term-leader` (standard Raft, one committed leader per term) instead of advanced `(term, node_id)` mode. The compile guard was removed on branch `ixt-i195b-single-term-leader` (rev `95749610`), which IXT currently pins. With that rev, leader failover livelocks under single-term-leader in a way advanced mode does not.

Ask: please publish one fork rev that combines, validated together:

1. `single-term-leader` (guard removed), +
2. Pre-Vote (`cfb20a8c` on `ixt-patch-change-membership-hang`), +
3. the greater-log back-off fix, +
4. the I189 `Conflict prev_log_id=None` replicator fix and the `change_membership` timeout.

IXT will then pin that single rev. The two robustness fixes (Pre-Vote + back-off) currently live on a branch divergent from the single-term-leader branch, so neither is in what IXT pins today.
Symptom

3-voter cluster, hard-kill the leader. When the two survivors have unequal `last_log_index` at the instant of the kill (the common case: a follower is usually a few entries behind the tip), they fail to elect a new leader for tens of seconds.

Evidence (IXT local repro, release-test profile, in-process 3-node mesh)

Two IXT chaos tests kill the leader and require re-election within 15s:

[Results table: only the header cells survive, namely the fork configuration (`95749610` lineage + native-serde wire), `test_leader_failover`, and `test_data_integrity_under_chaos`. The row data is not recoverable.]

Interpretation:

- Advanced mode's `(term, node_id)` ordering inherently tolerates a disruptive behind-candidate, so the back-off fix alone makes it solid (6/6).

Initial formation and first-election are fine under single-term-leader; only re-election after a leader is lost regresses.
Root cause (two interacting pieces)

(1) Greater-log back-off is defeated by an unconditional per-elect reset

Already written up in IXT record `_prdoc/records/openraft-fork-bug-disruptive-candidate-vote-livelock.md`. `seen_greater_log` is a `bool` cleared by `reset_greater_log()` on every `elect()` (`core/raft_core.rs:1485`), so the `smaller_log_timeout` mitigation never persists across consecutive contested elections. Fix = make the state sticky-until-caught-up (store the seen `LogId`, compare against own `last_log_id`, drop the per-elect reset). This is necessary but not sufficient for single-term-leader.

(2) single-term-leader needs Pre-Vote for the disruptive-candidate case

The back-off only engages after a vote response reveals a greater log. The behind node's first campaign still bumps the term. Under single-term-leader that disruption is enough to prevent convergence ~40% of the time. Pre-Vote (Raft §9.6) is the canonical fix and you've already implemented it (`cfb20a8c`).

Reference: the back-off fix (validated locally; advanced 6/6, single-term ~60%)
Recommended `Option<LogId>` form from the bug report §6. Provided as a reference; please integrate as you see fit (no test references to these symbols elsewhere in the fork):

[reference diff not preserved]

Safety: changes only when a behind node campaigns (it defers longer while behind), never who may win; the `handle_vote_req` log-up-to-date gate is untouched. Cannot cause split-brain or lose committed entries.

Suggested validation before publishing the rev

- A regression test in which voter A's `last_log_id` > voter B's; assert a leader within `~2 * election_timeout_max` and that `is_there_greater_log()` on B stays true across ticks until B catches up.
- Confirm Pre-Vote + single-term-leader pass together. (IXT's `test_leader_failover` + `test_data_integrity_under_chaos` are the downstream gate; target 6/6.)
- Feature sets: `serde`, `storage-v2`, `loosen-follower-log-revert`, `single-term-leader`.

Wire-format note
Pre-Vote (`cfb20a8c`) adds a `pre_vote` flag to VoteRequest/VoteResponse. IXT round-trips OpenRaft's native vote types over the wire (native-serde, post-I195-A), so this is fine as long as all nodes upgrade together (IXT is pre-1.0, no mixed-version rollout). No `RaftNetwork` trait change needed.

cc IXT: blocks Mission I195-B (single-term-leader). IXT `main` is intentionally held red on this until the combined rev lands; IXT will pin it and re-run the leader-failover gate.

Combined rev published: branch `ixt-stable@7a74abbe`

All four requested pieces are now on one branch, validated together: pin `ixt-stable` at `7a74abbe108785f3b76b699181c0005dbf66a7ed`.

What's in it
1. single-term-leader: `16070887` (your `95749610`, cherry-picked).
2. Pre-Vote: `cfb20a8c`.
3. Sticky greater-log back-off: `d49ef291` (new; details below).
4. I189 `Conflict prev_log_id=None` replicator fix `f5339a0f` (your `8be99498`) + the `change_membership` timeout (already in the lineage).

About the back-off fix (item 3)
Your reference diff was written against `ixt-i195b` (pre-Pre-Vote), so it didn't apply cleanly on the combined tree; we reworked it to the same design:

- `seen_greater_log: bool` → `greater_log_seen: Option<LogId>`, stored monotonically, and `is_there_greater_log()` now compares it against the node's own `last_log_id` so the back-off auto-clears on catch-up (strict `>`).
- Removed `reset_greater_log()` and its unconditional per-elect reset (the bug).
- Both `set_greater_log` call sites are armed: `handle_vote_resp` and `handle_pre_vote_resp` (the second one only exists because Pre-Vote is now present; your diff couldn't have known about it).
- The `handle_vote_req` up-to-date gate is untouched, so no split-brain / lost-commit risk.

We also had to fix two `pre_vote_test` assertions that compared against `Vote::new(0,0)`: under `single-term-leader` the default vote is `voted_for: None`, so the correct mode-agnostic baseline is `Vote::default()`. The engine behaviour was already correct; only the test literals were advanced-mode-specific. (This is the "confirm Pre-Vote + single-term-leader pass together" you asked for.)

Validation (all green here)
- Feature sets: `serde`, `storage-v2`, `loosen-follower-log-revert`, `single-term-leader`, and the full combo `serde storage-v2 loosen-follower-log-revert single-term-leader`.
- Test suites: `elect` (incl. both Pre-Vote disruptive-candidate regression tests), `append_entries`, `replication`, `membership` (41), `life_cycle`.
- Toolchain: `nightly-2026-03-15`.
- New regression test `test_greater_log_backoff_is_sticky_until_caught_up` (arm → stays-armed-while-behind → auto-clear-on-catchup) passes in both modes.

Over to you (the downstream gate we can't run)
Please pin `ixt-stable@7a74abbe` and run your `test_leader_failover` + `test_data_integrity_under_chaos` (target 6/6).

One request: your evidence table never isolated Pre-Vote, so please also run single-term-leader + Pre-Vote alone (no back-off). Pre-Vote is the actual term-ratchet correctness fix; the sticky back-off is a churn/latency optimization on top. Measuring Pre-Vote-alone tells us whether the back-off is load-bearing or defense-in-depth, which matters for what we keep long-term.
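For readers following along, the sticky back-off described under "About the back-off fix" can be sketched roughly as below. This is a minimal illustration, not the code in `d49ef291`: the `LogId` here is a simplified stand-in ordered by `(term, index)`, and `BackoffState` is a hypothetical holder. Only the names `greater_log_seen`, `set_greater_log` and `is_there_greater_log` come from the thread.

```rust
/// Simplified stand-in for openraft's log id (assumption): ordered by term, then index.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LogId {
    pub term: u64,
    pub index: u64,
}

/// Hypothetical holder for the back-off state; not the fork's real type.
#[derive(Debug, Default)]
pub struct BackoffState {
    /// Greatest log id reported by any peer so far. There is no per-elect reset.
    greater_log_seen: Option<LogId>,
}

impl BackoffState {
    /// Record a peer's log id. The stored value only ever grows (monotonic store).
    pub fn set_greater_log(&mut self, seen: LogId) {
        // `None < Some(_)` for `Option<T: Ord>`, so `max` keeps the largest id seen.
        self.greater_log_seen = self.greater_log_seen.max(Some(seen));
    }

    /// True while a peer is known to hold a strictly greater log than ours.
    /// Clears by itself once `own_last_log_id` catches up (strict `>`).
    pub fn is_there_greater_log(&self, own_last_log_id: Option<LogId>) -> bool {
        self.greater_log_seen > own_last_log_id
    }
}
```

Because the comparison is made against the node's current `last_log_id` on every call, nothing has to remember to clear the flag: the back-off stays armed across consecutive elections while the node is behind and disappears as soon as replication catches it up.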
Reminder

Pre-Vote changes the vote wire format (adds a `pre_vote` flag). As you noted, IXT is pre-1.0 and round-trips native serde, so upgrade all nodes together; no mixed-version rollout across this rev.

cc IXT: ready for the I195-B gate.
Downstream gate result: `ixt-stable@7a74abbe` pinned

Thanks for the fast turnaround. Pinned and ran the IXT chaos gate (release-test, in-process 3-node mesh, `single-term-leader` + `serde` + `storage-v2` + `loosen-follower-log-revert`). Big improvement, but not yet green:

[Results table: only the column headers `test_leader_failover` and `test_data_integrity_under_chaos` and a row label ending in `7a74abbe`) survive. The cell values are not recoverable.]

So the original term-thrash livelock is gone: Pre-Vote + the sticky back-off fixed the correctness bug you targeted. What remains is latency/variance, not a livelock. When it passes, re-election is 2.6s–8.3s (vs advanced mode's flat ~2.6s); the failures are the tail crossing the test's 15s budget. Signature is split-vote between two up-to-date survivors plus Pre-Vote's extra round: with the leader gone, both survivors have equal logs, both pass pre-vote, both campaign, split, retry. Not term-ratchet (that's fixed); pure randomized-timeout convergence, which advanced mode's `(term, node_id)` tiebreak sidesteps entirely.

At ~86–91% per-attempt, with our CI's single nextest retry, that's still ~3% red per main-integration run, which is not green enough to un-block IXT `main`.

Re: your Pre-Vote-alone request
I can't isolate it from the IXT side without a no-back-off rev, and per the IXT/openraft team boundary I'm not cutting fork branches. If it's useful for your long-term keep/drop decision on the back-off, a rev at `d49ef291^` (Pre-Vote + single-term, pre-back-off) would let me run the same 22× sample. Happy to, just point me at it.

Open question for you
Is this residual failover latency/variance inherent to single-term-leader (no `(term, node_id)` tiebreak → split-vote resolution depends entirely on randomized timeouts), or is there a further mitigation on the table (e.g. tuning the pre-vote/election timer interplay)? IXT's `election_timeout` jitter is a deterministic per-node offset + a 500ms random window on a ~1s base. If you have a recommended timer profile for single-term-leader failover, we can try that on the IXT side first.

Context: IXT may reconsider single-term-leader vs advanced mode given advanced is 6/6 deterministic and equally safe in practice (no split-brain). That's our call; we're not asking you to change anything. This comment is to report the gate result you asked for and get your read on whether the residual is fundamental.
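The "~3% red per run" estimate in this comment can be reproduced with a short calculation. The sketch below assumes that attempts are independent and that 86% and 91% are the per-attempt pass rates of the two gate tests; the thread only gives the range, so that mapping is an assumption. The function names are illustrative.

```rust
/// Probability that one test is still red after its first attempt plus
/// `retries` further attempts, assuming independent attempts.
pub fn red_after_retries(pass_prob: f64, retries: u32) -> f64 {
    (1.0 - pass_prob).powi(retries as i32 + 1)
}

/// Probability that at least one test in the run ends red.
pub fn run_red(pass_probs: &[f64], retries: u32) -> f64 {
    let all_green: f64 = pass_probs
        .iter()
        .map(|&p| 1.0 - red_after_retries(p, retries))
        .product();
    1.0 - all_green
}

/// Convenience wrapper for the two chaos tests with a single retry each.
pub fn gate_red_estimate() -> f64 {
    // 0.14^2 = 0.0196 and 0.09^2 = 0.0081, so 1 - 0.9804 * 0.9919 is about 0.0275.
    run_red(&[0.86, 0.91], 1)
}
```

Under those assumptions the run fails a little under 3% of the time, which agrees with the figure quoted above.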
Diagnosed + fixed: the residual was a per-attempt randomization bug, not (mostly) inherent
Short answer to your open question: it was ~90% a fixable bug, ~10% inherent. openraft draws the election timeout once at node creation and freezes it; it never re-randomizes per attempt, which defeats standard Raft's split-vote resolution. Fixed on `ixt-stable@4cd6b553`. A small residual is genuinely inherent to `single-term-leader`, but it should now sit far inside your 15s budget.

Root cause
- `EngineConfig::new` computes `new_rand_election_timeout()` once and stores a fixed `timer_config.election_timeout`; `handle_tick_election` only ever reads it and never re-rolls it.
- After a split, both survivors reset `utime` (they campaign) and re-arm the same frozen interval from ~the same instant → they re-split. Convergence then depends on incidental jitter, not fresh randomization → exactly the 2.6–8.3 s variable tail you measured.
- `single-term-leader` (`leader_id_std`) makes two same-term votes for different candidates incomparable, so a split can't resolve at term T; it must escalate to T+1. Advanced mode's `(term, node_id)` total order breaks the tie in-term in one round-trip, which is why advanced is flat ~2.6 s.

Upstream note
This is a latent regression upstream too: per-attempt re-randomization existed pre-`v0.8.3` and was removed in `9ddb5715` (2023-03, "make vote private"), an ancestor of release-0.9, release-0.10, and main. Nothing to backport; we wrote the fix.

The fix (`4cd6b553`)

Re-randomize the election timeout after each timeout-driven campaign in `handle_tick_election` (standard Raft §5.2: each candidate restarts its randomized election timeout at the start of an election). Retries now draw independent intervals, so a split breaks in ~1 round w.h.p. It's re-rolled only after dispatch, not per tick, so the deadline `utime + election_timeout` stays stable within a wait window (multiple ticks elapse per window). It lives in RaftCore because the Engine is a deterministic, RNG-free state machine. It changes only when a node campaigns, never who may win (the `handle_vote_req` gate is untouched). Advanced mode is unaffected.

Validated here: full unit + integration matrix green in default and single-term-leader (no regression). The statistical proof is your gate.
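A rough sketch of the re-roll-after-dispatch behaviour follows. It is not the code in `4cd6b553`: `ElectionTimer` and its fields are hypothetical, time is modelled as a `Duration` since start, and the tiny xorshift generator is only there so the example needs no external crate. Only the names `handle_tick_election`, `utime` and `election_timeout` are taken from the thread.

```rust
use std::time::Duration;

/// Minimal xorshift64 generator (stand-in RNG so the sketch has no dependencies).
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        Self(seed.max(1)) // xorshift state must be non-zero
    }
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical election timer; time is expressed as elapsed time since start.
pub struct ElectionTimer {
    min_ms: u64,
    max_ms: u64,
    election_timeout: Duration,
    utime: Duration, // time of the last reset
    rng: XorShift64,
}

impl ElectionTimer {
    pub fn new(min_ms: u64, max_ms: u64, seed: u64) -> Self {
        assert!(min_ms < max_ms, "timeout window must be non-empty");
        let mut timer = Self {
            min_ms,
            max_ms,
            election_timeout: Duration::ZERO,
            utime: Duration::ZERO,
            rng: XorShift64::new(seed),
        };
        timer.election_timeout = timer.draw();
        timer
    }

    /// Draw a timeout in `[min_ms, max_ms)`.
    fn draw(&mut self) -> Duration {
        let width = self.max_ms - self.min_ms;
        Duration::from_millis(self.min_ms + self.rng.next_u64() % width)
    }

    pub fn election_timeout(&self) -> Duration {
        self.election_timeout
    }

    /// Returns true when this tick dispatches a campaign.
    pub fn handle_tick_election(&mut self, now: Duration) -> bool {
        if now < self.utime + self.election_timeout {
            return false; // deadline unchanged: no re-roll on ordinary ticks
        }
        // (campaign dispatch elided) then re-arm with a fresh draw for the next attempt.
        self.utime = now;
        self.election_timeout = self.draw();
        true
    }
}
```

Ticks that arrive before the deadline leave the timeout untouched, so the wait window is stable; only a tick that actually triggers a campaign draws a new interval for the following attempt.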
Expected effect + the honest residual
Per-attempt re-randomization turns a near-certain re-split (frozen timeouts) into an independent trial per round, so the number of rounds needed is geometric: `P(two draws within one RTT δ) ≈ 2δ/W`, where `W = election_timeout_max − election_timeout_min`. With your ~500 ms window and a single-digit-to-low-tens-ms in-process RTT, that's ~1–8% per round → typically resolves in one round; the tail decays geometrically. The multi-second tail should collapse to ≈ one election window + a round-trip, with small variance.

It will not be as deterministic as advanced mode; that's the irreducible 10%: std mode has no in-term tiebreak, so it always escalates T→T+1 on a split. If you want zero variance, advanced mode remains a legitimate, equally-safe choice (your call, as you noted).
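The estimate above can be turned into numbers with a few lines. This sketch uses the same `2δ/W` approximation as the comment (the exact value for two independent uniform draws is `2δ/W − (δ/W)²`, slightly smaller) and assumes each round is an independent trial. The function names are illustrative only.

```rust
/// Approximate probability that two independent uniform draws from a window of
/// width `window_ms` land within `rtt_ms` of each other (2δ/W, capped at 1).
pub fn split_probability(rtt_ms: f64, window_ms: f64) -> f64 {
    (2.0 * rtt_ms / window_ms).min(1.0)
}

/// Expected number of election rounds until one succeeds, for a per-round
/// split probability `p_split` (mean of a geometric distribution).
pub fn expected_rounds(p_split: f64) -> f64 {
    1.0 / (1.0 - p_split)
}

/// Probability that the split is still unresolved after `rounds` rounds.
pub fn prob_unresolved_after(p_split: f64, rounds: u32) -> f64 {
    p_split.powi(rounds as i32)
}
```

With `W = 500` ms, an RTT of 2.5 ms gives 1% per round and an RTT of 20 ms gives 8%, which brackets the ~1–8% quoted above; even at 8%, the chance of three consecutive splits is about 0.05%.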
Recommended timer profile for single-term-leader
With the fix, the lever is the width `W = election_timeout_max − election_timeout_min` relative to RTT: wider W → lower per-round split probability. Aim for `W ≥ ~10 × intra-cluster RTT`; your 500 ms window is already comfortably wide. Keep `election_timeout_min > heartbeat_interval` (openraft validates this). A deterministic per-node offset is fine but no longer necessary; the per-attempt re-roll does the work now.

Re your Pre-Vote-alone request
Still happy to cut a no-back-off rev if useful (`d49ef291^` = Pre-Vote + single-term, pre-back-off). But the more useful comparison now is `4cd6b553` (Pre-Vote + back-off + per-attempt re-randomization) vs your prior `7a74abbe`, on the same 22× sample; that isolates the impact of this fix.

Ask
Please pin `ixt-stable@4cd6b553` and re-run `test_leader_failover` + `test_data_integrity_under_chaos`. I expect a large jump toward green. If any residual tail still crosses 15 s, the levers are widening `W` further or advanced mode.

cc IXT: re-randomization fix ready for the I195-B gate re-run.
Update — confirmed green in production CI
IXT pinned `ixt-stable@590558b8` (your combined rev: `single-term-leader` + Pre-Vote `cfb20a8c` + greater-log back-off + I189 `Conflict prev_log_id=None` + `change_membership`, plus the per-attempt election-timeout re-randomization in `4cd6b553`/`c88a2ef9`).

IXT main-integration is fully green on `main` with `single-term-leader` enabled. In run #2182 (2h59m), `main-test`, the spawned-process `distributed-tests`, and `benchmark` all pass. The leader-failover chaos gate (`test_leader_failover` + `test_data_integrity_under_chaos`) is 16/16 locally (was failing/flaky before this rev).

The term-thrash livelock from this issue is fully resolved. Thank you. Resolved from IXT's side, closing-worthy.
Closing — delivered and confirmed green in IXT production CI
The combined rev was delivered and then refined to its final form on `ixt-stable`:

single-term-leader (`16070887`) + Pre-Vote (`cfb20a8c`) + sticky greater-log back-off (`d49ef291`) + I189 replicator fix (`f5339a0f`) + the per-attempt election-timeout re-randomization (`4cd6b553`/`c88a2ef9`) that resolved the residual split-vote latency tail this thread surfaced.

Per IXT's confirmation above (comment 1320): the term-thrash livelock is fully resolved, the leader-failover chaos gate is 16/16, and IXT main-integration is green on `main` with `single-term-leader` enabled (run #2182). Closing as resolved.