single-term-leader: leader-failover livelock — needs Pre-Vote + the greater-log back-off fix in one rev for IXT #1

Closed
opened 2026-05-23 10:25:48 +02:00 by buildagent · 5 comments
Member

Context / request from the IXT team

IXT is adopting single-term-leader (standard Raft, one committed leader per term) instead of advanced (term, node_id) mode. The compile guard was removed on branch ixt-i195b-single-term-leader (rev 95749610), which IXT currently pins. With that rev, leader failover livelocks under single-term-leader in a way advanced mode does not.

Ask: please publish one fork rev that combines, validated together:

  1. single-term-leader (guard removed), +
  2. Pre-Vote (you already implemented it — cfb20a8c on ixt-patch-change-membership-hang), +
  3. the greater-log back-off fix (diff below), +
  4. the existing IXT patches already in the lineage: the I189 Conflict prev_log_id=None replicator fix and the change_membership timeout.

IXT will then pin that single rev. The two robustness fixes (Pre-Vote + back-off) currently live on a branch divergent from the single-term-leader branch, so neither is in what IXT pins today.


Symptom

3-voter cluster, hard-kill the leader. When the two survivors have unequal last_log_index at the instant of the kill (the common case — a follower is usually a few entries behind the tip), they fail to elect a new leader for tens of seconds.

Evidence (IXT local repro, release-test profile, in-process 3-node mesh)

Two IXT chaos tests kill the leader and require re-election within 15s:

| Fork config (all on rev 95749610 lineage + native-serde wire) | test_leader_failover | test_data_integrity_under_chaos |
|---|---|---|
| single-term-leader, as-pinned (no back-off fix, no Pre-Vote) | fails (both retries in CI) | fails (both retries) |
| single-term-leader + greater-log back-off fix | 3/5 (flaky) | 3/5 (flaky) |
| advanced mode + greater-log back-off fix | 6/6 (~2.6s re-election) | 6/6 |

Interpretation:

  • Advanced mode's (term, node_id) ordering inherently tolerates a disruptive behind-candidate, so the back-off fix alone makes it solid (6/6).
  • single-term-leader's strict one-committed-leader-per-term does not tolerate it: a behind survivor that bumps the term before it has observed the up-to-date survivor's greater log still ratchets the term and starves the legitimate winner. The back-off heuristic can't cover that first disruptive campaign — which is exactly the case Pre-Vote is designed for (a node that cannot win never increments any term).

Initial formation and first-election are fine under single-term-leader; only re-election after a leader is lost regresses.

Root cause (two interacting pieces)

(1) Greater-log back-off is defeated by an unconditional per-elect reset

Already written up in IXT record _prdoc/records/openraft-fork-bug-disruptive-candidate-vote-livelock.md. seen_greater_log is a bool cleared by reset_greater_log() on every elect() (core/raft_core.rs:1485), so the smaller_log_timeout mitigation never persists across consecutive contested elections. Fix = make the state sticky-until-caught-up (store the seen LogId, compare against own last_log_id, drop the per-elect reset). This is necessary but not sufficient for single-term-leader.

(2) single-term-leader needs Pre-Vote for the disruptive-candidate case

The back-off only engages after a vote response reveals a greater log. The behind node's first campaign still bumps the term. Under single-term-leader that disruption is enough to prevent convergence ~40% of the time. Pre-Vote (Raft §9.6) is the canonical fix and you've already implemented it (cfb20a8c).
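To make the Pre-Vote argument concrete, here is a minimal sketch of the idea. It is not the fork's implementation in cfb20a8c: `LogId`, `Node`, `grants_pre_vote` and `try_campaign` are hypothetical names, and the "has this peer heard from a live leader recently" check that real implementations add is left out. The point it illustrates is only that a behind node loses the pre-vote round and therefore never increments its term.

```rust
/// Simplified log position, ordered by (term, index).
/// Hypothetical type for this sketch, not openraft's LogId.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogId {
    term: u64,
    index: u64,
}

#[derive(Debug)]
struct Node {
    term: u64,
    last_log_id: Option<LogId>,
}

/// A peer grants a pre-vote only if the candidate's log is at least as
/// up to date as its own. (Assumption: leader-liveness checks omitted.)
fn grants_pre_vote(peer: &Node, candidate_last_log_id: Option<LogId>) -> bool {
    candidate_last_log_id >= peer.last_log_id
}

/// Runs the pre-vote round against the reachable peers and bumps the term
/// only when a quorum (counting the candidate itself) would grant.
/// Returns true if the node went on to start a real election.
fn try_campaign(candidate: &mut Node, peers: &[Node], cluster_size: usize) -> bool {
    let grants = 1 + peers
        .iter()
        .filter(|p| grants_pre_vote(p, candidate.last_log_id))
        .count();
    let quorum = cluster_size / 2 + 1;
    if grants >= quorum {
        candidate.term += 1;
        true
    } else {
        // Cannot win: the term is left untouched, so nobody is disrupted.
        false
    }
}
```

In the 3-voter scenario from the Symptom section, the behind survivor gets only its own grant (1 of the 2 needed) and stays at its current term, while the up-to-date survivor collects 2 and proceeds.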

Reference: the back-off fix (validated locally; advanced 6/6, single-term ~60%)

Recommended Option<LogId> form from the bug report §6. Provided as a reference — please integrate as you see fit (no test references to these symbols elsewhere in the fork):

diff --git a/openraft/src/core/raft_core.rs b/openraft/src/core/raft_core.rs
@@ around 1481-1488 in fn handle_*election tick*
-        // Every time elect, reset this flag.
-        self.engine.reset_greater_log();
+        // Do NOT reset the "greater log seen" state here. It auto-clears once this node's own log
+        // catches up (see Engine::is_there_greater_log). Resetting on every elect wiped the
+        // smaller_log_timeout back-off each cycle, letting a behind survivor ratchet the term and
+        // livelock re-election after a leader crash.

diff --git a/openraft/src/engine/engine_impl.rs b/openraft/src/engine/engine_impl.rs
-    pub(crate) seen_greater_log: bool,
+    /// Greatest peer last-log-id observed during election (via a vote response). While strictly
+    /// greater than our own last_log_id, defer the next election by smaller_log_timeout.
+    pub(crate) greater_log_seen: Option<LogId<C::NodeId>>,
@@ Engine::new
-            seen_greater_log: false,
+            greater_log_seen: None,
@@ handle_vote_resp (on resp.last_log_id > self.state.last_log_id())
-            self.set_greater_log();
+            self.set_greater_log(resp.last_log_id.clone());
@@ accessors
-    pub(crate) fn is_there_greater_log(&self) -> bool {
-        self.seen_greater_log
-    }
-    pub(crate) fn set_greater_log(&mut self) {
-        self.seen_greater_log = true;
-    }
-    pub(crate) fn reset_greater_log(&mut self) {
-        self.seen_greater_log = false;
-    }
+    pub(crate) fn is_there_greater_log(&self) -> bool {
+        // True only while still strictly behind the greatest log seen from a peer; auto-clears.
+        self.greater_log_seen.as_ref() > self.state.last_log_id()
+    }
+    pub(crate) fn set_greater_log(&mut self, log_id: Option<LogId<C::NodeId>>) {
+        if log_id > self.greater_log_seen { self.greater_log_seen = log_id; }
+    }

Safety: changes only when a behind node campaigns (it defers longer while behind), never who may win — the handle_vote_req log-up-to-date gate is untouched. Cannot cause split-brain or lose committed entries.
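For readers who want to poke at the sticky-until-caught-up behaviour outside the fork, the following is a standalone model of the design in the diff above. It is a sketch under assumptions: `LogId` is a simplified stand-in, `BackoffState` and `election_delay_ms` are hypothetical, and how the real timer consumes smaller_log_timeout is not reproduced here.

```rust
/// Simplified log position, ordered by (term, index). Hypothetical stand-in.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogId {
    term: u64,
    index: u64,
}

/// Hypothetical container for the two pieces of state the back-off needs.
#[derive(Debug, Default)]
struct BackoffState {
    /// This node's own last log id.
    last_log_id: Option<LogId>,
    /// Greatest peer last-log-id observed via a vote response.
    greater_log_seen: Option<LogId>,
}

impl BackoffState {
    /// Monotonic: only ever moves forward, never reset per election.
    fn set_greater_log(&mut self, log_id: Option<LogId>) {
        if log_id > self.greater_log_seen {
            self.greater_log_seen = log_id;
        }
    }

    /// True only while strictly behind the greatest log seen; clears by
    /// itself once `last_log_id` catches up.
    fn is_there_greater_log(&self) -> bool {
        self.greater_log_seen > self.last_log_id
    }

    /// Assumed usage: pick the longer delay while a greater log is known.
    fn election_delay_ms(&self, election_timeout_ms: u64, smaller_log_timeout_ms: u64) -> u64 {
        if self.is_there_greater_log() {
            smaller_log_timeout_ms
        } else {
            election_timeout_ms
        }
    }
}
```

The three phases to check are the same ones a unit test would want: arm on a greater peer log, stay armed across elections while behind, and clear automatically on catch-up.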

Suggested validation before publishing the rev

  • Engine unit: 3-voter, leader removed, voter A last_log_id > voter B; assert a leader within ~2 * election_timeout_max and that is_there_greater_log() on B stays true across ticks until B catches up.
  • Integration: rolling leader-kill loop with deliberately unequal survivor logs, repeated N times. (IXT's test_leader_failover + test_data_integrity_under_chaos are the downstream gate; target 6/6.)
  • Confirm Pre-Vote + single-term-leader compile and pass together with serde, storage-v2, loosen-follower-log-revert, single-term-leader.

Wire-format note

Pre-Vote (cfb20a8c) adds a pre_vote flag to VoteRequest/VoteResponse. IXT round-trips OpenRaft's native vote types over the wire (native-serde, post-I195-A), so this is fine as long as all nodes upgrade together (IXT is pre-1.0, no mixed-version rollout). No RaftNetwork trait change needed.


cc IXT: blocks Mission I195-B (single-term-leader). IXT main is intentionally held red on this until the combined rev lands; IXT will pin it and re-run the leader-failover gate.

Author
Member

Combined rev published — branch ixt-stable @ 7a74abbe

All four requested pieces are now on one branch, validated together: pin ixt-stable at 7a74abbe108785f3b76b699181c0005dbf66a7ed.

What's in it

  1. single-term-leader (guard removed) — 16070887 (your 95749610, cherry-picked).
  2. Pre-Vote (Raft §9.6) — cfb20a8c.
  3. Greater-log back-off fix: d49ef291 (new; details below).
  4. Existing IXT patches — I189 Conflict prev_log_id=None replicator fix f5339a0f (your 8be99498) + the change_membership timeout (already in the lineage).

About the back-off fix (item 3)

Your reference diff was written against ixt-i195b (pre-Pre-Vote), so it didn't apply cleanly on the combined tree — we reworked it to the same design:

  • seen_greater_log: bool → greater_log_seen: Option<LogId>, stored monotonically, and is_there_greater_log() now compares it against the node's own last_log_id so the back-off auto-clears on catch-up (strict >).
  • Dropped reset_greater_log() and its unconditional per-elect reset (the bug).
  • Both set_greater_log call sites are armed — handle_vote_resp and handle_pre_vote_resp (the second one only exists because Pre-Vote is now present; your diff couldn't have known about it).
  • Changes only WHEN a node campaigns, never WHO may win — the handle_vote_req up-to-date gate is untouched, so no split-brain / lost-commit risk.

We also had to fix two pre_vote_test assertions that compared against Vote::new(0,0): under single-term-leader the default vote is voted_for: None, so the correct mode-agnostic baseline is Vote::default(). The engine behaviour was already correct — only the test literals were advanced-mode-specific. (This is the "confirm Pre-Vote + single-term-leader pass together" you asked for.)

Validation (all green here)

  • Lib unit matrix: default, serde, storage-v2, loosen-follower-log-revert, single-term-leader, and the full combo serde storage-v2 loosen-follower-log-revert single-term-leader.
  • Integration (default): elect (incl. both Pre-Vote disruptive-candidate regression tests), append_entries, replication, membership (41), life_cycle.
  • clippy clean (default + single-term-leader); fmt clean via pinned nightly-2026-03-15.
  • New unit test test_greater_log_backoff_is_sticky_until_caught_up (arm → stays-armed-while-behind → auto-clear-on-catchup) passes in both modes.

Over to you (the downstream gate we can't run)

Please pin ixt-stable @ 7a74abbe and run your test_leader_failover + test_data_integrity_under_chaos (target 6/6).

One request: your evidence table never isolated Pre-Vote — please also run single-term-leader + Pre-Vote alone (no back-off). Pre-Vote is the actual term-ratchet correctness fix; the sticky back-off is a churn/latency optimization on top. Measuring Pre-Vote-alone tells us whether the back-off is load-bearing or defense-in-depth, which matters for what we keep long-term.

Reminder

Pre-Vote changes the vote wire format (adds a pre_vote flag). As you noted, IXT is pre-1.0 and round-trips native serde, so upgrade all nodes together — no mixed-version rollout across this rev.

cc IXT — ready for the I195-B gate.

Author
Member

Downstream gate result — ixt-stable @ 7a74abbe pinned

Thanks for the fast turnaround. Pinned and ran the IXT chaos gate (release-test, in-process 3-node mesh, single-term-leader + serde + storage-v2 + loosen-follower-log-revert). Big improvement, but not yet green:

| config | test_leader_failover | test_data_integrity_under_chaos |
|---|---|---|
| single-term, as-pinned before (no fix) | ~0 | ~0 |
| single-term + back-off only (prev rev) | 3/5 | 3/5 |
| single-term + Pre-Vote + back-off (7a74abbe) | 19/22 (~86%) | 20/22 (~91%) |
| advanced mode + back-off (reference) | 6/6 deterministic | 6/6 |

So the original term-thrash livelock is gone — Pre-Vote + the sticky back-off fixed the correctness bug you targeted. What remains is latency/variance, not a livelock: when it passes, re-election is 2.6s–8.3s (vs advanced mode's flat ~2.6s); the failures are the tail crossing the test's 15s budget. Signature is split-vote between two up-to-date survivors plus Pre-Vote's extra round — i.e. with the leader gone, both survivors have equal logs, both pass pre-vote, both campaign, split, retry. Not term-ratchet (that's fixed); pure randomized-timeout convergence, which advanced mode's (term, node_id) tiebreak sidesteps entirely.

At ~86–91% per-attempt, with our CI's single nextest retry, that's still ~3% red per main-integration run — not green enough to un-block IXT main.
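The ~3% figure can be reproduced with a short calculation. This is a sketch under two assumptions that are mine, not measured: attempts are independent, and a run is red when either test fails both its first attempt and its single retry. The function names are hypothetical.

```rust
/// Probability that one test fails on every attempt, assuming the
/// attempts are independent.
fn fails_all_attempts(pass_rate: f64, attempts: u32) -> f64 {
    (1.0 - pass_rate).powi(attempts as i32)
}

/// Probability that a run is red, i.e. at least one test in the run
/// fails all of its attempts.
fn run_red_probability(pass_rates: &[f64], attempts: u32) -> f64 {
    let all_green: f64 = pass_rates
        .iter()
        .map(|&p| 1.0 - fails_all_attempts(p, attempts))
        .product();
    1.0 - all_green
}
```

With pass rates 19/22 and 20/22 and two attempts each, `run_red_probability(&[19.0 / 22.0, 20.0 / 22.0], 2)` comes out at roughly 2.7%, consistent with the ~3% quoted above.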

Re: your Pre-Vote-alone request

I can't isolate it from the IXT side without a no-back-off rev — and per the IXT/openraft team boundary I'm not cutting fork branches. If it's useful for your long-term keep/drop decision on the back-off, a rev at d49ef291^ (Pre-Vote + single-term, pre-back-off) would let me run the same 22× sample. Happy to, just point me at it.

Open question for you

Is this residual failover latency/variance inherent to single-term-leader (no (term,node_id) tiebreak → split-vote resolution depends entirely on randomized timeouts), or is there a further mitigation on the table (e.g. tuning the pre-vote/election timer interplay)? IXT's election_timeout jitter is a deterministic per-node offset + a 500ms random window on a ~1s base — if you have a recommended timer profile for single-term-leader failover we can try that on the IXT side first.

Context: IXT may reconsider single-term-leader vs advanced mode given advanced is 6/6 deterministic and equally safe in practice (no split-brain) — that's our call, not asking you to change anything. This comment is to report the gate result you asked for and get your read on whether the residual is fundamental.

Author
Member

Diagnosed + fixed: the residual was a per-attempt randomization bug, not (mostly) inherent

Short answer to your open question: it was ~90% a fixable bug, ~10% inherent. openraft draws the election timeout once at node creation and freezes it — it never re-randomizes per attempt, which defeats standard Raft's split-vote resolution. Fixed on ixt-stable @ 4cd6b553. A small residual is genuinely inherent to single-term-leader, but it should now sit far inside your 15s budget.

Root cause

  • EngineConfig::new computes new_rand_election_timeout() once and stores a fixed timer_config.election_timeout; handle_tick_election only ever reads it — never re-rolled.
  • After a split, both equal-log survivors reset utime (they campaign) and re-arm the same frozen interval from ~the same instant → they re-split. Convergence then depends on incidental jitter, not fresh randomization → exactly the 2.6–8.3 s variable tail you measured.
  • single-term-leader (leader_id_std) makes two same-term votes for different candidates incomparable, so a split can't resolve at term T — it must escalate to T+1. Advanced mode's (term, node_id) total order breaks the tie in-term in one round-trip — that's why advanced is flat ~2.6 s.
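The ordering difference in the last bullet can be sketched as two comparison rules. This is a simplified model with hypothetical types (`AdvLeaderId`, `StdLeaderId`), written to show the total-order versus partial-order distinction rather than to mirror the fork's actual definitions.

```rust
use std::cmp::Ordering;

/// Advanced mode: (term, node_id) is a total order, so any two votes
/// in the same term can be ranked.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct AdvLeaderId {
    term: u64,
    node_id: u64,
}

/// Standard mode: at most one leader per term.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StdLeaderId {
    term: u64,
    voted_for: Option<u64>,
}

impl PartialOrd for StdLeaderId {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match self.term.cmp(&other.term) {
            Ordering::Equal => match (self.voted_for, other.voted_for) {
                (None, None) => Some(Ordering::Equal),
                (Some(_), None) => Some(Ordering::Greater),
                (None, Some(_)) => Some(Ordering::Less),
                (Some(a), Some(b)) if a == b => Some(Ordering::Equal),
                // Same term, different candidates: neither vote supersedes
                // the other, so the split can only resolve at a higher term.
                (Some(_), Some(_)) => None,
            },
            unequal => Some(unequal),
        }
    }
}
```

Two standard-mode votes for nodes 1 and 2 at term 7 compare as `None`, while the advanced-mode pair `(7, 2)` and `(7, 1)` has a definite winner in the same term.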

Upstream note

This is a latent regression upstream too: per-attempt re-randomization existed before v0.8.3 and was removed in 9ddb5715 (2023-03, "make vote private"), which is an ancestor of release-0.9, release-0.10, and main. Nothing to backport; we wrote the fix.

The fix (4cd6b553)

Re-randomize the election timeout after each timeout-driven campaign in handle_tick_election (standard Raft §5.2 — "reset the election timer to a fresh random value each election"). Retries now draw independent intervals, so a split breaks in ~1 round w.h.p. It's re-rolled only after dispatch (not per tick, so the deadline utime + election_timeout stays stable within a wait window — multiple ticks elapse per window), and lives in RaftCore because the Engine is a deterministic, RNG-free state machine. Changes only when a node campaigns, never who may win (the handle_vote_req gate is untouched). Advanced mode is unaffected.
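A dependency-free sketch of the re-roll-after-dispatch behaviour described above, for illustration only. `ElectionTimer`, `on_tick` and the tiny `XorShift64` generator are hypothetical and not what 4cd6b553 contains; the property shown is that the interval is stable across ticks within one wait window and is redrawn only once a campaign is dispatched.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crate.
/// Illustration only; the seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical timer that owns the randomized election interval.
struct ElectionTimer {
    min_ms: u64,
    max_ms: u64,
    current_ms: u64,
    rng: XorShift64,
}

impl ElectionTimer {
    /// Assumes `min_ms < max_ms` and a non-zero `seed`.
    fn new(min_ms: u64, max_ms: u64, seed: u64) -> Self {
        let mut t = ElectionTimer { min_ms, max_ms, current_ms: min_ms, rng: XorShift64(seed) };
        t.reroll();
        t
    }

    /// Draws a fresh interval in [min_ms, max_ms).
    fn reroll(&mut self) {
        self.current_ms = self.min_ms + self.rng.next_u64() % (self.max_ms - self.min_ms);
    }

    /// Called on every tick with the time since the last timer reset.
    /// Returns true when the node should campaign; the interval is
    /// redrawn only in that case, never on an ordinary tick.
    fn on_tick(&mut self, elapsed_since_reset_ms: u64) -> bool {
        if elapsed_since_reset_ms < self.current_ms {
            return false;
        }
        self.reroll();
        true
    }
}
```

Keeping the generator in the timer rather than in the election logic mirrors the split described above, where the randomness lives outside the deterministic engine.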

Validated here: full unit + integration matrix green in default and single-term-leader (no regression). The statistical proof is your gate.

Expected effect + the honest residual

Per-attempt re-randomization turns a near-certain re-split (frozen timeouts) into an independent geometric trial per round: P(two draws within one RTT δ) ≈ 2δ/W, where W = election_timeout_max − election_timeout_min. With your ~500 ms window and a single-digit-to-low-tens-ms in-process RTT, that's ~1–8% per round → typically resolves in one round; the tail decays geometrically. The multi-second tail should collapse to ≈ one election window + a round-trip, with small variance.
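The 2δ/W estimate can be checked numerically. For two independent uniform draws on a window of width W, the exact probability of landing within δ of each other is 1 − (1 − δ/W)², which is 2δ/W minus a small (δ/W)² term. The sketch below uses hypothetical function names, and the δ values in the note after it (2.5 ms and 20 ms) are illustrative picks for the two ends of the quoted RTT range, not measurements.

```rust
/// Exact probability that two independent uniform draws from a window of
/// width `window_ms` land within `delta_ms` of each other.
fn split_probability(delta_ms: f64, window_ms: f64) -> f64 {
    let r = (delta_ms / window_ms).clamp(0.0, 1.0);
    1.0 - (1.0 - r) * (1.0 - r)
}

/// Probability that `rounds` consecutive rounds all split, treating each
/// round as an independent trial.
fn tail_probability(p_split: f64, rounds: u32) -> f64 {
    p_split.powi(rounds as i32)
}
```

With W = 500 ms, δ = 2.5 ms gives about 1% per round and δ = 20 ms gives just under 8%; even at the high end, three consecutive splits have a probability below 0.1%.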

It will not be as deterministic as advanced mode — that's the irreducible 10%: std mode has no in-term tiebreak, so it always escalates T→T+1 on a split. If you want zero variance, advanced mode remains a legitimate, equally-safe choice (your call, as you noted).

Recommended timer profile for single-term-leader

With the fix, the lever is the width W = election_timeout_max − election_timeout_min relative to RTT: a wider W gives a lower per-round split probability. Aim for W ≥ ~10 × intra-cluster RTT; your 500 ms window is already comfortably wide. Keep election_timeout_min > heartbeat_interval (openraft validates this). A deterministic per-node offset is fine but no longer necessary, because the per-attempt re-roll does the work now.

Re your Pre-Vote-alone request

Still happy to cut a no-back-off rev if useful (d49ef291^ = Pre-Vote + single-term, pre-back-off). But the more useful comparison now is 4cd6b553 (Pre-Vote + back-off + per-attempt re-randomization) vs your prior 7a74abbe, on the same 22× sample — that isolates the impact of this fix.

Ask

Please pin ixt-stable @ 4cd6b553 and re-run test_leader_failover + test_data_integrity_under_chaos. I expect a large jump toward green. If any residual tail still crosses 15 s, the levers are widening W further or advanced mode.

cc IXT — re-randomization fix ready for the I195-B gate re-run.

## Diagnosed + fixed: the residual was a per-attempt randomization bug, not (mostly) inherent

**Short answer to your open question:** it was ~90% a fixable bug, ~10% inherent. openraft draws the election timeout **once at node creation and freezes it**; it never re-randomizes per attempt, which defeats standard Raft's split-vote resolution. Fixed on `ixt-stable` @ `4cd6b553`. A small residual is genuinely inherent to `single-term-leader`, but it should now sit far inside your 15s budget.

### Root cause

- `EngineConfig::new` computes `new_rand_election_timeout()` **once** and stores a fixed `timer_config.election_timeout`; `handle_tick_election` only ever reads it, and it is never re-rolled.
- After a split, both equal-log survivors reset `utime` (they campaign) and re-arm the **same frozen interval** from ~the same instant → they re-split. Convergence then depends on incidental jitter, not fresh randomization → exactly the 2.6–8.3 s variable tail you measured.
- `single-term-leader` (`leader_id_std`) makes two same-term votes for different candidates **incomparable**, so a split can't resolve at term T: it must escalate to T+1. Advanced mode's `(term, node_id)` total order breaks the tie *in-term* in one round-trip, which is why advanced is flat ~2.6 s.

### Upstream note

This is a **latent regression upstream too**: per-attempt re-randomization existed pre-`v0.8.3` and was removed in `9ddb5715` (2023-03, "make vote private"), an ancestor of release-0.9, release-0.10, **and** main. Nothing to backport; we wrote the fix.

### The fix (`4cd6b553`)

Re-randomize the election timeout after each timeout-driven campaign in `handle_tick_election` (standard Raft §5.2: "reset the election timer to a fresh random value each election"). Retries now draw **independent** intervals, so a split breaks in ~1 round w.h.p. It's re-rolled only after dispatch (not per tick, so the deadline `utime + election_timeout` stays stable within a wait window; multiple ticks elapse per window), and lives in RaftCore because the Engine is a deterministic, RNG-free state machine. Changes only **when** a node campaigns, never **who** may win (the `handle_vote_req` gate is untouched). Advanced mode is unaffected.

Validated here: full unit + integration matrix green in **default and single-term-leader** (no regression). The statistical proof is your gate.

### Expected effect + the honest residual

Per-attempt re-randomization turns a near-certain re-split (frozen timeouts) into an independent geometric trial per round: `P(two draws within one RTT δ) ≈ 2δ/W`, where `W = election_timeout_max − election_timeout_min`. With your ~500 ms window and a single-digit-to-low-tens-ms in-process RTT, that's ~1–8% per round → typically resolves in one round; the tail decays geometrically. The multi-second tail should collapse to ≈ one election window + a round-trip, with small variance.

It will **not** be as deterministic as advanced mode; that's the irreducible 10%: std mode has no in-term tiebreak, so it always escalates T→T+1 on a split. If you want zero variance, advanced mode remains a legitimate, equally-safe choice (your call, as you noted).

### Recommended timer profile for single-term-leader

With the fix, the lever is the **width** `W = election_timeout_max − election_timeout_min` relative to RTT: wider W → lower per-round split probability. Aim for `W ≥ ~10 × intra-cluster RTT`; your 500 ms window is already comfortably wide. Keep `election_timeout_min > heartbeat_interval` (openraft validates this). A deterministic per-node offset is fine but no longer necessary; the per-attempt re-roll does the work now.

### Re your Pre-Vote-alone request

Still happy to cut a no-back-off rev if useful (`d49ef291^` = Pre-Vote + single-term, pre-back-off). But the more useful comparison now is **`4cd6b553`** (Pre-Vote + back-off + per-attempt re-randomization) vs your prior `7a74abbe`, on the same 22× sample, which isolates the impact of *this* fix.

### Ask

Please pin `ixt-stable` @ `4cd6b553` and re-run `test_leader_failover` + `test_data_integrity_under_chaos`. I expect a large jump toward green. If any residual tail still crosses 15 s, the levers are widening `W` further or advanced mode.

cc IXT: re-randomization fix ready for the I195-B gate re-run.
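
For anyone who wants to sanity-check the per-round split estimate `P ≈ 2δ/W` from the comment above, here is a minimal Rust sketch. It is not openraft code: the function names are hypothetical, and the RTT values (2.5, 10 and 20 ms) are illustrative assumptions picked to span the "single-digit-to-low-tens-ms" range. Only the 500 ms window comes from this thread. The sketch also shows the exact value for two independent uniform draws, `1 − (1 − δ/W)²`, which `2δ/W` approximates when δ is much smaller than W.

```rust
/// Approximate probability that two independent uniform draws from a window
/// of width `w_ms` land within `delta_ms` of each other (good for delta << W).
fn split_probability_approx(delta_ms: f64, w_ms: f64) -> f64 {
    (2.0 * delta_ms / w_ms).min(1.0)
}

/// Exact probability for two independent uniform draws: 1 - (1 - delta/W)^2.
fn split_probability_exact(delta_ms: f64, w_ms: f64) -> f64 {
    let r = (delta_ms / w_ms).clamp(0.0, 1.0);
    1.0 - (1.0 - r) * (1.0 - r)
}

/// Probability that `rounds` consecutive independent rounds all split.
fn still_split_after(p: f64, rounds: u32) -> f64 {
    p.powi(rounds as i32)
}

/// Expected number of election rounds until one does not split
/// (mean of a geometric distribution; infinite if p == 1).
fn expected_rounds(p: f64) -> f64 {
    1.0 / (1.0 - p)
}

fn main() {
    // W = election_timeout_max - election_timeout_min (500 ms in this thread).
    let w_ms = 500.0;
    // Assumed RTT values, for illustration only.
    for delta_ms in [2.5, 10.0, 20.0] {
        let approx = split_probability_approx(delta_ms, w_ms);
        let exact = split_probability_exact(delta_ms, w_ms);
        println!(
            "delta = {:.1} ms: approx = {:.4}, exact = {:.4}, \
             P(still split after 3 rounds) = {:.6}, expected rounds = {:.3}",
            delta_ms,
            approx,
            exact,
            still_split_after(approx, 3),
            expected_rounds(approx)
        );
    }
}
```

Under these assumptions the per-round split probability runs from 1% (δ = 2.5 ms) to 8% (δ = 20 ms), in line with the ~1–8% quoted above. Because each re-rolled round is independent, the chance of three consecutive splits is the cube of that figure.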
Author
Member

Update — confirmed green in production CI

IXT pinned ixt-stable @ 590558b8 (your combined rev: single-term-leader + Pre-Vote cfb20a8c + greater-log back-off + I189 Conflict prev_log_id=None + change_membership, plus the per-attempt election-timeout re-randomization in 4cd6b553/c88a2ef9).

IXT main-integration is fully green on main with single-term-leader enabled — run #2182 (2h59m): main-test, the spawned-process distributed-tests, and benchmark all pass. The leader-failover chaos gate (test_leader_failover + test_data_integrity_under_chaos) is 16/16 locally (was failing/flaky before this rev).

The term-thrash livelock from this issue is fully resolved. Thank you — resolved from IXT's side, closing-worthy.

Author
Member

Closing — delivered and confirmed green in IXT production CI

The combined rev was delivered and then refined to its final form on ixt-stable:
single-term-leader (16070887) + Pre-Vote (cfb20a8c) + sticky greater-log back-off (d49ef291) + I189 replicator fix (f5339a0f) + the per-attempt election-timeout re-randomization (4cd6b553/c88a2ef9) that resolved the residual split-vote latency tail this thread surfaced.

Per IXT's confirmation above (comment 1320): the term-thrash livelock is fully resolved, the leader-failover chaos gate is 16/16, and IXT main-integration is green on main with single-term-leader enabled (run #2182). Closing as resolved.

Reference
h-dv/openraft#1