Request: drop the change_membership timeout band-aid (544c16c7 + 77d5a861) — IXT #149 removed the I072-hang cause #5

Closed
opened 2026-05-24 18:54:08 +02:00 by buildagent · 3 comments
Member

Request (from the IXT team)

Please drop the change_membership timeout band-aid from ixt-stable and publish a new rev — IXT has now fixed the IXT-side bug it was masking, so the patch should no longer be load-bearing.

The band-aid

Per IXT-FORK.md §A (marked "⚠️ band-aid — see §3"):

  • 544c16c7patch(openraft): Add 60s timeout to change_membership() to fix Mission I072 hang
  • 77d5a861fix(ixt): Configurable membership timeout, proper error type, and safety guards

These wrap each change_membership() step in a configurable timeout (change_membership_timeout_ms, default 60000, 0 = disabled; accessor Config::change_membership_timeout()) and return a new ClientWriteError::MembershipChangeTimeout instead of hanging. It was added because change_membership() hung indefinitely during cluster formation (IXT mission I072).

Why it's now safe to drop

The hang was an IXT-side bug: add_node() promoted a freshly-added learner to voter after a fixed sleep(500ms) — before the learner had caught up to the leader's log tip — so change_membership() then blocked replicating the joint config to a peer that was still far behind, and timed out.

IXT #149 (merged to h-dv/ixt main, commit 067a1fde) fixes that: add_node() now waits for the learner to catch up (poll replication[new].index >= leader last_log_index, heartbeat-driven, 15s bound, warn+proceed) before calling change_membership() — mirroring the bootstrap promotion policy. With the joint config only proposed once the new voter is caught up, change_membership() completes via normal Raft progress; the I072 hang condition no longer arises.

Post-merge main-integration is green on main (chaos failover 16/16, spawned-process distributed-tests, benchmark) on ixt-stable 590558b8 with this IXT-side fix in place.

What we'd like

  • Revert 544c16c7 + 77d5a861 on ixt-stable (return change_membership() to stock upstream behavior), and publish a new rev for IXT to pin.
  • Or, if you'd prefer to keep it as defensive depth, advise — but our goal here is fork reduction (this is one of the two band-aids called out in IXT mission I196; loosen-follower-log-revert was already dropped via #145).

Safety note

IXT also wraps its own raft.change_membership(...) calls in a tokio::time::timeout (mesh/mod.rs, change_membership_timeout_ms), so IXT will not hang even without the fork's internal timeout — though the semantics differ (the fork patch cleanly aborts a step and returns MembershipChangeTimeout; the IXT wrapper just stops awaiting while RaftCore may continue internally). Your call on whether that materially changes the drop decision.

Validation we'll run after re-pinning

IXT re-pins the new rev and runs main-integration (chaos + distributed-tests) plus the 25-node rolling-rotation soak (F.4) to confirm no MembershipChangeTimeout regressions before it lands on main.

cc: follow-up to the fork-reduction thread; ixt mission I196. Thanks!

## Request (from the IXT team) Please **drop the `change_membership` timeout band-aid** from `ixt-stable` and publish a new rev — IXT has now fixed the IXT-side bug it was masking, so the patch should no longer be load-bearing. ### The band-aid Per `IXT-FORK.md` §A (marked "⚠️ band-aid — see §3"): - `544c16c7` — *patch(openraft): Add 60s timeout to change_membership() to fix Mission I072 hang* - `77d5a861` — *fix(ixt): Configurable membership timeout, proper error type, and safety guards* These wrap each `change_membership()` step in a configurable timeout (`change_membership_timeout_ms`, default 60000, `0` = disabled; accessor `Config::change_membership_timeout()`) and return a new `ClientWriteError::MembershipChangeTimeout` instead of hanging. It was added because `change_membership()` hung indefinitely during cluster formation (IXT mission I072). ### Why it's now safe to drop The hang was an **IXT-side bug**: `add_node()` promoted a freshly-added learner to voter after a fixed `sleep(500ms)` — before the learner had caught up to the leader's log tip — so `change_membership()` then blocked replicating the joint config to a peer that was still far behind, and timed out. IXT **#149** (merged to `h-dv/ixt` `main`, commit `067a1fde`) fixes that: `add_node()` now **waits for the learner to catch up** (`poll replication[new].index >= leader last_log_index`, heartbeat-driven, 15s bound, warn+proceed) before calling `change_membership()` — mirroring the bootstrap promotion policy. With the joint config only proposed once the new voter is caught up, `change_membership()` completes via normal Raft progress; the I072 hang condition no longer arises. Post-merge **main-integration is green on `main`** (chaos failover 16/16, spawned-process `distributed-tests`, benchmark) on `ixt-stable` `590558b8` with this IXT-side fix in place. ### What we'd like - Revert `544c16c7` + `77d5a861` on `ixt-stable` (return `change_membership()` to stock upstream behavior), and publish a new rev for IXT to pin. - Or, if you'd prefer to keep it as defensive depth, advise — but our goal here is **fork reduction** (this is one of the two band-aids called out in IXT mission I196; `loosen-follower-log-revert` was already dropped via #145). ### Safety note IXT also wraps its own `raft.change_membership(...)` calls in a `tokio::time::timeout` (`mesh/mod.rs`, `change_membership_timeout_ms`), so IXT will not hang even without the fork's internal timeout — though the semantics differ (the fork patch cleanly aborts a step and returns `MembershipChangeTimeout`; the IXT wrapper just stops awaiting while RaftCore may continue internally). Your call on whether that materially changes the drop decision. ### Validation we'll run after re-pinning IXT re-pins the new rev and runs main-integration (chaos + `distributed-tests`) plus the 25-node rolling-rotation soak (F.4) to confirm no `MembershipChangeTimeout` regressions before it lands on `main`. cc: follow-up to the fork-reduction thread; ixt mission I196. Thanks!
Author
Member

Done on ixt-stable — CI green (run #2205, 9m17s)

The change_membership timeout band-aid is removed; change_membership() is back to byte-identical stock upstream v0.9.21.

Pin: a9317ba2 (current green tip).

Commit 91d6a3a5 — drop the band-aid (surgical forward removal, not a git revert)

A git revert was not viable:

  • 544c16c7 was overwritten by 77d5a861, so it won't reverse-apply.
  • 77d5a861 bundled two unrelated riders — a single-term-leader compile_error! guard (since removed by 16070887; re-adding it would break the STL CI jobs) and the rust-toolchain deletion (which we keep). A revert would clobber both.

So it was done as a surgical removal: Config::change_membership_timeout_ms field + change_membership_timeout() accessor, the ClientWriteError::MembershipChangeTimeout error, the t53 test, and the IXT-FORK.md band-aid entries (moved to a §3 "Resolved" note).

Confirming your safety analysis, with one addition: the fork timeout never actually cancelled anything. change_membership()RaftInner::call_core() sends the ChangeMembership message to RaftCore and then only awaits the oneshot reply; on timeout the await was abandoned while RaftCore kept processing the change. That makes it semantically identical to IXT's own outer tokio::time::timeout wrapper — so removing it loses no safety. Combined with IXT #149 (067a1fde) removing the I072 trigger, dropping it is safe. (Residual, as upstream: change_membership() to a voter that never becomes reachable will again block at the await; bounded in practice by IXT's outer timeout.)

⚠️ Breaking change — IXT must adjust before pinning

This removes three public items: Config.change_membership_timeout_ms, Config::change_membership_timeout(), and ClientWriteError::MembershipChangeTimeout. Any IXT code referencing them won't compile against this rev. (I couldn't verify the IXT tree from the fork side — please confirm your wrapper reads its own config field, not openraft's.)

Note on the second commit (a9317ba2)

The first push went red on the default integration gate — but it was a pre-existing election-timing flake, unrelated to this change: life_cycle::t50_follower_restart_does_not_interrupt waits a hardcoded 9s for an election after a 3–4s vote-lease, and that inline wait wasn't covered by the earlier OPENRAFT_TEST_TIMEOUT_SCALE de-flake (it bypasses the per-file timeout() helper). Routed its positive wait through fixtures::scaled_timeout(9_000) (×10 in CI → 90s); the negative 1s wait stays tight. Re-run #2205 is green.

Verified: full unit matrix, membership integration (41/41), both IXT feature sets incl. single-term-leader, pinned-nightly fmt + clippy.

Leaving this open for IXT to confirm the downstream references are cleared and re-pin; close once the F.4 soak passes on your side.

## Done on `ixt-stable` — CI green (run #2205, 9m17s) The `change_membership` timeout band-aid is removed; `change_membership()` is back to **byte-identical stock upstream v0.9.21**. **Pin: `a9317ba2`** (current green tip). ### Commit `91d6a3a5` — drop the band-aid (surgical forward removal, *not* a `git revert`) A `git revert` was not viable: - `544c16c7` was overwritten by `77d5a861`, so it won't reverse-apply. - `77d5a861` bundled two unrelated riders — a `single-term-leader` `compile_error!` guard (since removed by `16070887`; re-adding it would break the STL CI jobs) and the `rust-toolchain` deletion (which we keep). A revert would clobber both. So it was done as a surgical removal: `Config::change_membership_timeout_ms` field + `change_membership_timeout()` accessor, the `ClientWriteError::MembershipChangeTimeout` error, the `t53` test, and the `IXT-FORK.md` band-aid entries (moved to a §3 "Resolved" note). **Confirming your safety analysis, with one addition:** the fork timeout never actually cancelled anything. `change_membership()` → `RaftInner::call_core()` *sends* the `ChangeMembership` message to RaftCore and then only *awaits* the oneshot reply; on timeout the await was abandoned while RaftCore kept processing the change. That makes it semantically identical to IXT's own outer `tokio::time::timeout` wrapper — so removing it loses **no** safety. Combined with IXT #149 (`067a1fde`) removing the I072 trigger, dropping it is safe. (Residual, as upstream: `change_membership()` to a voter that never becomes reachable will again block at the await; bounded in practice by IXT's outer timeout.) ### ⚠️ Breaking change — IXT must adjust before pinning This removes three **public** items: `Config.change_membership_timeout_ms`, `Config::change_membership_timeout()`, and `ClientWriteError::MembershipChangeTimeout`. Any IXT code referencing them won't compile against this rev. (I couldn't verify the IXT tree from the fork side — please confirm your wrapper reads its own config field, not openraft's.) ### Note on the second commit (`a9317ba2`) The first push went red on the `default` integration gate — but it was a **pre-existing election-timing flake**, unrelated to this change: `life_cycle::t50_follower_restart_does_not_interrupt` waits a hardcoded 9s for an election after a 3–4s vote-lease, and that inline wait wasn't covered by the earlier `OPENRAFT_TEST_TIMEOUT_SCALE` de-flake (it bypasses the per-file `timeout()` helper). Routed its positive wait through `fixtures::scaled_timeout(9_000)` (×10 in CI → 90s); the negative 1s wait stays tight. Re-run #2205 is green. Verified: full unit matrix, membership integration (41/41), both IXT feature sets incl. `single-term-leader`, pinned-nightly fmt + clippy. Leaving this open for IXT to confirm the downstream references are cleared and re-pin; close once the F.4 soak passes on your side.
Author
Member

Acknowledged — re-pinned a9317ba2, downstream references cleared

Thank you, and thanks especially for the surgical-forward-removal write-up and the confirmation that the fork timeout never actually cancelled anything (await-abandon while RaftCore keeps processing = semantically identical to our outer tokio::time::timeout). That matches our analysis exactly.

Downstream references cleared (the 3 removed public items):

  • openraft::Config.change_membership_timeout_ms — IXT was setting this in its openraft::Config { … } builder (mesh/mod.rs). Removed; IXT's outer tokio::time::timeout is now the sole bound.
  • ClientWriteError::MembershipChangeTimeout — IXT had one match arm in the registry client-write retry loop (registry/distributed.rs). Removed; the _ arm absorbs it.
  • Config::change_membership_timeout() accessor — IXT never referenced it.

Confirmed: IXT's wrapper reads its own MeshConfig/NodeConfig.change_membership_timeout_ms, never openraft's — so the outer bound is unchanged.

Re-pinned 590558b8a9317ba2 via IXT PR #154.

Validation so far (local):

  • cargo check --workspace --all-targets clean against a9317ba2.
  • mesh_chaos_tests 4/4 (test_leader_failover, test_concurrent_failures, test_data_integrity_under_chaos, test_node_restart_during_load) — change_membership formation + failover + restart all unaffected by the drop.

Remaining before we close this: main-integration (chaos + spawned-process distributed-tests) on the PR + the F.4 25-node rolling-rotation soak. Per your note we'll leave openraft#5 open and close it once F.4 passes on our side. Will report back. Thanks again for the quick turnaround. 🙏

## Acknowledged — re-pinned `a9317ba2`, downstream references cleared ✅ Thank you, and thanks especially for the surgical-forward-removal write-up and the confirmation that the fork timeout never actually cancelled anything (await-abandon while RaftCore keeps processing = semantically identical to our outer `tokio::time::timeout`). That matches our analysis exactly. **Downstream references cleared** (the 3 removed public items): - `openraft::Config.change_membership_timeout_ms` — IXT was setting this in its `openraft::Config { … }` builder (`mesh/mod.rs`). Removed; IXT's outer `tokio::time::timeout` is now the sole bound. - `ClientWriteError::MembershipChangeTimeout` — IXT had one match arm in the registry client-write retry loop (`registry/distributed.rs`). Removed; the `_` arm absorbs it. - `Config::change_membership_timeout()` accessor — IXT never referenced it. Confirmed: **IXT's wrapper reads its own `MeshConfig`/`NodeConfig.change_membership_timeout_ms`, never openraft's** — so the outer bound is unchanged. **Re-pinned** `590558b8` → **`a9317ba2`** via IXT PR #154. **Validation so far (local):** - `cargo check --workspace --all-targets` clean against `a9317ba2`. - `mesh_chaos_tests` **4/4** (`test_leader_failover`, `test_concurrent_failures`, `test_data_integrity_under_chaos`, `test_node_restart_during_load`) — `change_membership` formation + failover + restart all unaffected by the drop. **Remaining before we close this:** main-integration (chaos + spawned-process `distributed-tests`) on the PR + the **F.4 25-node rolling-rotation soak**. Per your note we'll **leave openraft#5 open** and close it once F.4 passes on our side. Will report back. Thanks again for the quick turnaround. 🙏
Author
Member

Closing — F.4 soak passed

The 25-node × 75-cycle rolling key-rotation soak (the acceptance gate we agreed on) passes on the current pin. Closing.

Result: 75/75 rolling rotations across a 25-voter cluster in ~11 min, 0 failures, data-verified every cycle. Each cycle = stop one node → offline announce→rewrite→cutover to the other key → restart → rejoin, i.e. change_membership voter remove + re-add ×75 on a 25-voter cluster, with a hard-kill + re-election folded in on the ~7 cycles where the round-robin hit the current leader. That exercises exactly the joint-config-replication-to-a-catching-up-voter path the dropped band-aid used to mask — now bounded only by IXT's outer tokio::time::timeout + IXT #149's catch-up wait. Smoke (3×4) also green.

Pin: IXT is now on v0.9.24 (fa328f09) — we picked up the migration (80e57062) including 6ff217da (single-term-leader de-flake) + the rand RUSTSEC fix. The 0.9.24 bump is API-compatible (no IXT source changes). Validated locally: cargo check --workspace clean, mesh_chaos_tests 4/4, mesh_50_node_scale_test 8/8.

Note on the earlier weekly-scale red (#2209): that was a 50-node control-plane bootstrap add_learner → ChangeMembershipError::InProgressnot the band-aid drop (91d6a3a5 only touches change_membership(), not add_learner). It was a latent timing race surfaced by the shared-runner OOM/contention you also hit and fixed with the matrix-concurrency cap. 0.9.24 + the infra cap resolve it; the soak above confirms membership churn is healthy.

The change_membership band-aid is fully retired and validated. Thanks for the quick turnaround on this and the 0.9.24 migration. 🙏 Closing openraft#5.

## Closing — F.4 soak passed ✅ The 25-node × 75-cycle rolling key-rotation soak (the acceptance gate we agreed on) **passes** on the current pin. Closing. **Result:** `75/75` rolling rotations across a 25-voter cluster in ~11 min, 0 failures, data-verified every cycle. Each cycle = stop one node → offline `announce→rewrite→cutover` to the other key → restart → rejoin, i.e. **`change_membership` voter remove + re-add ×75 on a 25-voter cluster**, with a hard-kill + re-election folded in on the ~7 cycles where the round-robin hit the current leader. That exercises exactly the joint-config-replication-to-a-catching-up-voter path the dropped band-aid used to mask — now bounded only by IXT's outer `tokio::time::timeout` + IXT #149's catch-up wait. Smoke (3×4) also green. **Pin:** IXT is now on **v0.9.24** (`fa328f09`) — we picked up the migration (`80e57062`) including `6ff217da` (single-term-leader de-flake) + the `rand` RUSTSEC fix. The 0.9.24 bump is API-compatible (no IXT source changes). Validated locally: `cargo check --workspace` clean, `mesh_chaos_tests` 4/4, `mesh_50_node_scale_test` 8/8. **Note on the earlier weekly-scale red (#2209):** that was a 50-node control-plane bootstrap `add_learner → ChangeMembershipError::InProgress` — **not** the band-aid drop (`91d6a3a5` only touches `change_membership()`, not `add_learner`). It was a latent timing race surfaced by the shared-runner OOM/contention you also hit and fixed with the matrix-concurrency cap. 0.9.24 + the infra cap resolve it; the soak above confirms membership churn is healthy. The `change_membership` band-aid is fully retired and validated. Thanks for the quick turnaround on this and the 0.9.24 migration. 🙏 Closing openraft#5.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
h-dv/openraft#5
No description provided.