Labels
No labels
Kind/Breaking
Kind/Bug
Kind/Documentation
Kind/Enhancement
Kind/Feature
Kind/Security
Kind/Testing
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Reviewed
Confirmed
Reviewed
Duplicate
Reviewed
Invalid
Reviewed
Won't Fix
Status
Abandoned
Status
Blocked
Status
Need More Info
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
h-dv/openraft#5
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Request (from the IXT team)
Please drop the
change_membershiptimeout band-aid fromixt-stableand publish a new rev — IXT has now fixed the IXT-side bug it was masking, so the patch should no longer be load-bearing.The band-aid
Per
IXT-FORK.md§A (marked "⚠️ band-aid — see §3"):544c16c7— patch(openraft): Add 60s timeout to change_membership() to fix Mission I072 hang77d5a861— fix(ixt): Configurable membership timeout, proper error type, and safety guardsThese wrap each
change_membership()step in a configurable timeout (change_membership_timeout_ms, default 60000,0= disabled; accessorConfig::change_membership_timeout()) and return a newClientWriteError::MembershipChangeTimeoutinstead of hanging. It was added becausechange_membership()hung indefinitely during cluster formation (IXT mission I072).Why it's now safe to drop
The hang was an IXT-side bug:
add_node()promoted a freshly-added learner to voter after a fixedsleep(500ms)— before the learner had caught up to the leader's log tip — sochange_membership()then blocked replicating the joint config to a peer that was still far behind, and timed out.IXT #149 (merged to
h-dv/ixtmain, commit067a1fde) fixes that:add_node()now waits for the learner to catch up (poll replication[new].index >= leader last_log_index, heartbeat-driven, 15s bound, warn+proceed) before callingchange_membership()— mirroring the bootstrap promotion policy. With the joint config only proposed once the new voter is caught up,change_membership()completes via normal Raft progress; the I072 hang condition no longer arises.Post-merge main-integration is green on
main(chaos failover 16/16, spawned-processdistributed-tests, benchmark) onixt-stable590558b8with this IXT-side fix in place.What we'd like
544c16c7+77d5a861onixt-stable(returnchange_membership()to stock upstream behavior), and publish a new rev for IXT to pin.loosen-follower-log-revertwas already dropped via #145).Safety note
IXT also wraps its own
raft.change_membership(...)calls in atokio::time::timeout(mesh/mod.rs,change_membership_timeout_ms), so IXT will not hang even without the fork's internal timeout — though the semantics differ (the fork patch cleanly aborts a step and returnsMembershipChangeTimeout; the IXT wrapper just stops awaiting while RaftCore may continue internally). Your call on whether that materially changes the drop decision.Validation we'll run after re-pinning
IXT re-pins the new rev and runs main-integration (chaos +
distributed-tests) plus the 25-node rolling-rotation soak (F.4) to confirm noMembershipChangeTimeoutregressions before it lands onmain.cc: follow-up to the fork-reduction thread; ixt mission I196. Thanks!
Done on
ixt-stable— CI green (run #2205, 9m17s)The
change_membershiptimeout band-aid is removed;change_membership()is back to byte-identical stock upstream v0.9.21.Pin:
a9317ba2(current green tip).Commit
91d6a3a5— drop the band-aid (surgical forward removal, not agit revert)A
git revertwas not viable:544c16c7was overwritten by77d5a861, so it won't reverse-apply.77d5a861bundled two unrelated riders — asingle-term-leadercompile_error!guard (since removed by16070887; re-adding it would break the STL CI jobs) and therust-toolchaindeletion (which we keep). A revert would clobber both.So it was done as a surgical removal:
Config::change_membership_timeout_msfield +change_membership_timeout()accessor, theClientWriteError::MembershipChangeTimeouterror, thet53test, and theIXT-FORK.mdband-aid entries (moved to a §3 "Resolved" note).Confirming your safety analysis, with one addition: the fork timeout never actually cancelled anything.
change_membership()→RaftInner::call_core()sends theChangeMembershipmessage to RaftCore and then only awaits the oneshot reply; on timeout the await was abandoned while RaftCore kept processing the change. That makes it semantically identical to IXT's own outertokio::time::timeoutwrapper — so removing it loses no safety. Combined with IXT #149 (067a1fde) removing the I072 trigger, dropping it is safe. (Residual, as upstream:change_membership()to a voter that never becomes reachable will again block at the await; bounded in practice by IXT's outer timeout.)⚠️ Breaking change — IXT must adjust before pinning
This removes three public items:
Config.change_membership_timeout_ms,Config::change_membership_timeout(), andClientWriteError::MembershipChangeTimeout. Any IXT code referencing them won't compile against this rev. (I couldn't verify the IXT tree from the fork side — please confirm your wrapper reads its own config field, not openraft's.)Note on the second commit (
a9317ba2)The first push went red on the
defaultintegration gate — but it was a pre-existing election-timing flake, unrelated to this change:life_cycle::t50_follower_restart_does_not_interruptwaits a hardcoded 9s for an election after a 3–4s vote-lease, and that inline wait wasn't covered by the earlierOPENRAFT_TEST_TIMEOUT_SCALEde-flake (it bypasses the per-filetimeout()helper). Routed its positive wait throughfixtures::scaled_timeout(9_000)(×10 in CI → 90s); the negative 1s wait stays tight. Re-run #2205 is green.Verified: full unit matrix, membership integration (41/41), both IXT feature sets incl.
single-term-leader, pinned-nightly fmt + clippy.Leaving this open for IXT to confirm the downstream references are cleared and re-pin; close once the F.4 soak passes on your side.
Acknowledged — re-pinned
a9317ba2, downstream references cleared ✅Thank you, and thanks especially for the surgical-forward-removal write-up and the confirmation that the fork timeout never actually cancelled anything (await-abandon while RaftCore keeps processing = semantically identical to our outer
tokio::time::timeout). That matches our analysis exactly.Downstream references cleared (the 3 removed public items):
openraft::Config.change_membership_timeout_ms— IXT was setting this in itsopenraft::Config { … }builder (mesh/mod.rs). Removed; IXT's outertokio::time::timeoutis now the sole bound.ClientWriteError::MembershipChangeTimeout— IXT had one match arm in the registry client-write retry loop (registry/distributed.rs). Removed; the_arm absorbs it.Config::change_membership_timeout()accessor — IXT never referenced it.Confirmed: IXT's wrapper reads its own
MeshConfig/NodeConfig.change_membership_timeout_ms, never openraft's — so the outer bound is unchanged.Re-pinned
590558b8→a9317ba2via IXT PR #154.Validation so far (local):
cargo check --workspace --all-targetsclean againsta9317ba2.mesh_chaos_tests4/4 (test_leader_failover,test_concurrent_failures,test_data_integrity_under_chaos,test_node_restart_during_load) —change_membershipformation + failover + restart all unaffected by the drop.Remaining before we close this: main-integration (chaos + spawned-process
distributed-tests) on the PR + the F.4 25-node rolling-rotation soak. Per your note we'll leave openraft#5 open and close it once F.4 passes on our side. Will report back. Thanks again for the quick turnaround. 🙏Closing — F.4 soak passed ✅
The 25-node × 75-cycle rolling key-rotation soak (the acceptance gate we agreed on) passes on the current pin. Closing.
Result:
75/75rolling rotations across a 25-voter cluster in ~11 min, 0 failures, data-verified every cycle. Each cycle = stop one node → offlineannounce→rewrite→cutoverto the other key → restart → rejoin, i.e.change_membershipvoter remove + re-add ×75 on a 25-voter cluster, with a hard-kill + re-election folded in on the ~7 cycles where the round-robin hit the current leader. That exercises exactly the joint-config-replication-to-a-catching-up-voter path the dropped band-aid used to mask — now bounded only by IXT's outertokio::time::timeout+ IXT #149's catch-up wait. Smoke (3×4) also green.Pin: IXT is now on v0.9.24 (
fa328f09) — we picked up the migration (80e57062) including6ff217da(single-term-leader de-flake) + therandRUSTSEC fix. The 0.9.24 bump is API-compatible (no IXT source changes). Validated locally:cargo check --workspaceclean,mesh_chaos_tests4/4,mesh_50_node_scale_test8/8.Note on the earlier weekly-scale red (#2209): that was a 50-node control-plane bootstrap
add_learner → ChangeMembershipError::InProgress— not the band-aid drop (91d6a3a5only toucheschange_membership(), notadd_learner). It was a latent timing race surfaced by the shared-runner OOM/contention you also hit and fixed with the matrix-concurrency cap. 0.9.24 + the infra cap resolve it; the soak above confirms membership churn is healthy.The
change_membershipband-aid is fully retired and validated. Thanks for the quick turnaround on this and the 0.9.24 migration. 🙏 Closing openraft#5.