fix(daemon): self-healing reconnect + root-identity handshake (issue #7) #8

Merged
buildagent merged 1 commit from fix/issue-7-daemon-reconnect-identity into master 2026-06-08 16:05:43 +02:00

Summary

Fixes #7 — a workspace-links user reported the primary MCP server "attaching to a linked project's daemon", with every tool call then failing -32603 sending <op> until the MCP server was restarted.

A deep multi-agent review against the current code found that the headline was a misread: per-root lockfile discovery is structurally incapable of attaching to a link's port, and because the attach log carried no project tag, the link's own attach line read identically to the primary's. Two real, confirmed defects nevertheless underlie the symptom.

Root causes

  1. RpcIndex never reconnects. It cached a single TcpStream with no health re-check, so once its daemon died, was killed, or restarted onto a new port, every subsequent call wrote to a dead socket forever. This is the actual cause of the reported outage.
  2. No root-identity on the handshake. Stats and the lockfile payload carried no root, so attach/reconnect could in principle latch onto any code-index daemon on a given port.

Fixes (all wire-compatible — new fields are Option + skip_if_none)

  • RpcIndex::connect_with_reconnect(root, port): on a connection-level failure, drop the dead channel, re-read the lockfile, reconnect to the current port, verify the daemon serves root, and retry once. Bare connect(port) keeps the old no-reconnect behavior (tests, lifetime owners). Retry is connection-errors-only (never an RpcError from dispatch) and capped at one attempt.
  • root in the stats response (stamped by the daemon dispatch layer), verified on attach (main.rs) and reconnect (rpc_index.rs). A daemon that omits root (pre-0.5.0) is accepted — unverifiable, not wrong — so no respawn churn against old daemons.
  • root in the lockfile payload; Lockfile::read rejects a payload describing a different project (aliased/symlinked .code-index/).
  • Observability: log the full anyhow chain at warn before mapping the ~19 MCP tool-handler internal errors (previously a bare e.to_string() was returned and nothing was logged); tag attach logs with root (the missing project tag that drove the misdiagnosis); raise the daemon transport-failure log to warn.
  • resolve_links warns when a link is nested under the primary root.
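
The retry rule in the first bullet (retry on connection-level failures only, at most once) can be sketched roughly as follows. This is a minimal illustration, not the code in rpc_index.rs: `CallError`, `Channel`, `Reconnecting`, and `Scripted` are hypothetical names, and the `reconnect` closure stands in for "re-read the lockfile, dial the current port, verify the root".

```rust
use std::collections::VecDeque;

/// Hypothetical error split: transport failures vs. errors returned by dispatch.
#[derive(Debug, PartialEq)]
pub enum CallError {
    /// The socket is unusable (daemon died, restarted, or moved ports).
    Connection(String),
    /// The daemon answered with an application-level error; never retried.
    Rpc(String),
}

/// Hypothetical channel abstraction so the sketch runs without a real daemon.
pub trait Channel {
    fn call(&mut self, op: &str) -> Result<String, CallError>;
}

/// Wraps a channel and heals it at most once per call.
pub struct Reconnecting<C, F> {
    chan: Option<C>,
    reconnect: F,
}

impl<C: Channel, F: FnMut() -> Result<C, CallError>> Reconnecting<C, F> {
    pub fn new(chan: C, reconnect: F) -> Self {
        Self { chan: Some(chan), reconnect }
    }

    pub fn call(&mut self, op: &str) -> Result<String, CallError> {
        let first = match self.chan.as_mut() {
            Some(c) => c.call(op),
            None => Err(CallError::Connection("no channel".to_string())),
        };
        match first {
            Err(CallError::Connection(_)) => {
                // Drop the dead channel before trying to replace it.
                self.chan = None;
                let mut fresh = (self.reconnect)()?;
                // Exactly one retry; a second connection failure is returned as-is.
                let result = fresh.call(op);
                if !matches!(result, Err(CallError::Connection(_))) {
                    self.chan = Some(fresh);
                }
                result
            }
            // Success and dispatch-level errors pass through untouched.
            other => other,
        }
    }
}

/// Test double: replays scripted results, then behaves like a closed socket.
pub struct Scripted(pub VecDeque<Result<String, CallError>>);

impl Channel for Scripted {
    fn call(&mut self, _op: &str) -> Result<String, CallError> {
        self.0
            .pop_front()
            .unwrap_or_else(|| Err(CallError::Connection("closed".to_string())))
    }
}
```

Keeping the retry inside a wrapper type, rather than inside every caller, mirrors the opt-in boundary described above: code that uses the bare channel never reconnects.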

Compatibility

Every wire change is backward/forward compatible: new MCP + old daemon (no root → accepted), old MCP + new daemon (extra field ignored by serde). Rejection fires only on a confirmed mismatch, never on an absent field — so a new server won't needlessly respawn healthy old daemons.
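
The acceptance rule described here (an absent root is unverifiable and accepted; only a confirmed mismatch is rejected) could look something like the sketch below. `RootCheck`, `check_root`, and `should_attach` are hypothetical names chosen for illustration, and the sketch compares paths as given; whether the real code canonicalizes them first is not stated in this PR.

```rust
use std::path::Path;

/// Outcome of comparing the root we expect with the root a daemon reports.
#[derive(Debug, PartialEq)]
pub enum RootCheck {
    /// The daemon reported the same root.
    Verified,
    /// The daemon reported no root (for example a pre-0.5.0 daemon).
    Unverifiable,
    /// The daemon reported a different root.
    Mismatch,
}

/// Classify a reported root. Assumption: paths are compared as given,
/// with no canonicalization or symlink resolution.
pub fn check_root(expected: &Path, reported: Option<&Path>) -> RootCheck {
    match reported {
        None => RootCheck::Unverifiable,
        Some(r) if r == expected => RootCheck::Verified,
        Some(_) => RootCheck::Mismatch,
    }
}

/// Reject only on a confirmed mismatch, never on an absent field.
pub fn should_attach(expected: &Path, reported: Option<&Path>) -> bool {
    check_root(expected, reported) != RootCheck::Mismatch
}
```

Treating the absent case as its own variant, rather than folding it into success or failure, keeps the "unverifiable, not wrong" distinction visible to callers that may want to log it.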

Tests

  • reconnect_aware_client_survives_daemon_restart — kill daemon under a live connection, bring a fresh one up, assert the same client heals (this reproduced the bug).
  • plain_connect_does_not_reconnect — guards the opt-in boundary.
  • reconnect_rejects_a_foreign_daemon — stale lockfile points reconnect at a different live daemon; must reject.
  • Stats/Lockfile wire-compat round-trips + root-mismatch unit tests.

Full workspace suite green (270+ tests), clippy clean, rustfmt clean. Targets release v0.5.0.

Closes #7.

🤖 Generated with Claude Code

fix(daemon): self-healing reconnect + root-identity handshake (issue #7)
All checks were successful
CI / cargo fmt (pull_request) Successful in 38s
CI / cargo clippy (pull_request) Successful in 52s
CI / cargo test (pull_request) Successful in 1m33s
7f205d7fbc
A workspace-links user reported the primary MCP server "attaching to a
linked project's daemon", with every tool call then failing
`-32603 sending <op>` until the MCP server was restarted. A deep review
against current code found the headline was a misread (per-root lockfile
discovery is structurally incapable of attaching to a link's port), but
two real, confirmed defects underlie the symptom:

1. RpcIndex cached a single TcpStream with no reconnect/health-recheck:
   once its daemon died, was killed, or restarted onto a new port, every
   subsequent call wrote to a dead socket forever. This is the actual
   cause of the reported outage.
2. The attach handshake performed no root-identity verification — Stats
   and the lockfile payload carried no root — so attach/reconnect could
   in principle latch onto any code-index daemon on a given port.

Fixes (all wire-compatible — new fields are optional + skip_if_none):
- RpcIndex::connect_with_reconnect(root, port): on a connection-level
  failure, drop the dead channel, re-read the lockfile, reconnect to the
  current port, verify the daemon serves `root`, and retry once. Bare
  connect(port) keeps the old no-reconnect behavior (tests, lifetime
  owners). Retry is connection-errors-only — never an RpcError from
  dispatch — and capped at one attempt.
- Add `root` to the stats response (stamped by the daemon dispatch
  layer) and verify it on attach (main.rs) and reconnect (rpc_index.rs).
  A daemon that omits root (pre-0.5.0) is accepted — unverifiable, not
  wrong — so no respawn churn against old daemons.
- Add `root` to the lockfile payload; Lockfile::read rejects a payload
  describing a different project (aliased/symlinked .code-index/).
- Observability: log the full anyhow chain at warn before mapping the
  ~19 MCP tool-handler internal errors (was bare e.to_string(), nothing
  logged); tag attach logs with root (the missing project tag that drove
  the misdiagnosis); raise the daemon transport-failure log to warn.
- resolve_links warns when a link is nested under the primary root.

Tests: reconnect-survives-restart, plain-connect-does-not-reconnect,
reconnect-rejects-a-foreign-daemon (all spin up real daemons), plus
Stats/Lockfile wire-compat + root-mismatch unit tests.

Closes #7.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
buildagent deleted branch fix/issue-7-daemon-reconnect-identity 2026-06-08 16:05:43 +02:00
Reference
h-dv/code-index!8