
maintainer failover can leave in-flight operators inconsistent #4763

@wlwilliamx

Description


What did you do?

Started a TiCDC changefeed on three captures, triggered create/remove/move/split/merge operators, and used failpoints to keep them unfinished so the maintainer would fail over in the middle of scheduling.

A reproducible scenario is now encoded in tests/integration_tests/maintainer_failover_when_operator/run.sh:

  • block dispatcher close and create;
  • trigger add, remove, move, split, and merge;
  • kill the current maintainer before those operators finish;
  • wait for maintainer move and observe bootstrap recovery on the new maintainer.

What did you expect to see?

The new maintainer should rebuild unfinished operator state from bootstrap responses and let the original operations converge:

  • in-flight add/remove/move/split operators continue from persisted scheduling state;
  • in-flight merge operators are restored and either finish or clean up correctly;
  • dropped tables stay dropped, and no duplicate or ghost replicas remain after failover.

What did you see instead?

Bootstrap recovery did not fully cover the in-flight operator state.

ScheduleDispatcherRequest-based recovery still had corner cases around create/remove-related operators, and in-flight MergeDispatcherRequest state was not included in MaintainerBootstrapResponse at all. After maintainer failover, the new maintainer could therefore:

  • lose the merge intent and leave source dispatchers stuck in WaitingMerge or leave a half-created merged dispatcher behind;
  • recreate duplicate tasks because restored operator state and bootstrap spans were reconciled incompletely;
  • reschedule a table that was already dropped or keep orphan dispatchers alive after remove/move/split related failover cases.

In these cases the changefeed may not converge automatically after maintainer failover and may require manual intervention.

Versions of the cluster

Upstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):

Local integration test cluster started by TiCDC integration scripts.
The exact TiDB version was not captured during issue drafting; the bug is in TiCDC maintainer/bootstrap recovery logic and reproduces on current `master`.

Upstream TiKV version (execute `tikv-server --version`):

Local integration test cluster started by TiCDC integration scripts.
The exact TiKV version was not captured during issue drafting; the bug is in TiCDC maintainer/bootstrap recovery logic and reproduces on current `master`.

TiCDC version (execute `cdc version`):

Reproduced on `pingcap/ticdc` `master` at `567506c4fd71adfe9bad80bab7ba79b1f5dc92f9` before the fix in PR #3769.
