What did you do?
Started a TiCDC changefeed on three captures, triggered in-flight create/remove/move/split/merge operators, and kept them unfinished with failpoints so the maintainer would fail over in the middle of scheduling.
A reproducible scenario is now encoded in `tests/integration_tests/maintainer_failover_when_operator/run.sh`:
- block dispatcher close and create;
- trigger add, remove, move, split, and merge;
- kill the current maintainer before those operators finish;
- wait for maintainer move and observe bootstrap recovery on the new maintainer.
What did you expect to see?
The new maintainer should rebuild unfinished operator state from bootstrap responses and let the original operations converge:
- in-flight add/remove/move/split operators continue from persisted scheduling state;
- in-flight merge operators are restored and either finish or clean up correctly;
- dropped tables stay dropped, and no duplicate or ghost replicas remain after failover.
What did you see instead?
Bootstrap recovery did not fully cover the in-flight operator state.
`ScheduleDispatcherRequest`-based recovery still had corner cases around create/remove-related operators, and in-flight `MergeDispatcherRequest` state was not included in `MaintainerBootstrapResponse` at all. After maintainer failover, the new maintainer could therefore:
- lose the merge intent and leave source dispatchers stuck in `WaitingMerge`, or leave a half-created merged dispatcher behind;
- recreate duplicate tasks because restored operator state and bootstrap spans were reconciled incompletely;
- reschedule a table that was already dropped, or keep orphan dispatchers alive after remove/move/split-related failover cases.
In these cases the changefeed may not converge automatically after maintainer failover and may require manual intervention.
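The failure modes above come down to how the new maintainer reconciles restored operator state with the spans reported in bootstrap responses. The following is a minimal Go sketch of that reconciliation idea; every type and function name here is invented for illustration and is not TiCDC's actual API:

```go
package main

import "fmt"

// SpanID identifies a table span; illustrative only.
type SpanID string

// bootstrapSpan is a span reported by a dispatcher in its
// bootstrap response (hypothetical shape, not the real protobuf).
type bootstrapSpan struct {
	id      SpanID
	merging bool // dispatcher is in a WaitingMerge-like state
}

// reconcile rebuilds the scheduling view on the new maintainer:
//   - spans with in-flight merge intent are re-attached to a merge
//     operator instead of losing the intent;
//   - spans still expected by the schema keep their dispatchers;
//   - spans absent from the schema (table already dropped) are
//     removed rather than rescheduled as ghost replicas.
func reconcile(bootstrap []bootstrapSpan, expected map[SpanID]bool) (keep, remerge, drop []SpanID) {
	for _, s := range bootstrap {
		switch {
		case s.merging:
			remerge = append(remerge, s.id)
		case expected[s.id]:
			keep = append(keep, s.id)
		default:
			drop = append(drop, s.id)
		}
	}
	return
}

func main() {
	bootstrap := []bootstrapSpan{{"t1", false}, {"t2", true}, {"t3", false}}
	// t3 was dropped before failover, so it must not be recreated.
	expected := map[SpanID]bool{"t1": true, "t2": true}
	keep, remerge, drop := reconcile(bootstrap, expected)
	fmt.Println(keep, remerge, drop) // → [t1] [t2] [t3]
}
```

The bugs reported here correspond to the `remerge` and `drop` branches being incomplete: merge intent was not carried in the bootstrap response at all, and dropped or duplicate spans were not consistently cleaned up.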
Versions of the cluster
Upstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):
Local integration test cluster started by TiCDC integration scripts.
The exact TiDB version was not captured during issue drafting; the bug is in TiCDC maintainer/bootstrap recovery logic and reproduces on current `master`.
Upstream TiKV version (execute `tikv-server --version`):
Local integration test cluster started by TiCDC integration scripts.
The exact TiKV version was not captured during issue drafting; the bug is in TiCDC maintainer/bootstrap recovery logic and reproduces on current `master`.
TiCDC version (execute `cdc version`):
Reproduced on `pingcap/ticdc` `master` at `567506c4fd71adfe9bad80bab7ba79b1f5dc92f9` before the fix in PR #3769.