What did you do?
Started a TiCDC changefeed on three captures, triggered in-flight create/remove/move/split/merge operators, and kept them unfinished with failpoints so the maintainer would fail over in the middle of scheduling.
A reproducible scenario is now encoded in `tests/integration_tests/maintainer_failover_when_operator/run.sh`:
- block dispatcher close and create;
- trigger add, remove, move, split, and merge;
- kill the current maintainer before those operators finish;
- wait for maintainer move and observe bootstrap recovery on the new maintainer.
What did you expect to see?
The new maintainer should rebuild unfinished operator state from bootstrap responses and let the original operations converge:
- in-flight add/remove/move/split operators continue from persisted scheduling state;
- in-flight merge operators are restored and either finish or clean up correctly;
- dropped tables stay dropped, and no duplicate or ghost replicas remain after failover.
What did you see instead?
Bootstrap recovery did not fully cover the in-flight operator state.
`ScheduleDispatcherRequest`-based recovery still had corner cases around create/remove-related operators, and in-flight `MergeDispatcherRequest` state was not included in `MaintainerBootstrapResponse` at all. After maintainer failover, the new maintainer could therefore:
- lose the merge intent and leave source dispatchers stuck in `WaitingMerge`, or leave a half-created merged dispatcher behind;
- recreate duplicate tasks because restored operator state and bootstrap spans were reconciled incompletely;
- reschedule a table that was already dropped, or keep orphan dispatchers alive after remove/move/split-related failover cases.
In these cases the changefeed may not converge automatically after maintainer failover and may require manual intervention.
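The failure modes above come down to how the new maintainer reconciles restored operator state with the spans reported in bootstrap responses. The following is a minimal Go sketch of that reconciliation idea; every type and function name here is invented for illustration and is not TiCDC's actual API:

```go
package main

import "fmt"

// SpanID identifies a table span; illustrative only.
type SpanID string

// bootstrapSpan is a span reported by a dispatcher in its
// bootstrap response (hypothetical shape, not the real protobuf).
type bootstrapSpan struct {
	id      SpanID
	merging bool // dispatcher is in a WaitingMerge-like state
}

// reconcile rebuilds the scheduling view on the new maintainer:
//   - spans with in-flight merge intent are re-attached to a merge
//     operator instead of losing the intent;
//   - spans still expected by the schema keep their dispatchers;
//   - spans absent from the schema (table already dropped) are
//     removed rather than rescheduled as ghost replicas.
func reconcile(bootstrap []bootstrapSpan, expected map[SpanID]bool) (keep, remerge, drop []SpanID) {
	for _, s := range bootstrap {
		switch {
		case s.merging:
			remerge = append(remerge, s.id)
		case expected[s.id]:
			keep = append(keep, s.id)
		default:
			drop = append(drop, s.id)
		}
	}
	return
}

func main() {
	bootstrap := []bootstrapSpan{{"t1", false}, {"t2", true}, {"t3", false}}
	// t3 was dropped before failover, so it must not be recreated.
	expected := map[SpanID]bool{"t1": true, "t2": true}
	keep, remerge, drop := reconcile(bootstrap, expected)
	fmt.Println(keep, remerge, drop) // → [t1] [t2] [t3]
}
```

The bugs reported here correspond to the `remerge` and `drop` branches being incomplete: merge intent was not carried in the bootstrap response at all, and dropped or duplicate spans were not consistently cleaned up.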
Versions of the cluster
Upstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):
Local integration test cluster started by TiCDC integration scripts.
The exact TiDB version was not captured during issue drafting; the bug is in TiCDC maintainer/bootstrap recovery logic and reproduces on current `master`.
Upstream TiKV version (execute `tikv-server --version`):
Local integration test cluster started by TiCDC integration scripts.
The exact TiKV version was not captured during issue drafting; the bug is in TiCDC maintainer/bootstrap recovery logic and reproduces on current `master`.
TiCDC version (execute `cdc version`):
Reproduced on `pingcap/ticdc` `master` at `567506c4fd71adfe9bad80bab7ba79b1f5dc92f9` before the fix in PR #3769.