
downstreamadapter: preserve remove upgrade during close#4815

Open
hongyunyan wants to merge 16 commits into pingcap:master from
hongyunyan:hyy/dispatcher-close-remove-upgrade

Conversation

@hongyunyan
Collaborator

@hongyunyan hongyunyan commented Apr 13, 2026

What problem does this PR solve?

Issue Number: close #4825

What is changed and how it works?

Background

The dispatcher orchestrator de-duplicates close requests by changefeed and message type. A later removed=true close request could overwrite the pending entry for the same key while the earlier removed=false request was already in flight.

Motivation

That overwrite was not re-queued, and Done(key) later deleted the upgraded request together with the in-flight one. As a result, the stronger remove semantics could be dropped silently and the downstream cleanup path would only execute the normal close flow.

Summary of changes

  • split the pending close queue state into queued and inFlight slots so an upgrade never overwrites the request a worker is already processing
  • re-queue upgraded removed=true close requests for the next round and keep the in-flight request stable until Done
  • make DispatcherManager keep removed=true as a sticky close requirement and finish remove-only cleanup even after the base close path has completed
  • add a retryable MySQL remove cleanup helper so ddl_ts cleanup still works after the normal sink close releases the long-lived DB handle
  • add regression tests for both the queue upgrade timing and the post-close remove cleanup path

Validation

  • make fmt
  • go test ./downstreamadapter/dispatcherorchestrator
  • go test ./downstreamadapter/dispatchermanager
  • go test --tags=intest ./downstreamadapter/sink/mysql

Summary by CodeRabbit

Release Notes

Bug Fixes

  • Enhanced changefeed removal and cleanup processes to properly handle late removal requests and ensure cleanup completes even after initial shutdown
  • Improved message queue dequeue operation to return message content directly, enabling more reliable request processing and ordering
  • Refined message queue behavior for more predictable duplicate prevention and request ordering during queue operations

@ti-chi-bot

ti-chi-bot bot commented Apr 13, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue and do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels.) labels Apr 13, 2026
@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Warning

Rate limit exceeded

@hongyunyan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 32 minutes and 37 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 32 minutes and 37 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1d42ab77-cf1a-4762-ac45-aed9ed8f656b

📥 Commits

Reviewing files that changed from the base of the PR and between 2119bf8 and 8df3eba.

📒 Files selected for processing (20)
  • cmd/storage-consumer/main.go
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • downstreamadapter/dispatcherorchestrator/helper.go
  • downstreamadapter/sink/blackhole/sink.go
  • downstreamadapter/sink/cloudstorage/dml_writers_test.go
  • downstreamadapter/sink/cloudstorage/sink.go
  • downstreamadapter/sink/cloudstorage/sink_test.go
  • downstreamadapter/sink/kafka/sink.go
  • downstreamadapter/sink/kafka/sink_test.go
  • downstreamadapter/sink/mock/sink_mock.go
  • downstreamadapter/sink/mysql/sink.go
  • downstreamadapter/sink/mysql/sink_test.go
  • downstreamadapter/sink/pulsar/sink.go
  • downstreamadapter/sink/redo/sink.go
  • downstreamadapter/sink/redo/sink_test.go
  • downstreamadapter/sink/sink.go
  • pkg/applier/redo.go
📝 Walkthrough

Walkthrough

This PR refactors changefeed removal cleanup to execute asynchronously using atomic flags and scheduled background tasks. The pending message queue semantics are updated to return messages directly from Pop(), eliminating separate Get/Done calls. MySQL sink cleanup is extracted into a dedicated method, and error handling is improved across the dispatcher management layer.

Changes

Cohort / File(s) Summary
Dispatcher Manager Cleanup Refactoring
downstreamadapter/dispatchermanager/dispatcher_manager.go, downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
Introduces atomic flags (removeChangefeedRequested, removeChangefeedCleaned, removeChangefeedCleanupRunning) to track async cleanup state. TryClose() now records removal requests and schedules cleanup separately. New tryScheduleRemoveChangefeedCleanup() and runRemoveChangefeedCleanup() methods handle removal-only cleanup via redo metadata or MySQL sink. closeRedoMeta() now returns error and propagates cleanup failures.
Pending Message Queue Refactoring
downstreamadapter/dispatcherorchestrator/helper.go, downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
Changes queue dequeue workflow: replaces separate Pop(), Get(), Done() methods with unified Pop(key, msg, ok) signature that deletes the key under lock and returns the message directly. Updates message consumption to use returned msg directly, removing Done() calls.
MySQL Sink Cleanup Method
downstreamadapter/sink/mysql/sink.go
Adds CleanupRemovedChangefeed() method for DDL timestamp cleanup using short-lived DB connections. Updates Close(removeChangefeed bool) to call this new helper instead of inline ddlWriter.RemoveDDLTsItem().
Test Updates
downstreamadapter/dispatchermanager/dispatcher_manager_test.go, downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
Adds new test TestTryCloseRemovedRequestAfterClosedReturnsImmediatelyAndTriggersCleanup for async cleanup verification. Renames and updates six pending message queue tests to work with new Pop() signature, removing Get/Done usage and validating returned message stability through queue upgrades.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DispatcherManager
    participant AtomicFlags
    participant Scheduler
    participant RedoMeta
    participant MySQLSink
    
    User->>DispatcherManager: TryClose(removeChangefeed=true)
    DispatcherManager->>AtomicFlags: Set removeChangefeedRequested=true
    DispatcherManager->>Scheduler: Schedule cleanup task
    DispatcherManager->>User: Return success
    
    Scheduler->>Scheduler: Wait for removal condition
    Scheduler->>AtomicFlags: Check removeChangefeedRequested & !removeChangefeedCleaned
    alt Redo Mode
        Scheduler->>RedoMeta: closeRedoMeta(removeChangefeed=true)
        RedoMeta->>RedoMeta: Cleanup(context.Background())
        RedoMeta-->>Scheduler: Return error or nil
    else MySQL Sink
        Scheduler->>MySQLSink: CleanupRemovedChangefeed()
        MySQLSink->>MySQLSink: Create temp DB connection
        MySQLSink->>MySQLSink: RemoveDDLTsItem()
        MySQLSink-->>Scheduler: Return result
    end
    
    Scheduler->>AtomicFlags: Set removeChangefeedCleaned=true

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

lgtm, approved, release-note, size/L

Suggested reviewers

  • wk989898
  • lidezhu
  • 3AceShowHand
  • wlwilliamx

Poem

🐰 Hops of joy for cleanup's new way,
Async flags guide the path today!
Queues return messages fresh and bright,
MySQL cleanup shines the light.
Error checks prevent dismay—
Better dispatcher code at play!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 9.09%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check — ⚠️ Warning: the PR description is largely complete (issue number, detailed problem statement, motivation, and summary of changes), but the release-note section is missing. Resolution: add a release note following the Release Notes Language Style Guide; if no release note is needed, explicitly state 'None' in the release-note section.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: the title 'downstreamadapter: preserve remove upgrade during close' directly and specifically addresses the core change: ensuring that close requests with removed=true are preserved and not lost when upgrading from removed=false.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 13, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the changefeed closure and removal logic to ensure that cleanup tasks, such as redo metadata and sink-specific state, are handled reliably even if requested after the initial close. It introduces a state-tracking mechanism in the pendingMessageQueue to manage in-flight and queued requests separately. Feedback identifies a potential premature exit in the Pop method of the message queue when encountering stale entries and a possible resource leak because closeRedoMeta is no longer called during a normal close path.

Comment on lines 119 to 135
 func (q *pendingMessageQueue) Pop() (pendingMessageKey, bool) {
-	return q.queue.Get()
+	key, ok := q.queue.Get()
+	if !ok {
+		return pendingMessageKey{}, false
+	}
+
+	q.mu.Lock()
+	defer q.mu.Unlock()
+
+	state := q.pending[key]
+	if state == nil || state.queued == nil {
+		return pendingMessageKey{}, false
+	}
+	state.inFlight = state.queued
+	state.queued = nil
+	return key, true
 }

high

Returning false from Pop when state.queued is nil (but the channel is not closed) can cause the orchestrator loop to exit prematurely, as it typically interprets ok=false as a signal that the queue is closed. While the current TryEnqueue logic is designed to prevent this state, it is safer to handle stale entries by continuing to the next key in the channel rather than returning false.

Suggested change

Replace:

func (q *pendingMessageQueue) Pop() (pendingMessageKey, bool) {
	key, ok := q.queue.Get()
	if !ok {
		return pendingMessageKey{}, false
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	state := q.pending[key]
	if state == nil || state.queued == nil {
		return pendingMessageKey{}, false
	}
	state.inFlight = state.queued
	state.queued = nil
	return key, true
}

with:

func (q *pendingMessageQueue) Pop() (pendingMessageKey, bool) {
	for {
		key, ok := q.queue.Get()
		if !ok {
			return pendingMessageKey{}, false
		}
		q.mu.Lock()
		state := q.pending[key]
		if state != nil && state.queued != nil {
			state.inFlight = state.queued
			state.queued = nil
			q.mu.Unlock()
			return key, true
		}
		q.mu.Unlock()
	}
}

Comment on lines 916 to 918
 if e.IsRedoEnabled() {
-	e.redoSink.Close(removeChangefeed)
-	// FIXME: cleanup redo log when remove the changefeed
-	e.closeRedoMeta(removeChangefeed)
+	e.redoSink.Close(false)
 }

medium

The call to e.closeRedoMeta(removeChangefeed) was previously unconditional when redo was enabled. In the new implementation, closeRedoMeta(true) is only called inside cleanupRemovedChangefeed, which is skipped if removeChangefeedRequested is false. This means for a normal close, closeRedoMeta is never called. If closeRedoMeta(false) is required to release resources or stop background tasks associated with redo metadata, this change could lead to resource leaks.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
downstreamadapter/sink/mysql/sink.go (2)

446-446: Wrap the return error with errors.Trace().

♻️ Proposed fix
-	return cleanupWriter.RemoveDDLTsItem()
+	return errors.Trace(cleanupWriter.RemoveDDLTsItem())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/sink/mysql/sink.go` at line 446, The return value from
cleanupWriter.RemoveDDLTsItem() should be wrapped with errors.Trace() before
returning to preserve stack context; update the return in the function that
currently does "return cleanupWriter.RemoveDDLTsItem()" to instead return
errors.Trace(cleanupWriter.RemoveDDLTsItem()) so the error is traced (ensure
errors is the package providing Trace is imported and used consistently in this
file).

430-437: Wrap errors from library calls with errors.Trace().

Per coding guidelines, errors from third-party or library calls should be wrapped immediately to attach a stack trace for debugging.

♻️ Proposed fix
 	dsnStr, err := mysql.GenerateDSN(context.Background(), s.cfg)
 	if err != nil {
-		return err
+		return errors.Trace(err)
 	}
 	db, err := mysql.CreateMysqlDBConn(dsnStr)
 	if err != nil {
-		return err
+		return errors.Trace(err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/sink/mysql/sink.go` around lines 430 - 437, Wrap returned
errors from the library calls in errors.Trace before returning: when calling
mysql.GenerateDSN and mysql.CreateMysqlDBConn in the current function, replace
direct returns of err with returning errors.Trace(err) so both GenerateDSN and
CreateMysqlDBConn failures are wrapped with a stack trace (refer to the
mysql.GenerateDSN and mysql.CreateMysqlDBConn call sites).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@downstreamadapter/sink/mysql/sink.go`:
- Around line 445-446: The cleanupWriter created via mysql.NewWriter currently
isn't closed, leaking its internal resources; ensure you call
cleanupWriter.Close() (e.g., defer cleanupWriter.Close() immediately after
creating cleanupWriter or immediately after calling RemoveDDLTsItem()) so its
statement cache, ticker, context cancellation and DML session are
released—mirror how dmlWriter and ddlWriter are handled in the Close() method
and keep the call adjacent to the NewWriter/RemoveDDLTsItem usage.

---

Nitpick comments:
In `@downstreamadapter/sink/mysql/sink.go`:
- Line 446: The return value from cleanupWriter.RemoveDDLTsItem() should be
wrapped with errors.Trace() before returning to preserve stack context; update
the return in the function that currently does "return
cleanupWriter.RemoveDDLTsItem()" to instead return
errors.Trace(cleanupWriter.RemoveDDLTsItem()) so the error is traced (ensure
errors is the package providing Trace is imported and used consistently in this
file).
- Around line 430-437: Wrap returned errors from the library calls in
errors.Trace before returning: when calling mysql.GenerateDSN and
mysql.CreateMysqlDBConn in the current function, replace direct returns of err
with returning errors.Trace(err) so both GenerateDSN and CreateMysqlDBConn
failures are wrapped with a stack trace (refer to the mysql.GenerateDSN and
mysql.CreateMysqlDBConn call sites).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 385a1c47-a65f-44c0-8db4-781b7b87bb72

📥 Commits

Reviewing files that changed from the base of the PR and between 0a418b4 and 7250c59.

📒 Files selected for processing (5)
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • downstreamadapter/dispatcherorchestrator/helper.go
  • downstreamadapter/sink/mysql/sink.go

Comment thread downstreamadapter/sink/mysql/sink.go
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 14, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go (1)

146-159: Consider: Fragile reflection-based assertion.

The reflect.TypeOf(*state).NumField() == 1 assertion will break if the pendingQueueState struct gains additional fields. While this documents the current contract, consider adding a comment explaining the intent, or replace with a direct field check if the goal is just to verify queued is set correctly.

💡 Alternative approach
 	q.mu.Lock()
 	state := q.pending[key]
 	q.mu.Unlock()
 	require.NotNil(t, state)
-	require.Equal(t, 1, reflect.TypeOf(*state).NumField())
+	// Verify only the queued field is populated with the new message.
+	// This documents that Pop removes the entry and TryEnqueue creates fresh state.
 	require.Same(t, msg2, state.queued)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go`
around lines 146 - 159, The test uses a fragile reflection assertion
reflect.TypeOf(*state).NumField() == 1 on the pendingQueueState struct; replace
this with a direct check of the field(s) you care about (e.g., verify
state.queued is set via require.NotNil/require.Same on the queued field) or, if
you must keep the intent as a contract, add a clear comment on pendingQueueState
and why the field-count matters; update the test around TryEnqueue,
pending[key], pendingQueueState and queued to remove the reflection-based
NumField check and assert the queued field directly (or document the contract)
instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go`:
- Around line 146-159: The test uses a fragile reflection assertion
reflect.TypeOf(*state).NumField() == 1 on the pendingQueueState struct; replace
this with a direct check of the field(s) you care about (e.g., verify
state.queued is set via require.NotNil/require.Same on the queued field) or, if
you must keep the intent as a contract, add a clear comment on pendingQueueState
and why the field-count matters; update the test around TryEnqueue,
pending[key], pendingQueueState and queued to remove the reflection-based
NumField check and assert the queued field directly (or document the contract)
instead.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5bf6bff2-8b08-4b0b-83ef-8122edd07d9a

📥 Commits

Reviewing files that changed from the base of the PR and between 7250c59 and 5d92e4d.

📒 Files selected for processing (19)
  • cmd/storage-consumer/main.go
  • downstreamadapter/dispatcher/mock_sink_helper_test.go
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • downstreamadapter/dispatcherorchestrator/helper.go
  • downstreamadapter/sink/blackhole/sink.go
  • downstreamadapter/sink/cloudstorage/dml_writers_test.go
  • downstreamadapter/sink/cloudstorage/sink.go
  • downstreamadapter/sink/cloudstorage/sink_test.go
  • downstreamadapter/sink/kafka/sink.go
  • downstreamadapter/sink/kafka/sink_test.go
  • downstreamadapter/sink/mock/sink_mock.go
  • downstreamadapter/sink/mysql/sink.go
  • downstreamadapter/sink/mysql/sink_test.go
  • downstreamadapter/sink/pulsar/sink.go
  • downstreamadapter/sink/redo/sink.go
  • downstreamadapter/sink/sink.go
✅ Files skipped from review due to trivial changes (2)
  • downstreamadapter/sink/cloudstorage/dml_writers_test.go
  • downstreamadapter/sink/cloudstorage/sink_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatcherorchestrator/helper.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 14, 2026
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 14, 2026
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 14, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go (1)

82-100: Prefer deterministic synchronization over time.Sleep in this test.

Using time.Sleep at Line 92 can introduce flakiness under slow CI scheduling; use a start signal from the goroutine instead.

Suggested deterministic refactor
 func TestPendingMessageQueue_PopReturnsAfterClose(t *testing.T) {
 	t.Parallel()

 	q := newPendingMessageQueue()
 	doneCh := make(chan bool, 1)
+	startedCh := make(chan struct{})
 	go func() {
+		close(startedCh)
 		_, _, ok := q.Pop()
 		doneCh <- ok
 	}()

-	time.Sleep(10 * time.Millisecond)
+	<-startedCh
 	q.Close()

 	select {
 	case ok := <-doneCh:
 		require.False(t, ok)
 	case <-time.After(time.Second):
 		require.FailNow(t, "Pop did not return after context cancel")
 	}
 }

As per coding guidelines, **/*_test.go: “favor deterministic tests and use testify/require”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go`
around lines 82 - 100, Replace the non-deterministic time.Sleep in
TestPendingMessageQueue_PopReturnsAfterClose with an explicit start signal from
the goroutine: create a startedCh (chan struct{}) that the goroutine closes or
sends to immediately after it begins and before calling q.Pop(), then have the
test wait on startedCh (with require.Eventually or a simple receive) prior to
calling q.Close(); keep assertions on the returned ok from q.Pop() the same and
continue using require for checks, referencing
TestPendingMessageQueue_PopReturnsAfterClose, newPendingMessageQueue, Pop, and
Close to locate where to add startedCh and the synchronization receive/send.
downstreamadapter/dispatchermanager/dispatcher_manager.go (1)

941-963: Consider adding a retry mechanism for failed remove cleanup.

When runRemoveChangefeedCleanup() fails, the error is logged but removeChangefeedCleaned stays false. Since the orchestrator (per context snippet 3) deletes the manager reference after TryClose returns true, there's no path to retry the cleanup unless TryClose(true) is called externally again before deletion.

This may be acceptable for "best-effort" semantics, but if ddl_ts cleanup is critical for correctness, consider:

  1. Adding a bounded retry loop within the goroutine, or
  2. Returning a cleanup error channel that callers can monitor
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatchermanager/dispatcher_manager.go` around lines 941 -
963, The current tryScheduleRemoveChangefeedCleanup goroutine logs failures from
runRemoveChangefeedCleanup but never retries, leaving removeChangefeedCleaned
false and preventing future attempts; change tryScheduleRemoveChangefeedCleanup
to perform a bounded retry loop (e.g., maxAttempts with exponential/backoff
sleep) inside the goroutine while respecting removeChangefeedRequested and using
removeChangefeedCleanupRunning to guard concurrent runs, and set
removeChangefeedCleaned only on successful completion; alternatively, add an
exported cleanup error channel or callback that
tryScheduleRemoveChangefeedCleanup sends failures to (so callers of
TryClose/TryClose(true) can observe and re-trigger) — update references to
runRemoveChangefeedCleanup, removeChangefeedCleanupRunning,
removeChangefeedCleaned, and removeChangefeedRequested accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@downstreamadapter/dispatchermanager/dispatcher_manager.go`:
- Around line 941-963: The current tryScheduleRemoveChangefeedCleanup goroutine
logs failures from runRemoveChangefeedCleanup but never retries, leaving
removeChangefeedCleaned false and preventing future attempts; change
tryScheduleRemoveChangefeedCleanup to perform a bounded retry loop (e.g.,
maxAttempts with exponential/backoff sleep) inside the goroutine while
respecting removeChangefeedRequested and using removeChangefeedCleanupRunning to
guard concurrent runs, and set removeChangefeedCleaned only on successful
completion; alternatively, add an exported cleanup error channel or callback
that tryScheduleRemoveChangefeedCleanup sends failures to (so callers of
TryClose/TryClose(true) can observe and re-trigger) — update references to
runRemoveChangefeedCleanup, removeChangefeedCleanupRunning,
removeChangefeedCleaned, and removeChangefeedRequested accordingly.

In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go`:
- Around line 82-100: Replace the non-deterministic time.Sleep in
TestPendingMessageQueue_PopReturnsAfterClose with an explicit start signal from
the goroutine: create a startedCh (chan struct{}) that the goroutine closes or
sends to immediately after it begins and before calling q.Pop(), then have the
test wait on startedCh (with require.Eventually or a simple receive) prior to
calling q.Close(); keep assertions on the returned ok from q.Pop() the same and
continue using require for checks, referencing
TestPendingMessageQueue_PopReturnsAfterClose, newPendingMessageQueue, Pop, and
Close to locate where to add startedCh and the synchronization receive/send.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85b8ed1d-eee7-4997-9a16-dba4441cddce

📥 Commits

Reviewing files that changed from the base of the PR and between f7c1904 and 2119bf8.

📒 Files selected for processing (6)
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • downstreamadapter/dispatcherorchestrator/helper.go
  • downstreamadapter/sink/mysql/sink.go
✅ Files skipped from review due to trivial changes (1)
  • downstreamadapter/dispatcherorchestrator/helper.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • downstreamadapter/sink/mysql/sink.go

if e.IsRedoEnabled() {
	e.redoSink.Close(removeChangefeed)
	// FIXME: cleanup redo log when remove the changefeed
	e.closeRedoMeta(removeChangefeed)
Collaborator


Why remove this?

Collaborator Author


It moves into runRemoveChangefeedCleanup, just like remove ddl_ts record.

 func (m *DispatcherOrchestrator) handleMessages() {
 	for {
-		key, ok := m.msgQueue.Pop()
+		_, msg, ok := m.msgQueue.Pop()
Collaborator


Suggested change
-		_, msg, ok := m.msgQueue.Pop()
+		msg, ok := m.msgQueue.Pop()

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Apr 15, 2026
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 16, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lidezhu, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Apr 16, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-15 08:58:22.885846307 +0000 UTC m=+1551508.091206364: ☑️ agreed by wk989898.
  • 2026-04-16 01:19:40.56462537 +0000 UTC m=+1610385.769985417: ☑️ agreed by lidezhu.


Labels

approved do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. lgtm size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

downstreamadapter: removed=true close upgrade can be lost during request de-duplication

3 participants