Skip to content

cut over not retrying ? #971

@babinomec

Description

@babinomec

Hi,

When running gh-ost, the cut-over part often fails due to timeout (unfortunately our slaves often have some lags / spikes of lag). This is not a problem in itself but it does not seem to wait and retry at a later time which I think it is supposed to do.

I am running version 1.1.1

(redacted command and log)

This is what I am running (alter just adds some columns) :

./gh-ost \
  --user="ghost" \
  --password="********" \
  --host="slave" \
  --allow-master-master \
  --assume-master-host="master" \
  --database="db" \
  --table="table" \
  --alter="alter" \
  --verbose \
  --skip-foreign-key-checks \
  --skip-strict-mode \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --exact-rowcount \
  --ok-to-drop-table \
  --max-lag-millis=10000 \
  --execute

Here is the output at the cut-over phase :

Copy: 14327000/14362256 99.8%; Applied: 12041; Backlog: 0/1000; Time: 1h16m10s(total), 1h15m58s(copy); streamer: bin-log.005490:906474118; Lag: 2.68s, State: migrating; ETA: 11s
Copy: 14334000/14362256 99.8%; Applied: 12047; Backlog: 2/1000; Time: 1h16m11s(total), 1h15m59s(copy); streamer: bin-log.005490:920606200; Lag: 2.38s, State: migrating; ETA: 8s
Copy: 14342000/14362256 99.9%; Applied: 12053; Backlog: 0/1000; Time: 1h16m12s(total), 1h16m0s(copy); streamer: bin-log.005490:934403718; Lag: 1.78s, State: migrating; ETA: 6s
Copy: 14350000/14362256 99.9%; Applied: 12055; Backlog: 0/1000; Time: 1h16m13s(total), 1h16m1s(copy); streamer: bin-log.005490:947459950; Lag: 1.48s, State: migrating; ETA: 3s
Copy: 14359000/14362256 100.0%; Applied: 12058; Backlog: 0/1000; Time: 1h16m14s(total), 1h16m2s(copy); streamer: bin-log.005490:961146997; Lag: 1.18s, State: migrating; ETA: 1s
2021-05-18 18:18:18 INFO Row copy complete
Copy: 14362256/14362256 100.0%; Applied: 12060; Backlog: 0/1000; Time: 1h16m14s(total), 1h16m2s(copy); streamer: bin-log.005490:966837215; Lag: 1.08s, State: migrating; ETA: due
2021-05-18 18:18:18 INFO Grabbing voluntary lock: gh-ost.1024388992.lock
2021-05-18 18:18:18 INFO Setting LOCK timeout as 6 seconds
2021-05-18 18:18:18 INFO Looking for magic cut-over table
2021-05-18 18:18:18 INFO Creating magic cut-over table `db`.`_table_del`
2021-05-18 18:18:18 INFO Magic cut-over table created
2021-05-18 18:18:18 INFO Locking `db`.`table`, `db`.`_table_del`
2021-05-18 18:18:18 INFO Tables locked
2021-05-18 18:18:18 INFO Session locking original & magic tables is 1024388992
2021-05-18 18:18:18 INFO Writing changelog state: AllEventsUpToLockProcessed:1621354698945390596
2021-05-18 18:18:18 INFO Waiting for events up to lock
Copy: 14362256/14362256 100.0%; Applied: 12060; Backlog: 0/1000; Time: 1h16m15s(total), 1h16m2s(copy); streamer: bin-log.005490:970667197; Lag: 1.28s, State: migrating; ETA: due
2021-05-18 18:18:21 ERROR Timeout while waiting for events up to lock
2021-05-18 18:18:21 ERROR 2021-05-18 18:18:21 ERROR Timeout while waiting for events up to lock
2021-05-18 18:18:21 INFO Looking for magic cut-over table
2021-05-18 18:18:21 INFO Will now proceed to drop magic table and unlock tables
2021-05-18 18:18:21 INFO Dropping magic cut-over table
2021-05-18 18:18:21 INFO Dropping magic cut-over table
2021-05-18 18:18:21 INFO Dropping table `db`.`_table_del`
2021-05-18 18:18:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2021-05-18 18:18:24 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 1/1000; Time: 1h16m20s(total), 1h16m2s(copy); streamer: bin-log.005490:989474686; Lag: 3.18s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m25s(total), 1h16m2s(copy); streamer: bin-log.005490:1014763533; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m30s(total), 1h16m2s(copy); streamer: bin-log.005490:1026013234; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m35s(total), 1h16m2s(copy); streamer: bin-log.005490:1036196578; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m40s(total), 1h16m2s(copy); streamer: bin-log.005490:1047207998; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m45s(total), 1h16m2s(copy); streamer: bin-log.005490:1057140917; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m50s(total), 1h16m2s(copy); streamer: bin-log.005490:1062986858; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m55s(total), 1h16m2s(copy); streamer: bin-log.005490:1064318221; Lag: 4.48s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m0s(total), 1h16m2s(copy); streamer: bin-log.005490:1073646385; Lag: 4.78s, State: migrating; ETA: due
[2021/05/18 18:19:04] [info] binlogsyncer.go:723 rotate to (bin-log.005491, 4)
[2021/05/18 18:19:04] [info] binlogsyncer.go:723 rotate to (bin-log.005491, 4)
2021-05-18 18:19:04 INFO rotate to next log from bin-log.005491:1073742500 to bin-log.005491
2021-05-18 18:19:04 INFO rotate to next log from bin-log.005491:0 to bin-log.005491
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m5s(total), 1h16m2s(copy); streamer: bin-log.005491:14353974; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m10s(total), 1h16m2s(copy); streamer: bin-log.005491:26928263; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m30s(total), 1h16m2s(copy); streamer: bin-log.005491:64743000; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h18m0s(total), 1h16m2s(copy); streamer: bin-log.005491:118334528; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h18m30s(total), 1h16m2s(copy); streamer: bin-log.005491:176850550; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h19m0s(total), 1h16m2s(copy); streamer: bin-log.005491:236619046; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h19m30s(total), 1h16m2s(copy); streamer: bin-log.005491:292654898; Lag: 0.08s, State: migrating; ETA: due
# Migrating `db`.`table`; Ghost table is `db`.`_table_gho`
# Migrating master:3306; inspecting slave:3306; executing on slave
# Migration started at Tue May 18 17:02:04 +0200 2021
# chunk-size: 1000; max-lag-millis: 10000ms; dml-batch-size: 10; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# Serving on unix socket: /tmp/gh-ost.db.table.sock
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h20m0s(total), 1h16m2s(copy); streamer: bin-log.005491:359697150; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h20m30s(total), 1h16m2s(copy); streamer: bin-log.005491:427123115; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h21m0s(total), 1h16m2s(copy); streamer: bin-log.005491:498372220; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h21m30s(total), 1h16m2s(copy); streamer: bin-log.005491:564757429; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h22m0s(total), 1h16m2s(copy); streamer: bin-log.005491:627971561; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h22m30s(total), 1h16m2s(copy); streamer: bin-log.005491:700060664; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h23m0s(total), 1h16m2s(copy); streamer: bin-log.005491:762556638; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h23m30s(total), 1h16m2s(copy); streamer: bin-log.005491:830434663; Lag: 1.18s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h24m0s(total), 1h16m2s(copy); streamer: bin-log.005491:898773371; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h24m30s(total), 1h16m2s(copy); streamer: bin-log.005491:972940697; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h25m0s(total), 1h16m2s(copy); streamer: bin-log.005491:1033621038; Lag: 0.08s, State: migrating; ETA: due

And it goes on like that even when lag is low and that cut-over could happen. I ctrl-c-ed it after 2h50m.

Any clue on what may be happening and how to fix it ?

Thanks

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions