Description
Running gh-ost v1.0.48 (also reproduced on v1.0.49) against 5.7.25-28-log Percona Server (GPL).
Problem
After running a migration we saw a significant amount of data loss, which forced us to roll back to the previous table.
Debugging
While looking through the gh-ost logs to determine what exactly had happened, I came across the following.
19:45:31.422 - Copy: 17051612/298133316 5.7%; Applied: 57018; Backlog: 1000/1000; Time: 2h29m30s(total), 2h29m30s(copy); streamer: vt-0153850400-bin.013298:1024625; State: migrating; ETA: 41h4m25s
19:45:45.178 - INFO Row copy complete
19:45:45.179 - INFO executing gh-ost-on-before-cut-over hook: <path>
19:45:45.179 - Copy: 17051612/17051612 100.0%; Applied: 57018; Backlog: 997/1000; Time: 2h29m44s(total), 2h29m43s(copy); streamer: vt-0153850400-bin.013298:1037048; State: migrating; ETA: due
19:45:45.855 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 4)
19:45:45.855 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 4)
19:45:46.422 - Copy: 17051612/17051612 100.0%; Applied: 57448; Backlog: 1000/1000; Time: 2h29m45s(total), 2h29m43s(copy); streamer: vt-0153850400-bin.013298:2109787; State: postponing cut-over; ETA: due
19:45:46.963 - [error] binlogsyncer.go:631 connection was bad
19:45:47.963 - [info] binlogsyncer.go:573 begin to re-sync from (vt-0153850400-bin.013299, 1102571)
19:45:47.963 - [info] binlogsyncer.go:203 register slave for master server 172.16.172.155:3306
19:45:47.964 - [info] binlogsyncer.go:237 kill last connection id 290712
19:45:47.964 - [error] binlogsyncer.go:231 kill connection 290712 error ERROR 1094 (HY000): Unknown thread id: 290712
19:45:47.968 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 1102571)
As you can see, we jumped from 5.7% complete (17,051,612/298,133,316 rows copied, ~41 hours remaining) to row copy complete and postponing cut-over in just 15 seconds. This table has ~300 million rows.
Reproducing
After a bit of debugging, I noticed some queries hitting our 300-second server execution timeout and being killed by the server milliseconds before the copy "completed". I then ran SET GLOBAL MAX_EXECUTION_TIME=1;
on a test database so that any SELECT would time out instantly, and reran gh-ost against that test DB. The migration was immediately marked as postponing cut-over, even though the test table has 100k rows in it (a standalone sketch of the same failure mode follows the status line below).
Copy: 0/0 100.0%; Applied: 0; Backlog: 0/1000; Time: 2m25s(total), 1s(copy); streamer: vt-0059014500-bin.012257:233423; State: postponing cut-over; ETA: due
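For reference, here is a minimal standalone Go sketch of what I believe is happening, using database/sql with go-sql-driver/mysql. The DSN and table name are hypothetical, not from our setup. The point it illustrates is a documented database/sql contract: if the server interrupts a statement after the result set header has been sent, Query can return successfully, Next() simply returns false, and the timeout error is only visible via rows.Err().

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Hypothetical DSN against a throwaway test database.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Force SELECTs to be killed by the server almost immediately,
	// mirroring the SET GLOBAL MAX_EXECUTION_TIME=1 reproduction above.
	if _, err := db.Exec("SET SESSION MAX_EXECUTION_TIME=1"); err != nil {
		log.Fatal(err)
	}

	// Hypothetical large table, so the statement cannot finish within 1ms.
	rows, err := db.Query("SELECT id FROM big_table ORDER BY id")
	if err != nil {
		// The timeout may be reported here...
		log.Fatalf("Query: %v", err)
	}
	defer rows.Close()

	count := 0
	for rows.Next() {
		count++
	}
	// ...but if the server kills the query only after the result set header
	// has gone out, Query succeeds, the loop above reads zero rows, and the
	// error is only surfaced here. Without this check, a killed query is
	// indistinguishable from an empty result set.
	if err := rows.Err(); err != nil {
		log.Fatalf("rows.Err: %v", err)
	}
	fmt.Printf("rows read: %d\n", count)
}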
gh-ost
-assume-rbr
-max-load=Threads_running=100
-critical-load=Threads_running=1000
-verbose
-initially-drop-ghost-table
-initially-drop-old-table
-hooks-path=/usr/local/bin/hooks
-postpone-cut-over-flag-file=/tmp/ghost-cutover-postpone
-throttle-control-replicas="REPLICA_HOSTS"
-serve-socket-file=/tmp/gh-ost.sock
-approve-renamed-columns
-database="KEYSPACE"
-table="TABLE"
-host="HOST"
-conf="MYSQL_CONF_PATH"
-alter="ALTER"
-exact-rowcount
-debug
-execute
I looked into this a little further, and it looks like the go-sql-driver query path used by gh-ost just doesn't handle this kind of server-side timeout: the killed query is treated as an empty result set, which signals there is nothing left to copy, since hasFurtherRange remains false when no rows are returned [1].
[1] Lines 413 to 454 in a8fae98:
func (this *Applier) CalculateNextIterationRangeEndValues() (hasFurtherRange bool, err error) {
	this.migrationContext.MigrationIterationRangeMinValues = this.migrationContext.MigrationIterationRangeMaxValues
	if this.migrationContext.MigrationIterationRangeMinValues == nil {
		this.migrationContext.MigrationIterationRangeMinValues = this.migrationContext.MigrationRangeMinValues
	}
	for i := 0; i < 2; i++ {
		buildFunc := sql.BuildUniqueKeyRangeEndPreparedQueryViaOffset
		if i == 1 {
			buildFunc = sql.BuildUniqueKeyRangeEndPreparedQueryViaTemptable
		}
		query, explodedArgs, err := buildFunc(
			this.migrationContext.DatabaseName,
			this.migrationContext.OriginalTableName,
			&this.migrationContext.UniqueKey.Columns,
			this.migrationContext.MigrationIterationRangeMinValues.AbstractValues(),
			this.migrationContext.MigrationRangeMaxValues.AbstractValues(),
			atomic.LoadInt64(&this.migrationContext.ChunkSize),
			this.migrationContext.GetIteration() == 0,
			fmt.Sprintf("iteration:%d", this.migrationContext.GetIteration()),
		)
		if err != nil {
			return hasFurtherRange, err
		}
		rows, err := this.db.Query(query, explodedArgs...)
		if err != nil {
			return hasFurtherRange, err
		}
		iterationRangeMaxValues := sql.NewColumnValues(this.migrationContext.UniqueKey.Len())
		for rows.Next() {
			if err = rows.Scan(iterationRangeMaxValues.ValuesPointers...); err != nil {
				return hasFurtherRange, err
			}
			hasFurtherRange = true
		}
		if hasFurtherRange {
			this.migrationContext.MigrationIterationRangeMaxValues = iterationRangeMaxValues
			return hasFurtherRange, nil
		}
	}
	log.Debugf("Iteration complete: no further range to iterate")
	return hasFurtherRange, nil
}
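If the timeout really does surface on the rows iterator rather than from db.Query, one possible guard (my sketch, not necessarily how this should be fixed upstream) would be to check rows.Err() after the scan loop in the function above, so a killed statement propagates as an error instead of reading as "no further range":

		iterationRangeMaxValues := sql.NewColumnValues(this.migrationContext.UniqueKey.Len())
		for rows.Next() {
			if err = rows.Scan(iterationRangeMaxValues.ValuesPointers...); err != nil {
				return hasFurtherRange, err
			}
			hasFurtherRange = true
		}
		// A query killed by the server (e.g. via MAX_EXECUTION_TIME) can leave
		// rows.Next() returning false with the error only available from
		// rows.Err(); without this check the timeout looks like an empty range.
		if err = rows.Err(); err != nil {
			return hasFurtherRange, err
		}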