
Server query timeout leads to data loss #822

@jarrydlee

Description

Running v1.0.48 (also reproduced on v1.0.49)
5.7.25-28-log Percona Server (GPL)

Problem

After running a migration we saw a significant amount of data loss, which forced us to roll back to the previous table.

Debugging

While looking through the gh-ost logs to determine what exactly had happened, I came across this:

19:45:31.422 - Copy: 17051612/298133316 5.7%; Applied: 57018; Backlog: 1000/1000; Time: 2h29m30s(total), 2h29m30s(copy); streamer: vt-0153850400-bin.013298:1024625; State: migrating; ETA: 41h4m25s
19:45:45.178 - INFO Row copy complete
19:45:45.179 - INFO executing gh-ost-on-before-cut-over hook: <path>
19:45:45.179 - Copy: 17051612/17051612 100.0%; Applied: 57018; Backlog: 997/1000; Time: 2h29m44s(total), 2h29m43s(copy); streamer: vt-0153850400-bin.013298:1037048; State: migrating; ETA: due
19:45:45.855 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 4)
19:45:45.855 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 4)
19:45:46.422 - Copy: 17051612/17051612 100.0%; Applied: 57448; Backlog: 1000/1000; Time: 2h29m45s(total), 2h29m43s(copy); streamer: vt-0153850400-bin.013298:2109787; State: postponing cut-over; ETA: due
19:45:46.963 - [error] binlogsyncer.go:631 connection was bad
19:45:47.963 - [info] binlogsyncer.go:573 begin to re-sync from (vt-0153850400-bin.013299, 1102571)
19:45:47.963 - [info] binlogsyncer.go:203 register slave for master server 172.16.172.155:3306
19:45:47.964 - [info] binlogsyncer.go:237 kill last connection id 290712
19:45:47.964 - [error] binlogsyncer.go:231 kill connection 290712 error ERROR 1094 (HY000): Unknown thread id: 290712
19:45:47.968 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 1102571)

As you can see, we jumped from 5.7% complete (17,051,612 of 298,133,316 rows copied, ETA 41 hours) to 100% complete and postponing cut-over in just 15 seconds. The table has ~300 million rows.

Reproducing

After a bit of debugging, I noticed some 300-second queries (300 seconds is our server execution timeout) that were killed by the server milliseconds before the copy "completed". I then ran SET GLOBAL MAX_EXECUTION_TIME=1; on a test database, so that any SELECT would time out instantly, and reran gh-ost against that test DB. The migration was instantly marked as postponing cut-over, even though the table has 100k rows in it.

Copy: 0/0 100.0%; Applied: 0; Backlog: 0/1000; Time: 2m25s(total), 1s(copy); streamer: vt-0059014500-bin.012257:233423; State: postponing cut-over; ETA: due

gh-ost
-assume-rbr
-max-load=Threads_running=100
-critical-load=Threads_running=1000
-verbose
-initially-drop-ghost-table
-initially-drop-old-table
-hooks-path=/usr/local/bin/hooks
-postpone-cut-over-flag-file=/tmp/ghost-cutover-postpone
-throttle-control-replicas="REPLICA_HOSTS"
-serve-socket-file=/tmp/gh-ost.sock
-approve-renamed-columns
-database="KEYSPACE"
-table="TABLE"
-host="HOST"
-conf="MYSQL_CONF_PATH"
-alter="ALTER"
-exact-rowcount
-debug
-execute

I looked into this a little further, and it appears that gh-ost does not handle server-side timeouts surfaced by go-sql-driver during row iteration: a timed-out query looks like an empty result set, which signifies there is nothing left to copy, since hasFurtherRange remains false when no rows are scanned [1].

[1]

gh-ost/go/logic/applier.go

Lines 413 to 454 in a8fae98

func (this *Applier) CalculateNextIterationRangeEndValues() (hasFurtherRange bool, err error) {
	this.migrationContext.MigrationIterationRangeMinValues = this.migrationContext.MigrationIterationRangeMaxValues
	if this.migrationContext.MigrationIterationRangeMinValues == nil {
		this.migrationContext.MigrationIterationRangeMinValues = this.migrationContext.MigrationRangeMinValues
	}
	for i := 0; i < 2; i++ {
		buildFunc := sql.BuildUniqueKeyRangeEndPreparedQueryViaOffset
		if i == 1 {
			buildFunc = sql.BuildUniqueKeyRangeEndPreparedQueryViaTemptable
		}
		query, explodedArgs, err := buildFunc(
			this.migrationContext.DatabaseName,
			this.migrationContext.OriginalTableName,
			&this.migrationContext.UniqueKey.Columns,
			this.migrationContext.MigrationIterationRangeMinValues.AbstractValues(),
			this.migrationContext.MigrationRangeMaxValues.AbstractValues(),
			atomic.LoadInt64(&this.migrationContext.ChunkSize),
			this.migrationContext.GetIteration() == 0,
			fmt.Sprintf("iteration:%d", this.migrationContext.GetIteration()),
		)
		if err != nil {
			return hasFurtherRange, err
		}
		rows, err := this.db.Query(query, explodedArgs...)
		if err != nil {
			return hasFurtherRange, err
		}
		iterationRangeMaxValues := sql.NewColumnValues(this.migrationContext.UniqueKey.Len())
		for rows.Next() {
			if err = rows.Scan(iterationRangeMaxValues.ValuesPointers...); err != nil {
				return hasFurtherRange, err
			}
			hasFurtherRange = true
		}
		if hasFurtherRange {
			this.migrationContext.MigrationIterationRangeMaxValues = iterationRangeMaxValues
			return hasFurtherRange, nil
		}
	}
	log.Debugf("Iteration complete: no further range to iterate")
	return hasFurtherRange, nil
}
