
Server query timeout leads to data loss #822

@jarrydlee

Description

Running v1.0.48 (also reproduced on v1.0.49)
5.7.25-28-log Percona Server (GPL)

Problem

After running a migration we saw a significant amount of data loss, which forced us to roll back to the previous table.

Debugging

While looking through the gh-ost logs to determine what exactly had happened, I came across this:

19:45:31.422 - Copy: 17051612/298133316 5.7%; Applied: 57018; Backlog: 1000/1000; Time: 2h29m30s(total), 2h29m30s(copy); streamer: vt-0153850400-bin.013298:1024625; State: migrating; ETA: 41h4m25s
19:45:45.178 - INFO Row copy complete
19:45:45.179 - INFO executing gh-ost-on-before-cut-over hook: <path>
19:45:45.179 - Copy: 17051612/17051612 100.0%; Applied: 57018; Backlog: 997/1000; Time: 2h29m44s(total), 2h29m43s(copy); streamer: vt-0153850400-bin.013298:1037048; State: migrating; ETA: due
19:45:45.855 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 4)
19:45:45.855 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 4)
19:45:46.422 - Copy: 17051612/17051612 100.0%; Applied: 57448; Backlog: 1000/1000; Time: 2h29m45s(total), 2h29m43s(copy); streamer: vt-0153850400-bin.013298:2109787; State: postponing cut-over; ETA: due
19:45:46.963 - [error] binlogsyncer.go:631 connection was bad
19:45:47.963 - [info] binlogsyncer.go:573 begin to re-sync from (vt-0153850400-bin.013299, 1102571)
19:45:47.963 - [info] binlogsyncer.go:203 register slave for master server 172.16.172.155:3306
19:45:47.964 - [info] binlogsyncer.go:237 kill last connection id 290712
19:45:47.964 - [error] binlogsyncer.go:231 kill connection 290712 error ERROR 1094 (HY000): Unknown thread id: 290712
19:45:47.968 - [info] binlogsyncer.go:723 rotate to (vt-0153850400-bin.013299, 1102571)

As you can see, we jumped from 5.7% complete (17,051,612 of 298,133,316 rows copied, ETA 41 hours) to 100% complete and postponing cut-over in just 15 seconds. The table has ~300 million rows.

Reproducing

After a bit of debugging, I noticed some 300-second queries (300 seconds is our server execution timeout) that were killed by the server milliseconds before the copy "completed". I then ran SET GLOBAL MAX_EXECUTION_TIME=1; on a test database, so that any SELECT would time out instantly, and reran gh-ost against that test DB. The migration was instantly marked as postponing cut-over, even though the table has 100k rows in it.

Copy: 0/0 100.0%; Applied: 0; Backlog: 0/1000; Time: 2m25s(total), 1s(copy); streamer: vt-0059014500-bin.012257:233423; State: postponing cut-over; ETA: due

gh-ost
-assume-rbr
-max-load=Threads_running=100
-critical-load=Threads_running=1000
-verbose
-initially-drop-ghost-table
-initially-drop-old-table
-hooks-path=/usr/local/bin/hooks
-postpone-cut-over-flag-file=/tmp/ghost-cutover-postpone
-throttle-control-replicas="REPLICA_HOSTS"
-serve-socket-file=/tmp/gh-ost.sock
-approve-renamed-columns
-database="KEYSPACE"
-table="TABLE"
-host="HOST"
-conf="MYSQL_CONF_PATH"
-alter="ALTER"
-exact-rowcount
-debug
-execute

I looked into this a little further, and it appears that gh-ost does not handle server-side timeouts surfaced by go-sql-driver during row iteration: a timed-out query looks like an empty result set, which signifies there is nothing left to copy, since hasFurtherRange remains false when no rows are scanned [1].

[1]

gh-ost/go/logic/applier.go

Lines 413 to 454 in a8fae98

func (this *Applier) CalculateNextIterationRangeEndValues() (hasFurtherRange bool, err error) {
	this.migrationContext.MigrationIterationRangeMinValues = this.migrationContext.MigrationIterationRangeMaxValues
	if this.migrationContext.MigrationIterationRangeMinValues == nil {
		this.migrationContext.MigrationIterationRangeMinValues = this.migrationContext.MigrationRangeMinValues
	}
	for i := 0; i < 2; i++ {
		buildFunc := sql.BuildUniqueKeyRangeEndPreparedQueryViaOffset
		if i == 1 {
			buildFunc = sql.BuildUniqueKeyRangeEndPreparedQueryViaTemptable
		}
		query, explodedArgs, err := buildFunc(
			this.migrationContext.DatabaseName,
			this.migrationContext.OriginalTableName,
			&this.migrationContext.UniqueKey.Columns,
			this.migrationContext.MigrationIterationRangeMinValues.AbstractValues(),
			this.migrationContext.MigrationRangeMaxValues.AbstractValues(),
			atomic.LoadInt64(&this.migrationContext.ChunkSize),
			this.migrationContext.GetIteration() == 0,
			fmt.Sprintf("iteration:%d", this.migrationContext.GetIteration()),
		)
		if err != nil {
			return hasFurtherRange, err
		}
		rows, err := this.db.Query(query, explodedArgs...)
		if err != nil {
			return hasFurtherRange, err
		}
		iterationRangeMaxValues := sql.NewColumnValues(this.migrationContext.UniqueKey.Len())
		for rows.Next() {
			if err = rows.Scan(iterationRangeMaxValues.ValuesPointers...); err != nil {
				return hasFurtherRange, err
			}
			hasFurtherRange = true
		}
		if hasFurtherRange {
			this.migrationContext.MigrationIterationRangeMaxValues = iterationRangeMaxValues
			return hasFurtherRange, nil
		}
	}
	log.Debugf("Iteration complete: no further range to iterate")
	return hasFurtherRange, nil
}
