Skip to content

[ML] Jobs fail to auto-close due to timeout when stopping the datafeed #30005

@elasticmachine

Description

@elasticmachine

Original comment by @sophiec20:

Found in 6.3.0 { "build" : { "hash" : "a47564f", "date" : "2018-04-09T21:45:51.347364Z" },

Showcase server, 2 node cluster on gcp

Some jobs failed to auto-close due to a timeout when trying to remove the datafeed persistent task.

System was running 6 concurrent jobs, all on gallery2018 data. These jobs were started 5s apart using the same script. All configs were identifal, except for the function used. The datafeed start request contained the same start and end dates Datafeed started (from: 1970-01-02T10:00:00.000Z to: 2018-12-31T00:00:00.000Z). The 6 jobs had the following configs:

min(bytes) partition=status influencers=clientip,status bucket=1h *
max(bytes) partition=status influencers=clientip,status bucket=1h *
mean(bytes) partition=status influencers=clientip,status bucket=1h *
median(bytes) partition=status influencers=clientip,status bucket=1h
varp(bytes) partition=status influencers=clientip,status bucket=1h
sum(bytes) partition=status influencers=clientip,status bucket=1h

An end date was specified, so I would expect the jobs to auto-close after they had finished processing the data in the index.

All jobs completed, however 3 of the jobs (see *) remained open. Their datafeeds were stopped. The errors are logged below.

[2018-04-11T23:30:34,129][INFO ][o.e.x.m.j.p.DataCountsReporter] [node2] [ga-mean-bytes-p-status-clientip-status-1h] 900000 records written to autodetect; missingFieldCount=0, invalidDateCount=0, outOfOrderCount=0
[2018-04-11T23:30:54,324][INFO ][o.e.x.m.j.p.DataCountsReporter] [node2] [ga-min-bytes-p-status-clientip-status-1h] 1000000 records written to autodetect; missingFieldCount=0, invalidDateCount=0, outOfOrderCount=0
[2018-04-11T23:31:01,912][INFO ][o.e.x.m.j.p.DataCountsReporter] [node2] [ga-sum-bytes-p-status-clientip-status-1h] 700000 records written to autodetect; missingFieldCount=0, invalidDateCount=0, outOfOrderCount=0
[2018-04-11T23:31:16,142][INFO ][o.e.x.m.d.DatafeedJob    ] [ga-min-bytes-p-status-clientip-status-1h] Lookback has finished
[2018-04-11T23:31:16,142][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] attempt to stop datafeed [datafeed-ga-min-bytes-p-status-clientip-status-1h] for job [ga-min-bytes-p-status-clientip-status-1h]
[2018-04-11T23:31:16,143][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] try lock [20s] to stop datafeed [datafeed-ga-min-bytes-p-status-clientip-status-1h] for job [ga-min-bytes-p-status-clientip-status-1h]...
[2018-04-11T23:31:16,143][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] stopping datafeed [datafeed-ga-min-bytes-p-status-clientip-status-1h] for job [ga-min-bytes-p-status-clientip-status-1h], acquired [true]...
[2018-04-11T23:31:16,143][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] datafeed [datafeed-ga-min-bytes-p-status-clientip-status-1h] for job [ga-min-bytes-p-status-clientip-status-1h] has been stopped
[2018-04-11T23:31:16,736][INFO ][o.e.x.m.j.p.DataCountsReporter] [node2] [ga-mean-bytes-p-status-clientip-status-1h] 1000000 records written to autodetect; missingFieldCount=0, invalidDateCount=0, outOfOrderCount=0
[2018-04-11T23:31:31,956][INFO ][o.e.x.m.d.DatafeedJob    ] [ga-mean-bytes-p-status-clientip-status-1h] Lookback has finished
[2018-04-11T23:31:31,956][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] attempt to stop datafeed [datafeed-ga-mean-bytes-p-status-clientip-status-1h] for job [ga-mean-bytes-p-status-clientip-status-1h]
[2018-04-11T23:31:31,956][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] try lock [20s] to stop datafeed [datafeed-ga-mean-bytes-p-status-clientip-status-1h] for job [ga-mean-bytes-p-status-clientip-status-1h]...
[2018-04-11T23:31:31,956][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] stopping datafeed [datafeed-ga-mean-bytes-p-status-clientip-status-1h] for job [ga-mean-bytes-p-status-clientip-status-1h], acquired [true]...
[2018-04-11T23:31:31,957][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] datafeed [datafeed-ga-mean-bytes-p-status-clientip-status-1h] for job [ga-mean-bytes-p-status-clientip-status-1h] has been stopped
[2018-04-11T23:31:36,150][ERROR][o.e.x.m.d.DatafeedManager] Failed to remove datafeed persistent task - will not auto close job [ga-min-bytes-p-status-clientip-status-1h]
java.lang.IllegalStateException: timed out after 20s
        at org.elasticsearch.persistent.PersistentTasksService$WaitForPersistentTaskStatusListener.onTimeout(PersistentTasksService.java:199) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.persistent.PersistentTasksService$1.onTimeout(PersistentTasksService.java:164) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:576) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-04-11T23:31:36,154][INFO ][o.e.x.m.j.p.DataCountsReporter] [node2] [ga-sum-bytes-p-status-clientip-status-1h] 800000 records written to autodetect; missingFieldCount=0, invalidDateCount=0, outOfOrderCount=0
[2018-04-11T23:31:51,958][ERROR][o.e.x.m.d.DatafeedManager] Failed to remove datafeed persistent task - will not auto close job [ga-mean-bytes-p-status-clientip-status-1h]
java.lang.IllegalStateException: timed out after 20s
        at org.elasticsearch.persistent.PersistentTasksService$WaitForPersistentTaskStatusListener.onTimeout(PersistentTasksService.java:199) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.persistent.PersistentTasksService$1.onTimeout(PersistentTasksService.java:164) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:576) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.3.0-SNAPSHOT.jar:6.3.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-04-11T23:32:07,180][INFO ][o.e.x.m.j.p.DataCountsReporter] [node2] [ga-sum-bytes-p-status-clientip-status-1h] 900000 records written to autodetect; missingFieldCount=0, invalidDateCount=0, outOfOrderCount=0
[2018-04-11T23:32:34,798][INFO ][o.e.x.m.j.p.DataCountsReporter] [node2] [ga-sum-bytes-p-status-clientip-status-1h] 1000000 records written to autodetect; missingFieldCount=0, invalidDateCount=0, outOfOrderCount=0
[2018-04-11T23:32:44,739][INFO ][o.e.x.m.d.DatafeedJob    ] [ga-sum-bytes-p-status-clientip-status-1h] Lookback has finished
[2018-04-11T23:32:44,739][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] attempt to stop datafeed [datafeed-ga-sum-bytes-p-status-clientip-status-1h] for job [ga-sum-bytes-p-status-clientip-status-1h]
[2018-04-11T23:32:44,739][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] try lock [20s] to stop datafeed [datafeed-ga-sum-bytes-p-status-clientip-status-1h] for job [ga-sum-bytes-p-status-clientip-status-1h]...
[2018-04-11T23:32:44,739][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] stopping datafeed [datafeed-ga-sum-bytes-p-status-clientip-status-1h] for job [ga-sum-bytes-p-status-clientip-status-1h], acquired [true]...
[2018-04-11T23:32:44,739][INFO ][o.e.x.m.d.DatafeedManager] [no_realtime] datafeed [datafeed-ga-sum-bytes-p-status-clientip-status-1h] for job [ga-sum-bytes-p-status-clientip-status-1h] has been stopped
[2018-04-11T23:33:04,477][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node2] Closing job [ga-sum-bytes-p-status-clientip-status-1h], because [close job (api)]
[2018-04-11T23:33:04,488][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [ga-sum-bytes-p-status-clientip-status-1h] [autodetect/10435] EMAIL REDACTED Handled 1023038 records
[2018-04-11T23:33:04,488][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [ga-sum-bytes-p-status-clientip-status-1h] [autodetect/10435] EMAIL REDACTED Pruning all models
[2018-04-11T23:33:04,589][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcess] [ga-sum-bytes-p-status-clientip-status-1h] State output finished
[2018-04-11T23:33:04,897][INFO ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [ga-sum-bytes-p-status-clientip-status-1h] 7259 buckets parsed from autodetect output
[2018-04-11T23:33:09,161][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [ga-sum-bytes-p-status-clientip-status-1h] job closed

The Job messages looked fine, although it was missing the final Job is closing message.

image

Metadata

Metadata

Labels

:mlMachine learning>bug

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions