New default retry behavior: Retry until successful #151
TODO items:

The tests are failing for me locally, but some of the failures are not due to this PR.
fd5bae5 to d96bed3 (force-pushed)
OK, code done, tests written, docs updated. Ready for review!
New default retry behavior: Retry until successful

* Now makes sure the data is in Kafka before completion.

Prior, the default was `retries => 0`, which means never retry. The implication is that any fault (network failure, Kafka restart, etc.) could cause data loss.

This commit makes the following changes:

* `retries` now has no default value (aka: nil), which means retry until successful.
* Any value >= 0 for `retries` will behave the same as it did before.

Slight difference in internal behavior in this patch: we no longer ignore the `Future<RecordMetadata>` returned by `KafkaProducer.send()`. We send the whole batch of events and then wait for all of those operations to complete. If any fail, we retry only the failed transmissions.

Prior to this patch, we would call `send()`, which is asynchronous, and then acknowledge in the pipeline. This could cause data loss, even if the PQ was enabled, under the following circumstances:

1) Logstash calls send() to Kafka and the call returns, but that does not mean the data is in Kafka. We would then ack the transmission to the PQ even though Kafka may not have the data yet.
2) Logstash crashes before the KafkaProducer client actually sends the data to Kafka.

Fixes #149

Test Coverage:

* Move specs to call the newly-implemented multi_receive.

This also required a few important changes to the specs:

* Mocks (expect..to_receive) were missing `.and_call_original`, so method expectations were returning nil [1].
* The old `ssl` setting is now `security_protocol => "SSL"`.

[1] For example, ProducerRecord.new was returning `nil` due to the missing `.and_call_original`.
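The send-the-whole-batch-then-wait-on-every-future pattern described above can be sketched as follows. This is a minimal, hedged illustration in plain Ruby, not the plugin's actual code: `FakeFuture` and `FakeProducer` are hypothetical stand-ins for Java's `Future<RecordMetadata>` and `KafkaProducer`, so the sketch runs without Kafka.

```ruby
# Stand-in for Future<RecordMetadata>: get() blocks until the send is
# acknowledged, raising if the transmission failed.
FakeFuture = Struct.new(:record, :error) do
  def get
    raise error if error
    record
  end
end

# Stand-in for KafkaProducer: fails each record in fail_once_on exactly once.
class FakeProducer
  def initialize(fail_once_on: [])
    @fail_once_on = fail_once_on
  end

  def send(record)
    if @fail_once_on.delete(record)
      FakeFuture.new(record, RuntimeError.new("transient failure"))
    else
      FakeFuture.new(record, nil)
    end
  end
end

# retries: nil means "retry until successful" (the new default);
# an integer >= 0 preserves the old bounded-retry behavior.
# Returns any records still unsent when retries run out.
def transmit(producer, batch, retries: nil)
  remaining = retries
  until batch.empty?
    break if remaining && remaining < 0
    # Send the whole batch first, then wait on all of the futures.
    futures = batch.map { |record| [record, producer.send(record)] }
    batch = futures.each_with_object([]) do |(record, future), failed|
      begin
        future.get          # blocks until acked (or raises on failure)
      rescue
        failed << record    # only the failed transmissions are retried
      end
    end
    remaining -= 1 if remaining
  end
  batch
end
```

With `retries: nil` the loop keeps going until every record is acknowledged; with `retries: 0` it makes a single attempt and returns whatever failed, matching the old behavior.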
d96bed3 to
e132231
Compare
lib/logstash/outputs/kafka.rb
(comment on an outdated diff)

```ruby
futures = batch.collect { |record| @producer.send(record) }
```
@jordansissel one problem I see here is this:
The default max.block.ms (the timeout on a send() call, which can block if either the Kafka client's output buffer is full or metadata fetching is blocked) is 60s.

So for Kafka outages lasting longer than max.block.ms + (max metadata age), won't we start losing data? (Right now, I think we only catch these failures far upstream and just move on to the next batch.)

I think we should catch these and retry the send calls with a back-off on org.apache.kafka.common.errors.TimeoutException, right?
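One possible shape for that back-off retry is sketched below. This is a hedged illustration, not the plugin's implementation: `SendTimeout` is a pure-Ruby stand-in for org.apache.kafka.common.errors.TimeoutException, and the delays are shortened so the sketch runs quickly.

```ruby
# Stand-in for org.apache.kafka.common.errors.TimeoutException.
class SendTimeout < StandardError; end

# Retry a blocking operation (e.g. producer.send + future.get) with
# exponential back-off whenever it times out, and never give up --
# dropping the batch is exactly the data loss we want to avoid.
# Returns the number of attempts it took.
def send_with_backoff(max_backoff: 1.0, base: 0.01)
  backoff = base
  attempts = 0
  begin
    attempts += 1
    yield
  rescue SendTimeout
    sleep(backoff)
    backoff = [backoff * 2, max_backoff].min   # exponential, capped
    retry
  end
  attempts
end
```

In practice `base` and `max_backoff` would be tuned relative to max.block.ms; the values here are placeholders for illustration.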
Hmm.. Yeah, I didn't check what exceptions can be thrown here.
I'll add handling for the 3 listed here: https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
Ok I added coverage for the 3 exceptions thrown by KafkaProducer.send() and added test coverage for it as well.
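Assuming the three exceptions in question are the ones the linked 0.10.0 javadoc documents for send() (InterruptException, SerializationException, and TimeoutException), the classification might look like this sketch. The Ruby classes here are hypothetical stand-ins for the Java exception types, not the plugin's real code.

```ruby
# Stand-ins for the Java exception types KafkaProducer.send() can throw.
class InterruptException < StandardError; end
class SerializationException < StandardError; end
class TimeoutException < StandardError; end

# Timeouts and interrupts are transient, so the send can be retried.
# A SerializationException is a property of the record itself -- retrying
# the same record can never succeed, which is why DLQing it was left as a
# TODO. Anything unrecognized is treated as non-retriable so we fail
# loudly instead of looping forever.
def retriable?(error)
  case error
  when TimeoutException, InterruptException then true
  when SerializationException then false
  else false
  end
end
```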
I left a TODO item to handle SerializationExceptions by DLQing them, since I felt the DLQ was out of scope for this PR.

@jordansissel LGTM :)
Didn't realize there was this limitation in the kafka output plugin. When are you planning to merge this and release it to the public?
@jordansissel /cc @ppf2

It's a bug, not a feature.

Agreed. There is no '6.x' branch, but I will make one.
Does this fix mean that when Logstash fails to send data to Kafka, it will retry until the send succeeds? But while Logstash is retrying, where is the data held? Is there any persistent queue in Logstash?
Don't lose data!