New default retry behavior: Retry until successful #151
TODO items:

The tests are failing for me locally, but some of the failures are not due to this PR.
fd5bae5 to d96bed3 (force-pushed)
OK, code done, tests written, docs updated. Ready for review!
New default retry behavior: Retry until successful

* Now makes sure the data is in Kafka before completion.

Prior, the default was `retries => 0`, which means never retry. The implication is that any fault (network failure, Kafka restart, etc.) could cause data loss.

This commit makes the following changes:

* `retries` now has no default value (aka: nil), which means retry until successful.
* Any value >= 0 for `retries` will behave the same as it did before.

Slight difference in internal behavior in this patch: we no longer ignore the `Future<RecordMetadata>` returned by `KafkaProducer.send()`. We send the whole batch of events and then wait for all of those operations to complete. If any fail, we retry only the failed transmissions.

Prior to this patch, we would call `send()`, which is asynchronous, and then acknowledge in the pipeline. This could cause data loss, even if the PQ was enabled, under the following circumstances:

1) Logstash calls send() to Kafka and the call returns, but that does not mean the data is in Kafka. We would then ack the transmission to the PQ even though Kafka may not have the data yet.
2) Logstash crashes before the KafkaProducer client actually sends the data to Kafka.

Fixes #149

Test Coverage:

* Move specs to call the newly-implemented multi_receive.

This also required a few important changes to the specs:

* Mocks (expect..to_receive) were missing `.and_call_original`, so method expectations were returning nil [1].
* The old `ssl` setting is now `security_protocol => "SSL"`.

[1] For example, ProducerRecord.new was returning `nil` due to the missing `.and_call_original`.
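The send-the-whole-batch-then-wait-on-every-future pattern described above can be sketched as follows. This is a minimal, hedged illustration in plain Ruby, not the plugin's actual code: `FakeFuture` and `FakeProducer` are hypothetical stand-ins for Java's `Future<RecordMetadata>` and `KafkaProducer`, so the sketch runs without Kafka.

```ruby
# Stand-in for Future<RecordMetadata>: get() blocks until the send is
# acknowledged, raising if the transmission failed.
FakeFuture = Struct.new(:record, :error) do
  def get
    raise error if error
    record
  end
end

# Stand-in for KafkaProducer: fails each record in fail_once_on exactly once.
class FakeProducer
  def initialize(fail_once_on: [])
    @fail_once_on = fail_once_on
  end

  def send(record)
    if @fail_once_on.delete(record)
      FakeFuture.new(record, RuntimeError.new("transient failure"))
    else
      FakeFuture.new(record, nil)
    end
  end
end

# retries: nil means "retry until successful" (the new default);
# an integer >= 0 preserves the old bounded-retry behavior.
# Returns any records still unsent when retries run out.
def transmit(producer, batch, retries: nil)
  remaining = retries
  until batch.empty?
    break if remaining && remaining < 0
    # Send the whole batch first, then wait on all of the futures.
    futures = batch.map { |record| [record, producer.send(record)] }
    batch = futures.each_with_object([]) do |(record, future), failed|
      begin
        future.get          # blocks until acked (or raises on failure)
      rescue
        failed << record    # only the failed transmissions are retried
      end
    end
    remaining -= 1 if remaining
  end
  batch
end
```

With `retries: nil` the loop keeps going until every record is acknowledged; with `retries: 0` it makes a single attempt and returns whatever failed, matching the old behavior.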
d96bed3 to
e132231
Compare
lib/logstash/outputs/kafka.rb
(comment on an outdated diff)

```ruby
futures = batch.collect { |record| @producer.send(record) }
```
@jordansissel one problem I see here is this:
The default max.block.ms (the timeout on a send() call, which can block if either the Kafka client's output buffer is full or metadata fetching is blocked) is 60s.

So for Kafka outages lasting longer than max.block.ms + (max metadata age), won't we start losing data? (Right now, I think we only catch these failures far upstream and just move on to the next batch.)

I think we should catch these and retry the send calls with a back-off on org.apache.kafka.common.errors.TimeoutException, right?
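One possible shape for that back-off retry is sketched below. This is a hedged illustration, not the plugin's implementation: `SendTimeout` is a pure-Ruby stand-in for org.apache.kafka.common.errors.TimeoutException, and the delays are shortened so the sketch runs quickly.

```ruby
# Stand-in for org.apache.kafka.common.errors.TimeoutException.
class SendTimeout < StandardError; end

# Retry a blocking operation (e.g. producer.send + future.get) with
# exponential back-off whenever it times out, and never give up --
# dropping the batch is exactly the data loss we want to avoid.
# Returns the number of attempts it took.
def send_with_backoff(max_backoff: 1.0, base: 0.01)
  backoff = base
  attempts = 0
  begin
    attempts += 1
    yield
  rescue SendTimeout
    sleep(backoff)
    backoff = [backoff * 2, max_backoff].min   # exponential, capped
    retry
  end
  attempts
end
```

In practice `base` and `max_backoff` would be tuned relative to max.block.ms; the values here are placeholders for illustration.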
Hmm.. Yeah, I didn't check what exceptions can be thrown here.
I'll add handling for the 3 listed here: https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
Ok I added coverage for the 3 exceptions thrown by KafkaProducer.send() and added test coverage for it as well.
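Assuming the three exceptions in question are the ones the linked 0.10.0 javadoc documents for send() (InterruptException, SerializationException, and TimeoutException), the classification might look like this sketch. The Ruby classes here are hypothetical stand-ins for the Java exception types, not the plugin's real code.

```ruby
# Stand-ins for the Java exception types KafkaProducer.send() can throw.
class InterruptException < StandardError; end
class SerializationException < StandardError; end
class TimeoutException < StandardError; end

# Timeouts and interrupts are transient, so the send can be retried.
# A SerializationException is a property of the record itself -- retrying
# the same record can never succeed, which is why DLQing it was left as a
# TODO. Anything unrecognized is treated as non-retriable so we fail
# loudly instead of looping forever.
def retriable?(error)
  case error
  when TimeoutException, InterruptException then true
  when SerializationException then false
  else false
  end
end
```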
I left a TODO item to handle SerializationExceptions by DLQing them, since I felt the DLQ was out of scope for this PR.

@jordansissel LGTM :)
Didn't realize there was this limitation in the kafka output plugin. When are you planning to merge this and release it to the public?
@jordansissel /cc @ppf2

It's a bug, not a feature.

Agreed. There is no '6.x' branch, but I will make one.
Does this fix mean that when Logstash fails to send data to Kafka, it will retry until the send succeeds? But while Logstash is retrying, where is the data held? Is there any persistent queue in Logstash?
Don't lose data!