
Commit e132231
Don't lose data!

* New default retry behavior: retry until successful.
* Now makes sure the data is in Kafka before completion.

Prior to this commit, the default was `retries => 0`, which means never retry. The implication is that any fault (network failure, Kafka restart, etc.) could cause data loss.

This commit makes the following changes:

* `retries` now has no default value (aka: nil).
* Any value >= 0 for `retries` will behave the same as it did before.

There is a slight difference in internal behavior in this patch: we no longer ignore the Future<RecordMetadata> returned by KafkaProducer.send(). We send the whole batch of events and then wait for all of those operations to complete. If any fail, we retry only the failed transmissions.

Prior to this patch, we would call `send()`, which is asynchronous, and then acknowledge in the pipeline. This could cause data loss, even with the PQ enabled, under the following circumstances:

1) Logstash calls send() to Kafka, which returns immediately, indicating that the data is in Kafka when that was not necessarily true. We would then ack the transmission to the PQ even though Kafka might not have the data yet.
2) Logstash crashes before the KafkaProducer client actually sends the data to Kafka.

Fixes #149

Test coverage:

* Moved specs to call the newly implemented multi_receive. This required a few important changes to the specs:
  * Mocks (expect..to receive) were not using `.and_call_original`, so method expectations were returning nil [1].
  * The old `ssl` setting is now `security_protocol => "SSL"`.

[1] ProducerRecord.new was returning `nil` due to the missing .and_call_original, for example.
1 parent 7c9699d commit e132231
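The send-the-batch, wait-on-futures, retry-only-failures flow described in the commit message can be sketched in plain Ruby, independent of Kafka. `FlakyProducer` and the lambda-as-future are illustrative stand-ins for KafkaProducer and Future<RecordMetadata>, not the plugin's actual classes:

```ruby
# Hypothetical stand-in for KafkaProducer: #send returns a "future"
# (here just a lambda) that raises when resolved, for the first
# `failures_per_record` attempts on each record.
class FlakyProducer
  def initialize(failures_per_record)
    @remaining = Hash.new { |h, k| h[k] = failures_per_record }
  end

  def send(record)
    lambda do
      if @remaining[record] > 0
        @remaining[record] -= 1
        raise "simulated transient fault for #{record}"
      end
      record
    end
  end
end

# Mirrors the shape of the commit's retrying_send: send the whole batch,
# wait on every future, collect only the failed records, retry just those.
# Returns the records that were ultimately dropped.
def retrying_send(producer, batch, retries: nil)
  attempts = 0
  while batch.any?
    # Finite retry (retries >= 0): give up once attempts are exhausted.
    return batch if !retries.nil? && attempts > retries
    attempts += 1

    futures = batch.map { |record| producer.send(record) }

    failures = []
    futures.each_with_index do |future, i|
      begin
        future.call # stands in for Future#get: block until the send resolves
      rescue
        failures << batch[i]
      end
    end
    batch = failures
  end
  [] # every record was acknowledged
end

producer = FlakyProducer.new(2) # each record fails twice, then succeeds
dropped = retrying_send(producer, %w[a b c])
puts dropped.inspect # prints []
```

With the new default (`retries` unset), the loop only exits once every record is acknowledged; a finite `retries` value bounds the attempts and accepts the resulting data loss.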

File tree

4 files changed, +159 -24 lines changed

docs/index.asciidoc

Lines changed: 10 additions & 3 deletions

@@ -291,10 +291,17 @@ retries are exhausted.
 ===== `retries`
 
 * Value type is <<number,number>>
-* Default value is `0`
+* There is no default value for this setting.
+
+The default retry behavior is to retry until successful. To prevent data loss,
+the use of this setting is discouraged.
+
+If you choose to set `retries`, a value greater than zero will cause the
+client to only retry a fixed number of times. This will result in data loss
+if a transport fault exists for longer than your retry count (network outage,
+Kafka down, etc).
 
-Setting a value greater than zero will cause the client to
-resend any record whose send fails with a potentially transient error.
+A value less than zero is a configuration error.
 
 [id="plugins-{type}s-{plugin}-retry_backoff_ms"]
 ===== `retry_backoff_ms`
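In pipeline-config terms, the documentation change above amounts to simply leaving `retries` unset. A sketch (the `bootstrap_servers` and `topic_id` values here are illustrative, not from this commit):

```
output {
  kafka {
    bootstrap_servers => "localhost:9092"
    topic_id => "logs"
    # retries => 3  # discouraged: a finite retry count drops events once exhausted
  }
}
```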

lib/logstash/outputs/kafka.rb

Lines changed: 86 additions & 11 deletions

@@ -109,9 +109,15 @@ class LogStash::Outputs::Kafka < LogStash::Outputs::Base
   # elapses the client will resend the request if necessary or fail the request if
   # retries are exhausted.
   config :request_timeout_ms, :validate => :string
-  # Setting a value greater than zero will cause the client to
-  # resend any record whose send fails with a potentially transient error.
-  config :retries, :validate => :number, :default => 0
+  # The default retry behavior is to retry until successful. To prevent data loss,
+  # the use of this setting is discouraged.
+  #
+  # If you choose to set `retries`, a value greater than zero will cause the
+  # client to only retry a fixed number of times. This will result in data loss
+  # if a transient error outlasts your retry count.
+  #
+  # A value less than zero is a configuration error.
+  config :retries, :validate => :number
   # The amount of time to wait before attempting to retry a failed produce request to a given topic partition.
   config :retry_backoff_ms, :validate => :number, :default => 100
   # The size of the TCP send buffer to use when sending data.

@@ -170,6 +176,17 @@ class LogStash::Outputs::Kafka < LogStash::Outputs::Base
 
   public
   def register
+    @thread_batch_map = Concurrent::Hash.new
+
+    if !@retries.nil?
+      if @retries < 0
+        raise ConfigurationError, "A negative retry count (#{@retries}) is not valid. Must be a value >= 0"
+      end
+
+      @logger.warn("Kafka output is configured with finite retry. This instructs Logstash to LOSE DATA after a set number of send attempts fails. If you do not want to lose data if Kafka is down, then you must remove the retry setting.", :retries => @retries)
+    end
+
+
     @producer = create_producer
     @codec.on_event do |event, data|
       begin

@@ -178,22 +195,80 @@ def register
       else
         record = org.apache.kafka.clients.producer.ProducerRecord.new(event.sprintf(@topic_id), event.sprintf(@message_key), data)
       end
-      @producer.send(record)
+      prepare(record)
     rescue LogStash::ShutdownSignal
       @logger.debug('Kafka producer got shutdown signal')
     rescue => e
       @logger.warn('kafka producer threw exception, restarting',
                    :exception => e)
     end
   end
-
 end # def register
 
-  def receive(event)
-    if event == LogStash::SHUTDOWN
-      return
+  def prepare(record)
+    # This output is threadsafe, so we need to keep a batch per thread.
+    @thread_batch_map[Thread.current].add(record)
+  end
+
+  def multi_receive(events)
+    t = Thread.current
+    if !@thread_batch_map.include?(t)
+      @thread_batch_map[t] = java.util.ArrayList.new(events.size)
+    end
+
+    events.each do |event|
+      break if event == LogStash::SHUTDOWN
+      @codec.encode(event)
+    end
+
+    batch = @thread_batch_map[t]
+    if batch.any?
+      retrying_send(batch)
+      batch.clear
     end
-    @codec.encode(event)
+  end
+
+  def retrying_send(batch)
+    remaining = @retries;
+
+    while batch.any?
+      if !remaining.nil?
+        if remaining < 0
+          # TODO(sissel): Offer to DLQ? Then again, if it's a transient fault,
+          # DLQing would make things worse (you dlq data that would be successful
+          # after the fault is repaired)
+          logger.info("Exhausted user-configured retry count when sending to Kafka. Dropping these events.",
+                      :max_retries => @retries, :drop_count => batch.count)
+          break
+        end
+
+        remaining -= 1
+      end
+
+      futures = batch.collect { |record| @producer.send(record) }
+
+      failures = []
+      futures.each_with_index do |future, i|
+        begin
+          result = future.get()
+        rescue => e
+          # TODO(sissel): Add metric to count failures, possibly by exception type.
+          logger.debug? && logger.debug("KafkaProducer.send() failed: #{e}", :exception => e);
+          failures << batch[i]
+        end
+      end
+
+      # No failures? Cool. Let's move on.
+      break if failures.empty?
+
+      # Otherwise, retry with any failed transmissions
+      batch = failures
+      delay = 1.0 / @retry_backoff_ms
+      logger.info("Sending batch to Kafka failed. Will retry after a delay.", :batch_size => batch.size,
+                  :failures => failures.size, :sleep => delay);
+      sleep(delay)
+    end
+
   end
 
   def close

@@ -217,8 +292,8 @@ def create_producer
     props.put(kafka::MAX_REQUEST_SIZE_CONFIG, max_request_size.to_s)
     props.put(kafka::RECONNECT_BACKOFF_MS_CONFIG, reconnect_backoff_ms) unless reconnect_backoff_ms.nil?
     props.put(kafka::REQUEST_TIMEOUT_MS_CONFIG, request_timeout_ms) unless request_timeout_ms.nil?
-    props.put(kafka::RETRIES_CONFIG, retries.to_s)
-    props.put(kafka::RETRY_BACKOFF_MS_CONFIG, retry_backoff_ms.to_s)
+    props.put(kafka::RETRIES_CONFIG, retries.to_s) unless retries.nil?
+    props.put(kafka::RETRY_BACKOFF_MS_CONFIG, retry_backoff_ms.to_s)
     props.put(kafka::SEND_BUFFER_CONFIG, send_buffer_bytes.to_s)
     props.put(kafka::VALUE_SERIALIZER_CLASS_CONFIG, value_serializer)

spec/integration/outputs/kafka_spec.rb

Lines changed: 1 addition & 1 deletion

@@ -157,7 +157,7 @@
   def load_kafka_data(config)
     kafka = LogStash::Outputs::Kafka.new(config)
     kafka.register
-    num_events.times do kafka.receive(event) end
+    kafka.multi_receive(num_events.times.collect { event })
     kafka.close
   end
spec/unit/outputs/kafka_spec.rb

Lines changed: 62 additions & 9 deletions

@@ -25,34 +25,87 @@
   context 'when outputting messages' do
     it 'should send logstash event to kafka broker' do
       expect_any_instance_of(org.apache.kafka.clients.producer.KafkaProducer).to receive(:send)
-        .with(an_instance_of(org.apache.kafka.clients.producer.ProducerRecord))
+        .with(an_instance_of(org.apache.kafka.clients.producer.ProducerRecord)).and_call_original
       kafka = LogStash::Outputs::Kafka.new(simple_kafka_config)
       kafka.register
-      kafka.receive(event)
+      kafka.multi_receive([event])
     end
 
     it 'should support Event#sprintf placeholders in topic_id' do
       topic_field = 'topic_name'
       expect(org.apache.kafka.clients.producer.ProducerRecord).to receive(:new)
-        .with("my_topic", event.to_s)
-      expect_any_instance_of(org.apache.kafka.clients.producer.KafkaProducer).to receive(:send)
+        .with("my_topic", event.to_s).and_call_original
+      expect_any_instance_of(org.apache.kafka.clients.producer.KafkaProducer).to receive(:send).and_call_original
       kafka = LogStash::Outputs::Kafka.new({'topic_id' => "%{#{topic_field}}"})
       kafka.register
-      kafka.receive(event)
+      kafka.multi_receive([event])
     end
 
     it 'should support field referenced message_keys' do
       expect(org.apache.kafka.clients.producer.ProducerRecord).to receive(:new)
-        .with("test", "172.0.0.1", event.to_s)
-      expect_any_instance_of(org.apache.kafka.clients.producer.KafkaProducer).to receive(:send)
+        .with("test", "172.0.0.1", event.to_s).and_call_original
+      expect_any_instance_of(org.apache.kafka.clients.producer.KafkaProducer).to receive(:send).and_call_original
       kafka = LogStash::Outputs::Kafka.new(simple_kafka_config.merge({"message_key" => "%{host}"}))
       kafka.register
-      kafka.receive(event)
+      kafka.multi_receive([event])
     end
 
     it 'should raise config error when truststore location is not set and ssl is enabled' do
-      kafka = LogStash::Outputs::Kafka.new(simple_kafka_config.merge({"ssl" => "true"}))
+      kafka = LogStash::Outputs::Kafka.new(simple_kafka_config.merge("security_protocol" => "SSL"))
       expect { kafka.register }.to raise_error(LogStash::ConfigurationError, /ssl_truststore_location must be set when SSL is enabled/)
     end
   end
+
+  context "when a send fails" do
+    context "and the default retries behavior is used" do
+      # Fail this many times and then finally succeed.
+      let(:failcount) { (rand * 10).to_i }
+
+      # Expect KafkaProducer.send() to get called again after every failure, plus the successful one.
+      let(:sendcount) { failcount + 1 }
+
+      it "should retry until successful" do
+        count = 0;
+
+        expect_any_instance_of(org.apache.kafka.clients.producer.KafkaProducer).to receive(:send)
+          .exactly(sendcount).times
+          .and_wrap_original do |m, *args|
+          if count < failcount
+            p count => failcount
+            count += 1
+            # inject some failures.
+
+            # Return a custom Future that will raise an exception to simulate a Kafka send() problem.
+            future = java.util.concurrent.FutureTask.new(java.util.concurrent.Callable.new { raise "Failed" })
+            future.run
+            future
+          else
+            m.call(*args)
+          end
+        end
+        kafka = LogStash::Outputs::Kafka.new(simple_kafka_config)
+        kafka.register
+        kafka.multi_receive([event])
+      end
+    end
+
+    context "and when retries is set by the user" do
+      let(:retries) { (rand * 10).to_i }
+      let(:max_sends) { retries + 1 }
+
+      it "should give up after retries are exhausted" do
+        expect_any_instance_of(org.apache.kafka.clients.producer.KafkaProducer).to receive(:send)
+          .at_most(max_sends).times
+          .and_wrap_original do |m, *args|
+          # Always fail.
+          future = java.util.concurrent.FutureTask.new(java.util.concurrent.Callable.new { raise "Failed" })
+          future.run
+          future
+        end
+        kafka = LogStash::Outputs::Kafka.new(simple_kafka_config.merge("retries" => retries))
+        kafka.register
+        kafka.multi_receive([event])
+      end
+    end
+  end
 end
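The failing-future trick in the specs above, pre-completing a java.util.concurrent.FutureTask whose callable raised so that a later future.get raises inside retrying_send, can be imitated in plain Ruby without the JVM. `FakeFuture` and `send_stub` are illustrative stand-ins:

```ruby
# A future that resolves its block eagerly and replays the result (or the
# captured error) on #get, like the pre-run FutureTask in the specs.
class FakeFuture
  def initialize(&block)
    @value = block.call
  rescue => e
    @error = e
  end

  def get
    raise @error if @error
    @value
  end
end

failcount = 3
count = 0
# Stand-in for the and_wrap_original stub around KafkaProducer#send.
send_stub = lambda do |record|
  if count < failcount
    count += 1
    FakeFuture.new { raise "Failed" } # inject a failure
  else
    FakeFuture.new { record }         # a real send resolves to RecordMetadata
  end
end

# A send that fails `failcount` times needs failcount + 1 calls to succeed.
attempts = 0
begin
  attempts += 1
  send_stub.call("msg").get
rescue
  retry
end
puts attempts # prints 4
```

Returning a completed future rather than raising from the stub itself matters: the plugin's failure handling lives in the `future.get()` rescue, so the test has to fail at get-time, not at send-time.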
