Quickwit merge pipelines begin to fail during indexing #3035

While indexing about one month of the [GitHub Archive dataset](https://www.gharchive.org/), I noticed that Quickwit starts intermittently failing to run merge pipelines, with some unusual effects:

- A merge runs and attempts to merge splits `A`, `B`, and `C`. This operation succeeds.
- The system marks the splits for deletion.
- Some time later, a second merge operation is scheduled that attempts to merge split `C` again. This operation fails.
- Over time, this happens more often.
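To make the symptom concrete, here is a minimal sketch of the check I'd expect to hold: no split should be an input to more than one merge operation. The merge history below is hypothetical (split `D` and the list layout are made up for illustration, not taken from the logs), and the helper name is my own.

```python
from collections import Counter


def splits_merged_more_than_once(merge_ops):
    """Return split IDs that appear as input to more than one merge operation."""
    # set(op) ensures a split listed twice in the same operation counts once.
    counts = Counter(split_id for op in merge_ops for split_id in set(op))
    return sorted(split_id for split_id, n in counts.items() if n > 1)


# Hypothetical history mirroring the sequence described above.
merge_ops = [
    ["A", "B", "C"],  # first merge: succeeds, A/B/C are marked for deletion
    ["C", "D"],       # later merge: picks up C again and fails
]

print(splits_merged_more_than_once(merge_ops))  # ['C']
```

In a healthy run this list should be empty; in the failing case, split `C` shows up twice.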

Logs:
[logs.txt](https://github.com/quickwit-oss/quickwit/files/10996502/logs.txt)

Reproduction:
The issue seems to be somewhat intermittent, but it occurs on a recent update. The easiest way to try to reproduce it is via Docker:

- Download the 2015 GH Archive dataset for the month of January.
- Run the `chillfish8/quickwit-main:latest` Docker image (it's just a Docker build of whatever version of `main` I'm testing).
- Feed the data into Quickwit.
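For the download step, a small sketch that lists the hourly archive files for January 2015. It assumes GH Archive's usual hourly naming scheme (`https://data.gharchive.org/YYYY-MM-DD-H.json.gz`, hour not zero-padded); the helper name `gharchive_urls` is just illustrative.

```python
from datetime import date, timedelta


def gharchive_urls(year, month):
    """Build the hourly GH Archive URLs for one month (assumed URL scheme)."""
    urls = []
    day = date(year, month, 1)
    while day.month == month:
        for hour in range(24):
            # Hours are not zero-padded in the assumed naming scheme.
            urls.append(f"https://data.gharchive.org/{day.isoformat()}-{hour}.json.gz")
        day += timedelta(days=1)
    return urls


urls = gharchive_urls(2015, 1)
print(len(urls))  # 744 files: 31 days x 24 hours
print(urls[0])
```

Each file is gzip-compressed newline-delimited JSON, so it can be decompressed and pushed to the ingest API as-is.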

You may need to increase the number of concurrent connections pushing data into Quickwit in order to trigger this instability. I first saw the issue when running 4+ simultaneous connections.
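A rough sketch of how the concurrent pushing could look, not the exact script I used. It assumes a local node on the default REST port and the `/api/v1/<index id>/ingest` route taking NDJSON bodies; `INGEST_URL`, `post_ndjson`, and `ingest_concurrently` are names made up for this example.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: local Quickwit node, default REST port, index id from the config below.
INGEST_URL = "http://localhost:7280/api/v1/gharchive/ingest"


def post_ndjson(body: bytes) -> int:
    """POST one newline-delimited JSON batch and return the HTTP status code."""
    req = urllib.request.Request(INGEST_URL, data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status


def ingest_concurrently(batches, send=post_ndjson, workers=4):
    """Send batches using `workers` simultaneous connections; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches))
```

Calling `ingest_concurrently(batches, workers=4)` with the decompressed hourly files as `batches` roughly matches the 4+ connection setup where I first saw the failures; `send` is injectable so the fan-out can be tried without a running node.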

Index File:
```yaml
version: "0.4"
index_id: gharchive

doc_mapping:
  store_source: true
  field_mappings:
    - name: id
      type: text
      tokenizer: raw
    - name: type
      type: text
      tokenizer: raw
      fast: true
    - name: actor
      type: object
      field_mappings:
        - name: id
          type: u64
          fast: true
        - name: login
          type: text
          fast: true
          tokenizer: raw
        - name: gravatar_id
          type: text
          tokenizer: raw
        - name: url
          type: text
          tokenizer: raw
        - name: avatar_url
          type: text
          tokenizer: raw
    - name: repo
      type: object
      field_mappings:
        - name: id
          type: u64
          fast: true
        - name: name
          type: text
          fast: true
        - name: url
          type: text
          tokenizer: raw
    - name: payload
      type: json
      indexed: true
      tokenizer: default
      expand_dots: false
      record: position
    - name: created_at
      type: datetime
      precision: seconds
      fast: true
  timestamp_field: created_at

index_uri: s3://gharchive
```

Additional Quickwit configuration:

```yaml
ingest_api:
  max_queue_memory_usage: 8GB
  max_queue_disk_usage: 16GB
```
