Quickwit merge pipelines begin to fail during indexing #3035

@ChillFish8

Description

While indexing about one month of the GitHub Archive dataset, I noticed that Quickwit starts to intermittently fail to run merge pipelines, with some unusual effects:

  • A merge occurs and attempts to merge splits A, B, and C. This operation succeeds.
  • The system marks the splits for deletion.
  • Sometime later, a second merge operation is scheduled that attempts to merge split C again. This operation fails.
  • Over time, this happens more often.
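
The symptom above amounts to the same split appearing as an input to more than one merge operation. A minimal sketch of that check in Python, assuming the merge operations have already been extracted from the logs by hand as lists of split IDs (the input shape and the name `find_remerged_splits` are hypothetical and are not part of Quickwit or its log format):

```python
from collections import defaultdict


def find_remerged_splits(merge_ops):
    """Map each split ID used by more than one merge to the merges that used it.

    merge_ops is a list of merge operations in log order; each operation is
    a list of the split IDs it took as input (hypothetical data shape).
    """
    seen = defaultdict(list)
    for op_index, split_ids in enumerate(merge_ops):
        # set() so a split listed twice in one operation counts once.
        for split_id in set(split_ids):
            seen[split_id].append(op_index)
    # A healthy pipeline should never reuse a split that was already merged.
    return {split: ops for split, ops in seen.items() if len(ops) > 1}
```

For the sequence described above, `find_remerged_splits([["A", "B", "C"], ["C", "D"]])` reports split `C` as used by merges 0 and 1 (the second merge and split `D` are made up for illustration).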

Logs:
logs.txt

Reproduction:
The issue seems to be somewhat intermittent, but it occurs on a recent update. The easiest way to try to reproduce it is via Docker:

  • Download the 2015 GH archive dataset for the month of January.
  • Run the chillfish8/quickwit-main:latest Docker image (it's just a Docker build of whatever version of main I'm testing).
  • Feed the data into Quickwit.

You may need to increase the number of concurrent connections pushing data into Quickwit in order to trigger this instability. I first saw the issue when running 4+ simultaneous connections.
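
One possible way to generate that load is sketched below in Python. It is not the script used for this report. It assumes Quickwit's REST API is reachable at `http://localhost:7280`, that documents are pushed as newline-delimited JSON to the `/api/v1/<index_id>/ingest` endpoint, and that the archive is stored as gzipped hourly files. The batch size, worker count, and all function names are illustrative choices.

```python
import gzip
import itertools
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor

QUICKWIT_URL = "http://localhost:7280"  # assumption: local node, default REST port
INDEX_ID = "gharchive"                  # index_id from the index file in this report
BATCH_LINES = 5000                      # assumption: documents per request
WORKERS = 4                             # the issue was first seen with 4+ connections


def ingest_url(base_url, index_id):
    """Build the ingest endpoint URL for an index."""
    return f"{base_url}/api/v1/{index_id}/ingest"


def batches(lines, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(lines)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(batch):
    """POST one batch of JSON lines as an NDJSON body; return the HTTP status."""
    body = "\n".join(line.rstrip("\n") for line in batch).encode("utf-8")
    req = urllib.request.Request(
        ingest_url(QUICKWIT_URL, INDEX_ID), data=body, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def ingest_file(path):
    """Stream one gzipped archive file to Quickwit, batch by batch."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [post_batch(batch) for batch in batches(f, BATCH_LINES)]


def ingest_all(paths, workers=WORKERS):
    """Push several files at once, one connection per worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ingest_file, paths))


if __name__ == "__main__":
    # Pass the archive files to push as command-line arguments.
    ingest_all(sys.argv[1:])
```

Each worker streams one file at a time, so raising `WORKERS` directly raises the number of simultaneous connections.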

Index File:

version: "0.4"
index_id: gharchive

doc_mapping:
  store_source: true
  field_mappings:
    - name: id
      type: text
      tokenizer: raw
    - name: type
      type: text
      tokenizer: raw
      fast: true
    - name: actor
      type: object
      field_mappings:
        - name: id
          type: u64
          fast: true
        - name: login
          type: text
          fast: true
          tokenizer: raw
        - name: gravatar_id
          type: text
          tokenizer: raw
        - name: url
          type: text
          tokenizer: raw
        - name: avatar_url
          type: text
          tokenizer: raw
    - name: repo
      type: object
      field_mappings:
        - name: id
          type: u64
          fast: true
        - name: name
          type: text
          fast: true
        - name: url
          type: text
          tokenizer: raw
    - name: payload
      type: json
      indexed: true
      tokenizer: default
      expand_dots: false
      record: position
    - name: created_at
      type: datetime
      precision: seconds
      fast: true
  timestamp_field: created_at

index_uri: s3://gharchive

Additional quickwit configuration:

ingest_api:
  max_queue_memory_usage: 8GB
  max_queue_disk_usage: 16GB
