Description
While indexing about one month of the GitHub Archive dataset, I noticed that Quickwit starts intermittently failing to run merge pipelines, with some unusual effects:
- A merge occurs and attempts to merge splits A, B, and C; this operation succeeds.
- The system marks the splits for deletion.
- Some time later, a second merge operation is scheduled that attempts to merge split C again; this operation fails.
- Over time this happens more often.
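The sequence above can be restated as an invariant: once a split has been merged and marked for deletion, it should never be picked as input for a later merge. The following is a toy model of that expectation only. `SplitTracker` and its methods are hypothetical names for illustration, not Quickwit's metastore or merge planner, and it assumes the second merge fails because split C is no longer available.

```python
from dataclasses import dataclass, field


@dataclass
class SplitTracker:
    """Toy model of the split lifecycle (hypothetical, not Quickwit code)."""

    published: set = field(default_factory=set)
    marked_for_deletion: set = field(default_factory=set)

    def publish(self, *split_ids):
        """Register freshly indexed splits as available for merging."""
        self.published.update(split_ids)

    def merge(self, inputs, output):
        """Merge `inputs` into `output`, rejecting inputs that are not published."""
        stale = [s for s in inputs if s not in self.published]
        if stale:
            raise ValueError(f"cannot merge non-published splits: {stale}")
        # The inputs are replaced by the merged split and queued for deletion.
        self.published.difference_update(inputs)
        self.marked_for_deletion.update(inputs)
        self.published.add(output)


tracker = SplitTracker()
tracker.publish("A", "B", "C", "D")
tracker.merge(["A", "B", "C"], "ABC")  # first merge succeeds

try:
    tracker.merge(["C", "D"], "CD")  # C was already merged away
except ValueError as err:
    print(err)
```

In this model the second merge is rejected up front. The behaviour reported here looks as if the second merge is scheduled anyway and only fails later.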
Logs:
logs.txt
Reproduction:
The issue is somewhat intermittent, but it occurs on a recent update. The easiest way to try to reproduce it is via docker:
- Download the 2015 GH archive dataset for the month of January.
- Run the `chillfish8/quickwit-main:latest` docker image (it's just a docker build of whatever version of `main` I'm testing).
- Feed the data into Quickwit.
You may need to increase the number of concurrent connections pushing data into Quickwit to trigger this instability; I first saw the issue when running 4+ simultaneous connections.
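For anyone trying to reproduce, here is a minimal sketch of a loader that pushes the January 2015 files over several simultaneous connections. It is not the exact tooling used for this report. It assumes Quickwit is reachable on `localhost:7280`, that the ingest endpoint is `/api/v1/<index_id>/ingest`, and that the hourly files are served from `data.gharchive.org`. The 5 MB batch size is an arbitrary choice.

```python
import gzip
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: local Quickwit node and REST ingest endpoint; adjust for your setup.
INGEST_URL = "http://localhost:7280/api/v1/gharchive/ingest"


def archive_urls(year=2015, month=1, days=31):
    """Hourly GH Archive file URLs for the given month (hour is not zero-padded)."""
    return [
        f"https://data.gharchive.org/{year}-{month:02d}-{day:02d}-{hour}.json.gz"
        for day in range(1, days + 1)
        for hour in range(24)
    ]


def batches(lines, max_bytes=5_000_000):
    """Group NDJSON lines (bytes) into request bodies of at most about max_bytes."""
    batch, size = [], 0
    for line in lines:
        if batch and size + len(line) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(line)
        size += len(line)
    if batch:
        yield b"".join(batch)


def push_file(url):
    """Download one hourly archive and POST its documents to Quickwit."""
    with urllib.request.urlopen(url) as resp:
        lines = gzip.decompress(resp.read()).splitlines(keepends=True)
    for body in batches(lines):
        req = urllib.request.Request(INGEST_URL, data=body, method="POST")
        urllib.request.urlopen(req).close()


def main(connections=4):
    """Push all files using `connections` simultaneous workers."""
    with ThreadPoolExecutor(max_workers=connections) as pool:
        list(pool.map(push_file, archive_urls()))
```

Calling `main(connections=4)` starts the load; raising `connections` should make the instability more likely to show up, given that it first appeared at 4+ connections.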
Index File:
version: "0.4"
index_id: gharchive
doc_mapping:
  store_source: true
  field_mappings:
    - name: id
      type: text
      tokenizer: raw
    - name: type
      type: text
      tokenizer: raw
      fast: true
    - name: actor
      type: object
      field_mappings:
        - name: id
          type: u64
          fast: true
        - name: login
          type: text
          fast: true
          tokenizer: raw
        - name: gravatar_id
          type: text
          tokenizer: raw
        - name: url
          type: text
          tokenizer: raw
        - name: avatar_url
          type: text
          tokenizer: raw
    - name: repo
      type: object
      field_mappings:
        - name: id
          type: u64
          fast: true
        - name: name
          type: text
          fast: true
        - name: url
          type: text
          tokenizer: raw
    - name: payload
      type: json
      indexed: true
      tokenizer: default
      expand_dots: false
      record: position
    - name: created_at
      type: datetime
      precision: seconds
      fast: true
  timestamp_field: created_at
index_uri: s3://gharchive
Additional Quickwit configuration:
ingest_api:
  max_queue_memory_usage: 8GB
  max_queue_disk_usage: 16GB