
Uneven distribution of docs across shards, even with auto-generated ids #8041

@EmilBode

Elasticsearch Version

7.17.15

Installed Plugins

No response

Java Version

bundled

OS Version

Ubuntu 20.04.6 LTS

Problem Description

We have been running several parallel processes that all send bulk index requests to the same index. The documents from a single process now appear to be very unevenly distributed across our shards.
Looking at one part of the data, GET indexname/_count?preference=_shards: returns counts ranging from 2215 to 143810 documents on a single shard.
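
For anyone who wants to repeat this check, here is a minimal sketch of how the per-shard counts could be gathered and summarized. It is not the script used for this report. The helper names `shard_count_paths` and `skew_summary` are hypothetical, and the code assumes shards are numbered from 0 and that each path is sent as a separate `_count` request with whatever HTTP client is at hand.

```python
from statistics import mean


def shard_count_paths(index, num_shards):
    """Build one _count request path per shard (assumes 0-based shard numbers)."""
    return [f"/{index}/_count?preference=_shards:{n}" for n in range(num_shards)]


def skew_summary(counts):
    """Summarize how uneven a list of per-shard document counts is."""
    avg = mean(counts)
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": avg,
        # 1.0 means the fullest shard holds exactly the average number of docs.
        "max_over_mean": max(counts) / avg,
    }


if __name__ == "__main__":
    # Request paths for the 20-shard index from this report.
    for path in shard_count_paths("myindex", 20):
        print(path)
    # Only the two extreme counts from the report are known; the other
    # shards' counts would come from the responses to the requests above.
    print(skew_summary([2215, 143810]))
```

A `max_over_mean` value well above 1 for the documents of a single process, alongside a value close to 1 for the index as a whole, would match the behavior described here.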

Steps to Reproduce

Index creation

PUT myindex
{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 1,
    "refresh_interval": "300s",
    "routing": {
      "allocation": {
        "include": {
          "_tier_preference": "data_warm,data_hot"
        }
      }
    }
  }
}

Bulk indexing

Spin up 6 different .NET projects, all of which use the NEST client to bulk-index documents:

ElasticClient = new ElasticClient(connectionSettings);
var results = ElasticClient.BulkAll(objects, b => b
    .Index(myindex)
    .BufferToBulk((descriptor, list) =>
    {
        // Queue one index operation per document in the batch.
        foreach (var obj in list)
            descriptor.Index(i => i.Document(obj));
    })
    .RefreshOnCompleted(false)
    .MaxDegreeOfParallelism(4)
    .Size(10));

Expected behavior

An even distribution of all documents across the shards. This also means that the documents from process 1 are evenly spread, the documents from process 2 are evenly spread, and so on.

Observed behavior

Looking at all documents together, the spread is reasonably even. Looking only at the documents from a single process, however, a disproportionate number end up on a few shards.

Logs (if relevant)

No response
