Conversation

@night-owl-1709
Contributor

Summary

This PR adds field-level disk consumption statistics to the _stats API, allowing users to understand how much disk space each field consumes in their Lucene segment files. This feature helps identify which fields are taking up the most space and enables better index optimization decisions.

Related Issue

Resolves #12113

Motivation

Users currently have segment-level file size statistics but lack visibility into which fields contribute to disk usage. This makes it difficult to:

  • Identify high-cardinality fields consuming excessive disk space
  • Optimize field mappings based on actual storage costs
  • Make informed decisions about field deletion or type changes

Implementation Details

Approach

The implementation estimates field-level statistics by splitting each file's size equally among the fields that use it (an approximation, not an exact per-field measurement):

  1. For each segment, read FieldInfos to determine which fields use which file types
  2. Get existing file sizes by extension (.tim, .dvd, .doc, etc.)
  3. Divide each file's size equally among all fields that use that file type
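The three steps above can be sketched roughly as follows. This is a minimal Python illustration, not the PR's Java code: the function name `attribute_file_sizes`, the input dictionaries, and the sample sizes are all hypothetical, and the use of integer division (dropping any remainder bytes) is an assumption about rounding that the description does not specify.

```python
from collections import defaultdict


def attribute_file_sizes(file_sizes, field_file_types):
    """Split each file's size equally among the fields that use its type.

    file_sizes: dict mapping extension (e.g. "tim") to size in bytes.
    field_file_types: dict mapping field name to the set of extensions it uses.
    Returns a dict mapping field name to {extension: attributed bytes}.
    """
    # Invert the mapping: which fields share each file type?
    fields_by_ext = defaultdict(list)
    for field, extensions in field_file_types.items():
        for ext in extensions:
            fields_by_ext[ext].append(field)

    result = defaultdict(dict)
    for ext, size in file_sizes.items():
        users = fields_by_ext.get(ext, [])
        if not users:
            continue  # no field claims this file type; leave it unattributed
        share = size // len(users)  # assumed rounding: remainder bytes are dropped
        for field in users:
            result[field][ext] = share
    return dict(result)


# Hypothetical segment: sizes and field usage are made up for illustration.
segment_files = {"tim": 300, "dvd": 100, "fdt": 90}
fields = {
    "title": {"tim", "fdt"},
    "author": {"tim", "dvd", "fdt"},
    "views": {"dvd", "fdt"},
}
sizes = attribute_file_sizes(segment_files, fields)
```

Every field that uses a given file type receives the same share, regardless of how much data it actually wrote to that file.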

Testing

Stats API Call

curl -X GET "http://localhost:9200/my-index/_stats/segments?include_field_level_segment_file_sizes=true"
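For scripted use, the same request could be issued from Python's standard library. This is only a sketch: the helper names `stats_url` and `fetch_stats` are hypothetical, and it assumes an unauthenticated cluster reachable at the address used in the curl call above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def stats_url(base_url, index, include_field_sizes=True):
    """Build the segments stats URL with the field-level flag set."""
    query = urlencode(
        {"include_field_level_segment_file_sizes": str(include_field_sizes).lower()}
    )
    return f"{base_url}/{index}/_stats/segments?{query}"


def fetch_stats(url):
    """Issue the GET request and decode the JSON body (requires a running cluster)."""
    with urlopen(url) as response:
        return json.load(response)


url = stats_url("http://localhost:9200", "my-index")
```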

Output

{ "_shards": { "total": 6, "successful": 3, "failed": 0 }, "_all": { "primaries": { "segments": { "count": 3, "memory_in_bytes": 0, "terms_memory_in_bytes": 0, "stored_fields_memory_in_bytes": 0, "term_vectors_memory_in_bytes": 0, "norms_memory_in_bytes": 0, "points_memory_in_bytes": 0, "doc_values_memory_in_bytes": 0, "index_writer_memory_in_bytes": 0, "version_map_memory_in_bytes": 0, "fixed_bit_set_memory_in_bytes": 0, "max_unsafe_auto_id_timestamp": -1, "remote_store": { "upload": { "total_upload_size": { "started_bytes": 0, "succeeded_bytes": 0, "failed_bytes": 0 }, "refresh_size_lag": { "total_bytes": 0, "max_bytes": 0 }, "max_refresh_time_lag_in_millis": 0, "total_time_spent_in_millis": 0, "pressure": { "total_rejections": 0 } }, "download": { "total_download_size": { "started_bytes": 0, "succeeded_bytes": 0, "failed_bytes": 0 }, "total_time_spent_in_millis": 0 } }, "segment_replication": { "max_bytes_behind": 0, "total_bytes_behind": 0, "max_replication_lag": 0 }, "file_sizes": {}, "field_level_file_sizes": { "_seq_no": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "author": { "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "dvm": { "size_in_bytes": 358 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 } }, "_source": { "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 }, "fdx": { "size_in_bytes": 18 } }, "_id": { "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "title": { "nvm": { "size_in_bytes": 207 }, "fnm": { "size_in_bytes": 315 }, "pos": { 
"size_in_bytes": 177 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "nvd": { "size_in_bytes": 94 } }, "publish_date": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "_version": { "dvm": { "size_in_bytes": 358 }, "dvd": { "size_in_bytes": 48 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "_primary_term": { "dvm": { "size_in_bytes": 358 }, "dvd": { "size_in_bytes": 48 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "views": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "content": { "nvm": { "size_in_bytes": 207 }, "fnm": { "size_in_bytes": 315 }, "pos": { "size_in_bytes": 177 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "nvd": { "size_in_bytes": 94 } } } } }, "total": { "segments": { "count": 3, "memory_in_bytes": 0, "terms_memory_in_bytes": 0, "stored_fields_memory_in_bytes": 0, "term_vectors_memory_in_bytes": 0, "norms_memory_in_bytes": 0, "points_memory_in_bytes": 0, "doc_values_memory_in_bytes": 0, "index_writer_memory_in_bytes": 0, "version_map_memory_in_bytes": 0, "fixed_bit_set_memory_in_bytes": 0, "max_unsafe_auto_id_timestamp": -1, "remote_store": { "upload": { "total_upload_size": { "started_bytes": 0, "succeeded_bytes": 0, "failed_bytes": 0 }, "refresh_size_lag": { "total_bytes": 0, 
"max_bytes": 0 }, "max_refresh_time_lag_in_millis": 0, "total_time_spent_in_millis": 0, "pressure": { "total_rejections": 0 } }, "download": { "total_download_size": { "started_bytes": 0, "succeeded_bytes": 0, "failed_bytes": 0 }, "total_time_spent_in_millis": 0 } }, "segment_replication": { "max_bytes_behind": 0, "total_bytes_behind": 0, "max_replication_lag": 0 }, "file_sizes": {}, "field_level_file_sizes": { "_seq_no": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "author": { "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "dvm": { "size_in_bytes": 358 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 } }, "_source": { "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 }, "fdx": { "size_in_bytes": 18 } }, "_id": { "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "title": { "nvm": { "size_in_bytes": 207 }, "fnm": { "size_in_bytes": 315 }, "pos": { "size_in_bytes": 177 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "nvd": { "size_in_bytes": 94 } }, "publish_date": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "_version": { "dvm": { "size_in_bytes": 358 }, "dvd": { "size_in_bytes": 48 }, "fdx": { "size_in_bytes": 18 }, "fnm": { 
"size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "_primary_term": { "dvm": { "size_in_bytes": 358 }, "dvd": { "size_in_bytes": 48 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "views": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "content": { "nvm": { "size_in_bytes": 207 }, "fnm": { "size_in_bytes": 315 }, "pos": { "size_in_bytes": 177 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "nvd": { "size_in_bytes": 94 } } } } } }, "indices": { "my-index": { "uuid": "QjqlEhehSRS8pI4xr6_gbg", "primaries": { "segments": { "count": 3, "memory_in_bytes": 0, "terms_memory_in_bytes": 0, "stored_fields_memory_in_bytes": 0, "term_vectors_memory_in_bytes": 0, "norms_memory_in_bytes": 0, "points_memory_in_bytes": 0, "doc_values_memory_in_bytes": 0, "index_writer_memory_in_bytes": 0, "version_map_memory_in_bytes": 0, "fixed_bit_set_memory_in_bytes": 0, "max_unsafe_auto_id_timestamp": -1, "remote_store": { "upload": { "total_upload_size": { "started_bytes": 0, "succeeded_bytes": 0, "failed_bytes": 0 }, "refresh_size_lag": { "total_bytes": 0, "max_bytes": 0 }, "max_refresh_time_lag_in_millis": 0, "total_time_spent_in_millis": 0, "pressure": { "total_rejections": 0 } }, "download": { "total_download_size": { "started_bytes": 0, "succeeded_bytes": 0, "failed_bytes": 0 }, "total_time_spent_in_millis": 0 } }, "segment_replication": { "max_bytes_behind": 0, "total_bytes_behind": 0, "max_replication_lag": 0 }, "file_sizes": {}, "field_level_file_sizes": { "_seq_no": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 
}, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "author": { "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "dvm": { "size_in_bytes": 358 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 } }, "_source": { "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 }, "fdx": { "size_in_bytes": 18 } }, "_id": { "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "title": { "nvm": { "size_in_bytes": 207 }, "fnm": { "size_in_bytes": 315 }, "pos": { "size_in_bytes": 177 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "nvd": { "size_in_bytes": 94 } }, "publish_date": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "_version": { "dvm": { "size_in_bytes": 358 }, "dvd": { "size_in_bytes": 48 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "_primary_term": { "dvm": { "size_in_bytes": 358 }, "dvd": { "size_in_bytes": 48 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "views": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "content": { "nvm": { "size_in_bytes": 207 }, 
"fnm": { "size_in_bytes": 315 }, "pos": { "size_in_bytes": 177 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "nvd": { "size_in_bytes": 94 } } } } }, "total": { "segments": { "count": 3, "memory_in_bytes": 0, "terms_memory_in_bytes": 0, "stored_fields_memory_in_bytes": 0, "term_vectors_memory_in_bytes": 0, "norms_memory_in_bytes": 0, "points_memory_in_bytes": 0, "doc_values_memory_in_bytes": 0, "index_writer_memory_in_bytes": 0, "version_map_memory_in_bytes": 0, "fixed_bit_set_memory_in_bytes": 0, "max_unsafe_auto_id_timestamp": -1, "remote_store": { "upload": { "total_upload_size": { "started_bytes": 0, "succeeded_bytes": 0, "failed_bytes": 0 }, "refresh_size_lag": { "total_bytes": 0, "max_bytes": 0 }, "max_refresh_time_lag_in_millis": 0, "total_time_spent_in_millis": 0, "pressure": { "total_rejections": 0 } }, "download": { "total_download_size": { "started_bytes": 0, "succeeded_bytes": 0, "failed_bytes": 0 }, "total_time_spent_in_millis": 0 } }, "segment_replication": { "max_bytes_behind": 0, "total_bytes_behind": 0, "max_replication_lag": 0 }, "file_sizes": {}, "field_level_file_sizes": { "_seq_no": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "author": { "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "dvm": { "size_in_bytes": 358 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 } }, "_source": { "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 }, "fdx": { "size_in_bytes": 18 } }, "_id": { "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, 
"fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "title": { "nvm": { "size_in_bytes": 207 }, "fnm": { "size_in_bytes": 315 }, "pos": { "size_in_bytes": 177 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "nvd": { "size_in_bytes": 94 } }, "publish_date": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "_version": { "dvm": { "size_in_bytes": 358 }, "dvd": { "size_in_bytes": 48 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "_primary_term": { "dvm": { "size_in_bytes": 358 }, "dvd": { "size_in_bytes": 48 }, "fdx": { "size_in_bytes": 18 }, "fnm": { "size_in_bytes": 315 }, "fdt": { "size_in_bytes": 202 } }, "views": { "kdi": { "size_in_bytes": 69 }, "dvd": { "size_in_bytes": 48 }, "fnm": { "size_in_bytes": 315 }, "kdm": { "size_in_bytes": 249 }, "fdt": { "size_in_bytes": 202 }, "kdd": { "size_in_bytes": 129 }, "dvm": { "size_in_bytes": 358 }, "fdx": { "size_in_bytes": 18 } }, "content": { "nvm": { "size_in_bytes": 207 }, "fnm": { "size_in_bytes": 315 }, "pos": { "size_in_bytes": 177 }, "fdt": { "size_in_bytes": 202 }, "doc": { "size_in_bytes": 66 }, "tim": { "size_in_bytes": 340 }, "tip": { "size_in_bytes": 84 }, "fdx": { "size_in_bytes": 18 }, "nvd": { "size_in_bytes": 94 } } } } } } } }
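To rank fields by attributed size, a response shaped like the one above could be summarized as in the sketch below. The helper name `total_bytes_per_field` is hypothetical; the `sample` dictionary reuses the `_source` and `_version` entries from the output and omits everything else.

```python
def total_bytes_per_field(stats_response, index):
    """Sum the per-extension sizes for each field, largest first."""
    field_sizes = (
        stats_response["indices"][index]["primaries"]["segments"]
        .get("field_level_file_sizes", {})
    )
    totals = {
        field: sum(entry["size_in_bytes"] for entry in by_ext.values())
        for field, by_ext in field_sizes.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Trimmed-down response containing only two fields from the output above.
sample = {
    "indices": {
        "my-index": {
            "primaries": {
                "segments": {
                    "field_level_file_sizes": {
                        "_source": {
                            "fnm": {"size_in_bytes": 315},
                            "fdt": {"size_in_bytes": 202},
                            "fdx": {"size_in_bytes": 18},
                        },
                        "_version": {
                            "dvm": {"size_in_bytes": 358},
                            "dvd": {"size_in_bytes": 48},
                            "fdx": {"size_in_bytes": 18},
                            "fnm": {"size_in_bytes": 315},
                            "fdt": {"size_in_bytes": 202},
                        },
                    }
                }
            }
        }
    }
}
ranked = total_bytes_per_field(sample, "my-index")
```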

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers Indexing Indexing, Bulk Indexing and anything related to indexing Indexing:Performance lucene Other labels Oct 23, 2025
@github-actions
Contributor

❌ Gradle check result for 185a301: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 2d1c12c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@msfroh
Contributor

msfroh commented Oct 23, 2025

> Divide each file's size equally among all fields that use that file type

Hmm... I don't think that's accurate. For example, not every document needs to have every field. So, an extremely sparse field may contribute much less to a given file than other fields. My main concern is that it could be misleading.

On the other hand, I'm not sure if there's much else we could do without changes to Lucene APIs. The true size used by a field is going to be format-specific. I think we would need to add something to each of the ...Format types that can be returned by a Codec, then each format would need to implement the logic to return field size.
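A toy calculation may make the sparse-field concern concrete. All numbers and field names below are made up, and the document-count weighting is only a stand-in for comparison: it is neither what the PR does nor a true per-field size, which, as noted above, would be format-specific.

```python
def equal_split(file_size, fields):
    """Attribute the same share of a file to every field that uses it."""
    share = file_size / len(fields)
    return {field: share for field in fields}


def weighted_split(file_size, docs_with_field):
    """Attribute a file in proportion to how many documents carry each field.

    Still an approximation: real per-field usage depends on the file format.
    """
    total_docs = sum(docs_with_field.values())
    return {
        field: file_size * count / total_docs
        for field, count in docs_with_field.items()
    }


# Hypothetical: a 1000-byte file shared by a dense and a very sparse field.
docs_with_field = {"dense_field": 990, "sparse_field": 10}
equal = equal_split(1000, list(docs_with_field))
weighted = weighted_split(1000, docs_with_field)
```

Under the equal split both fields are reported at 500 bytes, while the document-weighted estimate puts the sparse field at 10 bytes, which is the kind of gap that could mislead someone choosing which field to drop.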

@msfroh
Contributor

msfroh commented Oct 23, 2025

@rishabhmaurya -- Do you have a better idea for how to get per-field data sizes? Am I overcomplicating things or is it actually hard?

Development

Successfully merging this pull request may close these issues.

[Feature Request] Field level statistics of lucene index files

2 participants