What is the bug?
While experimenting with aggregation group-by on a dataset of 50,000 records, I noticed that not all buckets are returned for queries like these:
> source = nginx_logs
| stats count() by method, bytes
| stats count();
fetched rows / total rows = 1/1
+---------+
| count() |
|---------|
| 1000 | <--- Should be at least 10804, according to below
+---------+
> source = nginx_logs
| dedup method, bytes
| stats count();
fetched rows / total rows = 1/1
+---------+
| count() |
|---------|
| 10804 |
+---------+
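For context, here is a hedged guess at where the cap comes from (I haven't traced the generated query, so the shape below is an assumption): the group-by looks like it compiles to a composite aggregation, which emits buckets one page at a time in key order, and only the first page of 1000 is ever consumed. Roughly:

```python
# Suspected shape of what `stats count() by method, bytes` compiles to.
# This is an assumption about the plugin internals, not traced output.
suspected_body = {
    "size": 0,
    "aggs": {
        "composite_buckets": {
            "composite": {
                "size": 1000,  # would explain the 1000-bucket ceiling
                "sources": [
                    {"method": {"terms": {"field": "method"}}},
                    {"bytes": {"terms": {"field": "bytes"}}},
                ],
            }
        }
    },
}
```

If that's right, all the buckets exist server-side and the engine is simply not following after_key past the first page.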
The bucket count limitation is understandable (they're memory-intensive and shipping many thousands of buckets would exhaust bandwidth), but it does cause issues when trying to work with sorted aggregations. In this dataset, the top buckets should all be GET, not DELETE.
> source = nginx_logs
| stats count() by method, bytes
| sort -`count()`
| head;
fetched rows / total rows = 10/10
+---------+--------+-------+
| count() | method | bytes |
|---------+--------+-------|
| 13 | DELETE | 81 |
| 12 | DELETE | 74 |
| 10 | DELETE | 71 |
| 10 | DELETE | 95 |
| 10 | DELETE | 51 |
| 10 | DELETE | 88 |
| 9 | DELETE | 78 |
| 9 | DELETE | 53 |
| 9 | DELETE | 49 |
| 9 | DELETE | 55 |
+---------+--------+-------+
The same limitation leads to incorrect output from the top command:
> source = nginx_logs
| top bytes by method
| head;
fetched rows / total rows = 10/10
+--------+-------+-------+
| method | bytes | count |
|--------+-------+-------|
| DELETE | 81 | 13 |
| DELETE | 74 | 12 |
| DELETE | 51 | 10 |
| DELETE | 71 | 10 |
| DELETE | 88 | 10 |
| DELETE | 95 | 10 |
| DELETE | 44 | 9 |
| DELETE | 49 | 9 |
| DELETE | 52 | 9 |
| DELETE | 53 | 9 |
+--------+-------+-------+
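As a stopgap, querying the index directly with a count-ordered multi_terms aggregation does return the true top buckets, because the ordering is applied before the bucket list is truncated to size. A minimal sketch, assuming the opensearch-py client and the field names from the examples above (multi_terms requires OpenSearch 2.1+):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# multi_terms sorts buckets by doc count before truncating to `size`,
# so the top buckets returned here are the correct ones (up to the
# usual shard-level approximation of terms aggregations).
resp = client.search(
    index="nginx_logs",
    body={
        "size": 0,
        "aggs": {
            "top_groups": {
                "multi_terms": {
                    "terms": [{"field": "method"}, {"field": "bytes"}],
                    "size": 10,
                    "order": {"_count": "desc"},
                }
            }
        },
    },
)
for bucket in resp["aggregations"]["top_groups"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```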
How can one reproduce the bug?
- Create a dataset with well over 1000 distinct buckets under some grouping (a generator sketch follows this list)
- Run a group-by aggregation over that grouping and observe incorrect results when collecting or sorting the buckets
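A minimal generator sketch for such a dataset, assuming opensearch-py and a local cluster (the index name, skew, and value ranges are illustrative, not the exact dataset above):

```python
import random
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

METHODS = ["GET", "POST", "PUT", "DELETE"]

def docs():
    for _ in range(50_000):
        # Skew toward GET so the true top buckets are all GET,
        # mirroring the tables above.
        method = random.choices(METHODS, weights=[70, 15, 10, 5])[0]
        yield {
            "_index": "nginx_logs",
            "_source": {"method": method, "bytes": random.randint(40, 5000)},
        }

helpers.bulk(client, docs())
```

With roughly 20,000 possible (method, bytes) pairs, the distinct bucket count comfortably exceeds 1000.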
What is the expected behavior?
It seems like the buckets come out lexicographically sorted today, which implies that internally all the buckets are already being computed and sorted (i.e. returning incorrect results isn't a major cost optimization). I wonder if it'd be possible to support sorting in the bucket aggregation, so the 1000 buckets returned are actually the correct ones.
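If pushing the sort into the aggregation isn't feasible, another option is to page through the composite aggregation before sorting, since every bucket is already computed server-side. A client-side sketch of that idea, assuming opensearch-py and the suspected composite shape above (this is not the plugin's actual code):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

PAGE_SIZE = 1000
sources = [
    {"method": {"terms": {"field": "method"}}},
    {"bytes": {"terms": {"field": "bytes"}}},
]

buckets, after = [], None
while True:
    composite = {"size": PAGE_SIZE, "sources": sources}
    if after is not None:
        composite["after"] = after  # resume from the previous page
    resp = client.search(
        index="nginx_logs",
        body={"size": 0, "aggs": {"groups": {"composite": composite}}},
    )
    page = resp["aggregations"]["groups"]
    buckets.extend(page["buckets"])
    after = page.get("after_key")
    # Stop once the last (possibly partial) page has been read.
    if after is None or len(page["buckets"]) < PAGE_SIZE:
        break

# With every bucket in hand, the top-N by count is correct.
top10 = sorted(buckets, key=lambda b: b["doc_count"], reverse=True)[:10]
```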
What is your host/environment?
Mainline
Do you have any screenshots?
N/A
Do you have any additional context?
N/A