Description
Implement timechart command functionality in OpenSearch PPL
Summary
Implement a timechart command in OpenSearch PPL that will allow users to generate temporal trend charts directly within PPL queries, streamlining the process of identifying patterns and anomalies in time-based datasets. It aggregates a specified field and returns time-bucketed results, with time as the X-axis.
Background
Currently, PPL users rely on the stats command for time-series data visualization and statistical analysis. While the stats command provides various aggregation computations and data processing capabilities, it was not specifically designed for time-series visualization use cases.
Current limitations of the PPL stats command for time-series use cases:
- Verbose syntax for complex time-series queries
- No built-in visualization-ready formatting
- Manual sorting required (results not automatically time-ordered)
- No automatic time gap filling (missing time periods not handled)
- Span always appears as first column regardless of logical grouping preference
Additional timechart capabilities we want to add:
- Flexible span or bins parameter to control the time bucket length
- OTHER field: by default, when too many split-by values would show up in the legend, timechart creates an OTHER field to capture the long tail
- by syntax, designed to accept only one grouping field
- Fill missing time periods with nulls
Proposed Implementation
Create a dedicated timechart command with comprehensive syntax:
Complete Timechart Command Syntax
...
| timechart
[span=<time-value>]
[bins=<int>]
[limit=<int>]
[useother=<boolean>]
(<aggregation-function> [by <field>])
The aggregation function can be any of the aggregation functions already implemented for stats in PPL. Since stats currently supports only one aggregation function at a time, timechart will likewise support a single aggregation function, not multiple.
Detailed Feature Requirements
1. Core Time Binning Parameters
span=<value> - Fixed interval binning
source=logs | timechart span=1h count() by host
bins=<number> - Fixed number of equal-width bins
source=logs | timechart bins=100 count()
Only one of span or bins should be specified. If both are specified, span overrides bins. If neither is specified, the default is bins=100.
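To make the bins semantics concrete, here is a rough SQL sketch (not actual engine output; events and timestamp are placeholder names) of deriving 100 equal-width buckets from the observed time range:
-- bins=100: bucket width is derived from the overall time range
-- (the +1 keeps the latest event inside bucket index 99)
WITH bounds AS (
  SELECT MIN(timestamp) AS min_ts, MAX(timestamp) AS max_ts
  FROM events
)
SELECT
  FLOOR(TIMESTAMPDIFF(SECOND, b.min_ts, e.timestamp) * 100.0
        / (TIMESTAMPDIFF(SECOND, b.min_ts, b.max_ts) + 1)) AS bucket_index,
  COUNT(*) AS cnt
FROM events e
CROSS JOIN bounds b
GROUP BY FLOOR(TIMESTAMPDIFF(SECOND, b.min_ts, e.timestamp) * 100.0
        / (TIMESTAMPDIFF(SECOND, b.min_ts, b.max_ts) + 1))
ORDER BY bucket_index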
2. Series Management
limit=<int> - Control number of series
source=logs | timechart span=1h count() by status limit=5
Default: limit=10
With limit=N, series are selected as follows:
- For each distinct value of the split-by status field, compute the sum of the aggregation value across all time buckets
- Sort the values by that total (highest total count first in this case)
- Keep the top N values
This is different from head: head picks the top N rows by row order, so additional sorting and aggregation are needed here.
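As a rough illustration (placeholder names, not the actual generated plan), the top-N selection can be expressed as an extra aggregation over the whole time range:
-- limit=5: rank split-by values by their total aggregation value and keep the top 5
SELECT status
FROM events
GROUP BY status
ORDER BY COUNT(*) DESC
LIMIT 5
Per-bucket rows whose status value falls outside this set are then either dropped or folded into the OTHER field, depending on useother.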
useother=<boolean> - Handle long tail series
source=logs | timechart span=1h count() by url useother=f
Default: useother=true
Creates "OTHER" field to capture remaining series when limit exceeded
useother=false hides the OTHER field
3. Gap Filling and Continuity - Future Implementation
usenull=<boolean> - Null value handling
source=logs | timechart span=1h count() by status usenull=false
Default: usenull=true
Controls whether null values appear as separate series
cont=<boolean> - Fill time gaps
source=logs | timechart span=1h count() cont=true
Default: cont=true
Automatically fills missing time periods with null values
Essential for consistent time-series visualization
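One possible sketch of gap filling in SQL, assuming the list of expected buckets is generated from the query time range (hard-coded here only for illustration):
-- Left-join the full bucket list against the aggregated results so that
-- empty periods still produce a row, with a NULL count
WITH buckets (time_bucket) AS (
  VALUES (TIMESTAMP '2024-07-01 00:00:00'),
         (TIMESTAMP '2024-07-01 01:00:00'),
         (TIMESTAMP '2024-07-01 02:00:00')
),
agg AS (
  SELECT DATE_TRUNC('hour', timestamp) AS time_bucket, COUNT(*) AS cnt
  FROM events
  GROUP BY DATE_TRUNC('hour', timestamp)
)
SELECT b.time_bucket, a.cnt
FROM buckets b
LEFT JOIN agg a ON a.time_bucket = b.time_bucket
ORDER BY b.time_bucket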
Implementation Requirements
Core Components
- Grammar: Extend PPL lexer/parser with timechart command tokens and syntax rules
- AST: Create/update Timechart node to support all parameters
- Logic: Implement timechart algorithms, including automatic time bucketing with span/bins, OTHER field generation, and series limiting and filtering
- Integration: Full Calcite integration with proper relational algebra generation
Key Features
- Time alignment: Proper bucket boundary calculation (see the sketch after this list)
- Series management: Top-N selection with OTHER field aggregation
- Memory optimization: Efficient handling of large time ranges with gap filling
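For the time alignment point above, a minimal sketch of the boundary arithmetic (assuming an epoch-seconds conversion such as UNIX_TIMESTAMP is available; placeholder names):
-- For an arbitrary span such as 5 minutes (300 s), date truncation alone is not enough;
-- the bucket start is floor(epoch / span) * span
SELECT
  FLOOR(UNIX_TIMESTAMP(timestamp) / 300) * 300 AS bucket_start_epoch,
  COUNT(*) AS cnt
FROM events
GROUP BY FLOOR(UNIX_TIMESTAMP(timestamp) / 300) * 300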
Performance Considerations
- Memory usage: Gap filling can be memory-intensive for large time ranges
- Query optimization: Single-pass aggregation where possible
- Indexing: Leverage time-based indexes for efficient bucketing
Usage Examples
Basic Time Series
-- Grouped Time-Bucket Aggregation (Unlimited Series)
... | timechart span=1m limit=0 count() as errorCount by url
-- Time-based Throughput (No Grouping)
... | timechart count(_time) as tpm span=1m
-- Grouped Time-Bucket Aggregation (Default Series Filtering)
... | timechart span=1m count() as errorCount by url
Test Coverage Required
- Unit tests for timechart command parsing and AST generation
- Integration tests for end-to-end functionality
- Calcite tests for logical plan generation
- Performance tests with various timechart configurations
- Gap filling and series management edge cases
Documentation
docs/user/ppl/cmd/timechart.rst - Complete parameter documentation with examples
Implementation Discussion
- PPL does not have implicit _time or @timestamp fields. We must use the actual time field names from the OpenSearch index mapping, which makes PPL more flexible but requires explicit field configuration.
- Timechart supports both aggregation by field and eval function by field. However, PPL eval functions don't support conditional aggregation, so for this first draft of the timechart implementation I prioritized only the single aggregation by field, since this is the most popular and highest-priority use case. eval can be supported in follow-up PRs.
- Gap filling with cont and usenull sees < 1% usage; these, along with other timechart parameters (partial, fixedrange, aligntime, minspan, format, sep, tz, dedup), can be implemented in follow-up PRs. I included only the highest-priority parameters in this RFC.
Implementation Options for Dynamic Pivot
In Calcite, PIVOT appears to accept only static value lists (subqueries that generate the list dynamically are not allowed).
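For reference, Calcite-style PIVOT looks roughly like the sketch below (placeholder names); the key limitation is that the IN list must be written out literally:
SELECT *
FROM (
  SELECT DATE_TRUNC('hour', timestamp) AS time_bucket, host, 1 AS one
  FROM events
)
PIVOT (
  SUM(one) FOR host IN ('cache-01' AS "cache-01", 'cache-02' AS "cache-02")
)
-- the IN (...) values cannot come from a subquery such as SELECT DISTINCT host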
1) Two-Query Approach
- First Query (Discovery): Execute a query to retrieve distinct values of the split-by field
SELECT DISTINCT host
FROM events
WHERE timestamp >= '2024-07-01 00:00:00'
AND timestamp < '2024-07-01 01:00:00'
Output:
[
{"host": "cache-01"},
{"host": "cache-02"},
{"host": "db-01"},
{"host": "db-02"},
]
- Second Query (Pivot Aggregation)
Using the discovered values, the second query generates conditional aggregations:
SELECT
DATE_TRUNC('hour', timestamp) as time_bucket,
SUM(CASE WHEN host = 'cache-01' THEN 1 ELSE 0 END) as "cache-01",
SUM(CASE WHEN host = 'cache-02' THEN 1 ELSE 0 END) as "cache-02",
-- ... one CASE statement per discovered host value
FROM events
WHERE timestamp >= '2024-07-01 00:00:00'
AND timestamp < '2024-07-01 01:00:00'
GROUP BY DATE_TRUNC('hour', timestamp)
- Advantages: Simpler, leverages existing query engine capabilities, produces final pivoted result directly from database engine
- Drawbacks: Performance and latency overhead from requiring two round trips to OpenSearch; potential data consistency issues if the data changes between the two queries
2) Add Custom SQL Operator / Planner Extension
- Extend the SQL engine itself (Calcite planner and execution) to include a new custom operator that dynamically computes the distinct split-by values and pivots the data within a single query execution.
- The planner builds a query plan that handles discovering distinct values and generating pivot columns on-the-fly.
- Advantages: Achieves dynamic pivoting in one single query, no external orchestration.
- Drawbacks: Requires deep integration and significant changes to the SQL parser, planner, and executor
3) Post-processing Pivot in Formatter Layer
Single Query (Normal Aggregation): Execute timechart with standard grouping:
SELECT
DATE_TRUNC('hour', timestamp) as time_bucket,
host,
COUNT(*) as count_value
FROM events
WHERE timestamp >= '2024-07-01 00:00:00'
AND timestamp < '2024-07-01 01:00:00'
GROUP BY DATE_TRUNC('hour', timestamp), host
Query Output (Long Format):
{
"schema": [
{"name": "time_bucket", "type": "timestamp"},
{"name": "host", "type": "string"},
{"name": "count_value", "type": "bigint"}
],
"datarows": [
["2024-07-01 00:00:00", "cache-01", 1],
["2024-07-01 00:00:00", "cache-02", 1],
["2024-07-01 00:00:00", "web-01", 6],
// ... one row per time bucket + host combination
]
}
Post-Processing: The formatter layer transforms this "long" format into "wide" format by:
- Discovering distinct split-by values (cache-01, cache-02, etc.)
- Creating new schema with time column + one column per distinct value
- Pivoting rows by grouping on time_bucket and spreading host values across columns
- Filling missing values with 0 for hosts with no data in a time bucket
- Advantages: Does not require changes to the SQL engine or query planner, flexible, single query execution, easier to implement and test.
- Drawbacks: Extra processing overhead outside the query engine, memory intensive for large result sets, potential inefficiencies since cannot leverage query optimizations inside the engine.
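For clarity, the wide format the formatter would emit for the long-format sample above might look like the following (illustrative; remaining rows elided):
{
  "schema": [
    {"name": "time_bucket", "type": "timestamp"},
    {"name": "cache-01", "type": "bigint"},
    {"name": "cache-02", "type": "bigint"},
    {"name": "web-01", "type": "bigint"}
  ],
  "datarows": [
    ["2024-07-01 00:00:00", 1, 1, 6]
    // ... one row per time bucket
  ]
}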
4) Direct SQL Queries Without Pivot [CURRENT APPROACH]
Implement SQL Queries in CalciteRelNodeVisitor directly without pivot functionality. Final output is unpivoted with limit and useother applied.
| @timestamp | Age | Count |
|---|---|---|
| 2023-01-01T01:00:00 | 50 | 3 |
| 2023-01-01T01:00:00 | 40 | 2 |
| 2023-01-01T01:00:00 | OTHER | 0 |
| 2023-01-01T02:00:00 | 50 | 2 |
| 2023-01-01T02:00:00 | 40 | 3 |
| 2023-01-01T02:00:00 | OTHER | 0 |
| 2023-01-01T03:00:00 | 50 | 1 |
| 2023-01-01T03:00:00 | 40 | 1 |
| 2023-01-01T03:00:00 | OTHER | 2 |
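A rough SQL sketch of what this approach computes (illustrative only; events, age, and timestamp are placeholder names, and limit=2 with useother=true is assumed):
WITH top_values AS (
  -- limit: keep the 2 split-by values with the highest overall totals
  SELECT age
  FROM events
  GROUP BY age
  ORDER BY COUNT(*) DESC
  LIMIT 2
),
labeled AS (
  -- useother: map every value outside the top N to the literal 'OTHER'
  SELECT
    DATE_TRUNC('hour', e.timestamp) AS time_bucket,
    CASE WHEN t.age IS NOT NULL THEN CAST(e.age AS VARCHAR) ELSE 'OTHER' END AS age
  FROM events e
  LEFT JOIN top_values t ON e.age = t.age
)
SELECT time_bucket, age, COUNT(*) AS cnt
FROM labeled
GROUP BY time_bucket, age
ORDER BY time_bucket, age
Producing OTHER rows with a count of 0 (as in the table above) would additionally require padding each time bucket, which this sketch omits.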
Open Questions
- Which approach would you recommend for implementing dynamic pivot in the timechart command?
- Are there other strategies or best practices we should consider?
- What additional challenges / tradeoffs should we be aware of?