
[RFC] Support timechart command in PPL #3965

@selsong

Description


Implement timechart command functionality in OpenSearch PPL

Summary

Implement a timechart command in OpenSearch PPL that lets users generate temporal trend charts directly within PPL queries, streamlining the identification of patterns and anomalies in time-based datasets. The command aggregates a specified field and returns results with time as the X-axis.

Background

Currently, PPL users rely on the stats command for time-series data visualization and statistical analysis. While the stats command provides various aggregation computations and data processing capabilities, it was not specifically designed for time-series visualization use cases.

Current limitations of the PPL stats command for time-series use cases:

  • Verbose syntax for complex time-series queries
  • No built-in visualization-ready formatting
  • Manual sorting required (results not automatically time-ordered)
  • No automatic time gap filling (missing time periods not handled)
  • Span always appears as first column regardless of logical grouping preference

Additional timechart capabilities we want to add:

  • Flexible span or bins parameter to control the time bucket size
  • OTHER field: by default, when too many series would appear in the legend, timechart creates an OTHER field to capture the long tail
  • by syntax, designed to accept only one grouping field
  • Filling of missing time periods with nulls

Proposed Implementation

Create a dedicated timechart command with comprehensive syntax:

Complete Timechart Command Syntax

... 
| timechart 
    [span=<time-value>] 
    [bins=<int>]
    [limit=<int>]
    [useother=<boolean>]
    (<aggregation-function> [by <field>])

The aggregation function will support all aggregation functions already implemented for stats in PPL. Since stats currently supports only one aggregation function at a time, timechart will likewise support a single aggregation function, not multiple.

Detailed Feature Requirements

1. Core Time Binning Parameters

span=<value> - Fixed interval binning

source=logs | timechart span=1h count() by host

bins=<number> - Fixed number of equal-width bins

source=logs | timechart bins=100 count()
Only one of span or bins should be specified. If both are specified, span overrides bins. If neither is specified, the default is bins=100.
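As a rough sketch of the assumed default binning semantics (not the final implementation): with bins=N and no span, the engine would divide the query's time range into N equal-width buckets. The function name below is illustrative, not part of the proposed API.

```python
from datetime import datetime, timedelta

def compute_bin_width(start: datetime, end: datetime, bins: int = 100) -> timedelta:
    """Divide the query time range into `bins` equal-width buckets.

    Assumes the RFC's default of bins=100 when neither span nor bins is given.
    """
    if bins <= 0:
        raise ValueError("bins must be positive")
    return (end - start) / bins

# Example: a 1-day range with bins=100 yields buckets of 14.4 minutes.
width = compute_bin_width(datetime(2024, 7, 1), datetime(2024, 7, 2), bins=100)
```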

2. Series Management

limit=<int> - Control number of series

source=logs | timechart span=1h count() by status limit=5

  • Default: limit=10
  • For each value of the split-by field (status here), sum its aggregation values across all time buckets
  • Sort the values by that total (highest total count first) and select the top N
  • This differs from head: head picks the top N by row order, so limit requires additional sorting and aggregation

useother=<boolean> - Handle long tail series

source=logs | timechart span=1h count() by url useother=f

  • Default: useother=true
  • Creates an "OTHER" field to capture the remaining series when limit is exceeded
  • useother=false hides the OTHER field
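The limit and useother semantics above can be sketched in a few lines. This is an illustrative model only, assuming rows arrive as (time_bucket, series_value, aggregation_value) tuples and that limit=0 means unlimited series (as in the usage examples later in this RFC); it is not the Calcite implementation.

```python
from collections import defaultdict

def limit_series(rows, limit=10, useother=True):
    """rows: list of (time_bucket, series_value, agg_value) tuples.

    Keep the top-`limit` series ranked by total aggregation value;
    fold the remaining series into "OTHER" when useother is true.
    limit=0 is treated as unlimited (no OTHER folding needed).
    """
    totals = defaultdict(int)
    for _, series, value in rows:
        totals[series] += value
    # Rank series by their summed aggregation value, highest first.
    top = set(sorted(totals, key=totals.get, reverse=True)[:limit]) if limit > 0 else set(totals)
    out = defaultdict(int)
    for t, series, value in rows:
        if series in top:
            out[(t, series)] += value
        elif useother:
            out[(t, "OTHER")] += value
    return [(t, s, v) for (t, s), v in sorted(out.items())]
```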

3. Gap Filling and Continuity - Future Implementation

usenull=<boolean> - Null value handling

source=logs | timechart span=1h count() by status usenull=false

  • Default: usenull=true
  • Controls whether null values of the split-by field appear as a separate series

cont=<boolean> - Fill time gaps

source=logs | timechart span=1h count() cont=true

  • Default: cont=true
  • Automatically fills missing time periods with null values
  • Essential for consistent time-series visualization
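A minimal sketch of the cont=true gap-filling behavior described above, assuming bucketed results arrive as a mapping from bucket start time to value; missing buckets are emitted with None (null):

```python
from datetime import datetime, timedelta

def fill_gaps(buckets, start, end, span):
    """buckets: dict mapping bucket start time -> aggregated value.

    Walk the full [start, end) range in span-sized steps and emit
    (time, value) pairs, using None for buckets with no data.
    """
    out = []
    t = start
    while t < end:
        out.append((t, buckets.get(t)))
        t += span
    return out
```

Note the memory implication flagged under Performance Considerations: a wide time range with a small span materializes one row per bucket even when most buckets are empty.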

Implementation Requirements

Core Components

  • Grammar: Extend PPL lexer/parser with timechart command tokens and syntax rules
  • AST: Create/update Timechart node to support all parameters
  • Logic: Implement timechart algorithms, including:
    ◦ Automatic time bucketing with span/bins
    ◦ OTHER field generation
    ◦ Series limiting and filtering
  • Integration: Full Calcite integration with proper relational algebra generation

Key Features

  • Time alignment: Proper bucket boundary calculation
  • Series management: Top-N selection with OTHER field aggregation
  • Memory optimization: Efficient handling of large time ranges with gap filling
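For the time-alignment feature, one common convention is to floor each timestamp to a span-aligned boundary relative to the Unix epoch. The sketch below assumes that convention; the RFC leaves the exact alignment rule open:

```python
from datetime import datetime, timedelta, timezone

def align_bucket(ts: datetime, span: timedelta) -> datetime:
    """Floor a timestamp to its span-aligned bucket boundary.

    Buckets are aligned to the Unix epoch, so e.g. span=1h buckets
    start on the hour and span=15m buckets start at :00/:15/:30/:45.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    offset = (ts - epoch) % span  # how far ts is past the last boundary
    return ts - offset
```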

Performance Considerations

  • Memory usage: Gap filling can be memory-intensive for large time ranges
  • Query optimization: Single-pass aggregation where possible
  • Indexing: Leverage time-based indexes for efficient bucketing

Usage Examples

Basic Time Series

-- Grouped Time-Bucket Aggregation (Unlimited Series)
... | timechart span=1m limit=0 count() as errorCount by url

-- Time-based Throughput (No Grouping)
... | timechart count(_time) as tpm span=1m

-- Grouped Time-Bucket Aggregation (Default Series Filtering)
... | timechart span=1m count() as errorCount by url

Test Coverage Required

  • Unit tests for timechart command parsing and AST generation
  • Integration tests for end-to-end functionality
  • Calcite tests for logical plan generation
  • Performance tests with various timechart configurations
  • Gap filling and series management edge cases

Documentation

  • docs/user/ppl/cmd/timechart.rst - Complete parameter documentation with examples

Implementation Discussion

  1. PPL does not have implicit _time or @timestamp fields. We must use the actual time field names from the OpenSearch index mapping, which makes PPL more flexible but requires explicit field configuration.
  2. Timechart supports both aggregation by field and eval-function by field. However, PPL eval functions don't support conditional aggregation, so this first draft of the timechart implementation prioritizes the single aggregation-by-field case, which is the most popular and top-priority use case. eval can be supported in follow-up PRs.
  3. Gap filling with cont and usenull sees < 1% usage; these and other timechart parameters (partial, fixedrange, aligntime, minspan, format, sep, tz, dedup) can be implemented in follow-up PRs. This RFC includes only the highest-priority parameters.

Implementation Options for Dynamic Pivot
In Calcite, PIVOT appears to accept only static value lists; subqueries that generate the list dynamically are not allowed.

1) 2 Query Approach

  • First Query (Discovery): Execute a query to retrieve distinct values of the split-by field
SELECT DISTINCT host   
FROM events   
WHERE timestamp >= '2024-07-01 00:00:00'   
  AND timestamp < '2024-07-01 01:00:00'

Output:

[  
  {"host": "cache-01"},  
  {"host": "cache-02"},   
  {"host": "db-01"},  
  {"host": "db-02"}  
]
  • Second Query (Pivot Aggregation)
    Using the discovered values, the second query generates conditional aggregations:
SELECT   
  DATE_TRUNC('hour', timestamp) as time_bucket,  
  SUM(CASE WHEN host = 'cache-01' THEN 1 ELSE 0 END) as "cache-01",  
  SUM(CASE WHEN host = 'cache-02' THEN 1 ELSE 0 END) as "cache-02",  
-- ... one CASE statement per discovered host value 
FROM events  
WHERE timestamp >= '2024-07-01 00:00:00'   
  AND timestamp < '2024-07-01 01:00:00'  
GROUP BY DATE_TRUNC('hour', timestamp)
  • Advantages: Simpler; leverages existing query-engine capabilities; produces the final pivoted result directly from the database engine
  • Drawbacks: Added latency from issuing two queries to OpenSearch; potential data-consistency issues if data changes between the two queries
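The second query in this approach could be generated from the discovery results roughly as follows. This is an illustrative sketch using the field and table names from the example above (host, events); a real implementation must escape or parameterize the discovered values rather than interpolate them into SQL:

```python
def build_pivot_query(hosts, table="events", time_expr="DATE_TRUNC('hour', timestamp)"):
    """Generate the second (pivot) query from the distinct host values
    returned by the discovery query: one conditional SUM per host.

    Warning: for illustration only; values are interpolated unescaped,
    which would be a SQL-injection risk in production code.
    """
    cases = ",\n  ".join(
        f"SUM(CASE WHEN host = '{h}' THEN 1 ELSE 0 END) AS \"{h}\"" for h in hosts
    )
    return (
        f"SELECT\n  {time_expr} AS time_bucket,\n  {cases}\n"
        f"FROM {table}\nGROUP BY {time_expr}"
    )
```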

2) Add Custom SQL Operator / Planner Extension

  • Extend the SQL engine itself (Calcite planner and execution) to include a new custom operator that dynamically computes the distinct split-by values and pivots the data within a single query execution.
  • The planner builds a query plan that handles discovering distinct values and generating pivot columns on-the-fly.
  • Advantages: Achieves dynamic pivoting in one single query, no external orchestration.
  • Drawbacks: Requires deep integration and significant changes to the SQL parser, planner, and executor

3) Post-processing Pivot in Formatter Layer
Single Query (Normal Aggregation): Execute timechart with standard grouping:

SELECT   
  DATE_TRUNC('hour', timestamp) as time_bucket,  
  host,  
  COUNT(*) as count_value  
FROM events  
WHERE timestamp >= '2024-07-01 00:00:00'   
  AND timestamp < '2024-07-01 01:00:00'  
GROUP BY DATE_TRUNC('hour', timestamp), host

Query Output (Long Format):

{  
  "schema": [  
    {"name": "time_bucket", "type": "timestamp"},  
    {"name": "host", "type": "string"},   
    {"name": "count_value", "type": "bigint"}  
  ],  
  "datarows": [  
    ["2024-07-01 00:00:00", "cache-01", 1],  
    ["2024-07-01 00:00:00", "cache-02", 1],  
    ["2024-07-01 00:00:00", "web-01", 6],  
    // ... one row per time bucket + host combination  
  ]  
}

Post-Processing: The formatter layer transforms this "long" format into "wide" format by:

  • Discovering distinct split-by values (cache-01, cache-02, etc.)
  • Creating a new schema with a time column plus one column per distinct value
  • Pivoting rows by grouping on time_bucket and spreading host values across columns
  • Filling missing values with 0 for hosts with no data in a time bucket

  • Advantages: No changes to the SQL engine or query planner; flexible; single query execution; easier to implement and test
  • Drawbacks: Extra processing overhead outside the query engine; memory-intensive for large result sets; potential inefficiencies since it cannot leverage query optimizations inside the engine
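The long-to-wide transformation described above can be sketched as follows, operating on the datarows shape shown in the example output (this is a standalone illustration, not the actual formatter code):

```python
from collections import defaultdict

def pivot_long_to_wide(datarows):
    """Pivot [time_bucket, host, count] rows into one row per time bucket
    with one column per distinct host; missing values are filled with 0."""
    hosts = sorted({row[1] for row in datarows})  # discover distinct split-by values
    by_time = defaultdict(dict)
    for t, host, count in datarows:
        by_time[t][host] = count
    header = ["time_bucket"] + hosts
    rows = [[t] + [by_time[t].get(h, 0) for h in hosts] for t in sorted(by_time)]
    return header, rows
```

Note that this holds the whole result set in memory, which is the memory-intensity drawback called out above.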

4) Direct SQL Queries Without Pivot [CURRENT APPROACH]
Implement SQL queries in CalciteRelNodeVisitor directly, without pivot functionality. The final output is unpivoted, with limit and useother applied:

@timestamp           Age    Count
2023-01-01T01:00:00  50     3
2023-01-01T01:00:00  40     2
2023-01-01T01:00:00  OTHER  0
2023-01-01T02:00:00  50     2
2023-01-01T02:00:00  40     3
2023-01-01T02:00:00  OTHER  0
2023-01-01T03:00:00  50     1
2023-01-01T03:00:00  40     1
2023-01-01T03:00:00  OTHER  2

Open Questions

  • Which approach would you recommend for implementing dynamic pivot in the timechart command?
  • Are there other strategies or best practices we should consider?
  • What additional challenges / tradeoffs should we be aware of?

Metadata

Labels: PPL (piped processing language), enhancement (new feature or request), v3.3.0
