Description
Implement timechart command functionality in OpenSearch PPL
Summary
Implement a timechart command in OpenSearch PPL that will allow users to generate temporal trend charts directly within PPL queries, streamlining the process of identifying patterns and anomalies in time-based datasets. It aggregates a specified field and returns time-bucketed results, with time as the X-axis.
Background
Currently, PPL users rely on the stats command for time-series data visualization and statistical analysis. While the stats command provides various aggregation computations and data processing capabilities, it was not specifically designed for time-series visualization use cases.
Current limitations of the PPL stats command for time-series use cases:
- Verbose syntax for complex time-series queries
- No built-in visualization-ready formatting
- Manual sorting required (results not automatically time-ordered)
- No automatic time gap filling (missing time periods not handled)
- Span always appears as first column regardless of logical grouping preference
Additional timechart capabilities we want to add:
- Flexible span or bins parameter to control the time bucket length
- OTHER field: by default, when too many split-by values would show up in the legend, timechart creates an OTHER field to capture the long tail
- by syntax, designed to accept only one grouping field
- Fill missing time periods with nulls
Proposed Implementation
Create a dedicated timechart command with comprehensive syntax:
Complete Timechart Command Syntax
...
| timechart
[span=<time-value>]
[bins=<int>]
[limit=<int>]
[useother=<boolean>]
(<aggregation-function> [by <field>])
The aggregation function can be any of the aggregation functions already implemented for stats in PPL. Since stats currently supports only one aggregation function at a time, timechart will likewise support a single aggregation function, not multiple.
Detailed Feature Requirements
1. Core Time Binning Parameters
span=<value> - Fixed interval binning
source=logs | timechart span=1h count() by host
bins=<number> - Fixed number of equal-width bins
source=logs | timechart bins=100 count()
Only one of span or bins should be specified. If both are specified, span overrides bins. If neither is specified, the default is bins=100.
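To make the bins semantics concrete, here is a rough SQL sketch (not actual engine output; events and timestamp are placeholder names) of deriving 100 equal-width buckets from the observed time range:
-- bins=100: bucket width is derived from the overall time range
-- (the +1 keeps the latest event inside bucket index 99)
WITH bounds AS (
  SELECT MIN(timestamp) AS min_ts, MAX(timestamp) AS max_ts
  FROM events
)
SELECT
  FLOOR(TIMESTAMPDIFF(SECOND, b.min_ts, e.timestamp) * 100.0
        / (TIMESTAMPDIFF(SECOND, b.min_ts, b.max_ts) + 1)) AS bucket_index,
  COUNT(*) AS cnt
FROM events e
CROSS JOIN bounds b
GROUP BY FLOOR(TIMESTAMPDIFF(SECOND, b.min_ts, e.timestamp) * 100.0
        / (TIMESTAMPDIFF(SECOND, b.min_ts, b.max_ts) + 1))
ORDER BY bucket_index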
2. Series Management
limit=<int> - Control number of series
source=logs | timechart span=1h count() by status limit=5
Default: limit=10
With limit=N, series are selected as follows:
- For each distinct value of the split-by status field, compute the sum of the aggregation value across all time buckets
- Sort the values by that total (highest total count first in this case)
- Keep the top N values
This is different from head: head picks the top N rows by row order, so additional sorting and aggregation are needed here.
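As a rough illustration (placeholder names, not the actual generated plan), the top-N selection can be expressed as an extra aggregation over the whole time range:
-- limit=5: rank split-by values by their total aggregation value and keep the top 5
SELECT status
FROM events
GROUP BY status
ORDER BY COUNT(*) DESC
LIMIT 5
Per-bucket rows whose status value falls outside this set are then either dropped or folded into the OTHER field, depending on useother.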
useother=<boolean> - Handle long tail series
source=logs | timechart span=1h count() by url useother=f
Default: useother=true
Creates "OTHER" field to capture remaining series when limit exceeded
useother=false hides the OTHER field
3. Gap Filling and Continuity - Future Implementation
usenull=<boolean> - Null value handling
source=logs | timechart span=1h count() by status usenull=false
Default: usenull=true
Controls whether null values appear as separate series
cont=<boolean> - Fill time gaps
source=logs | timechart span=1h count() cont=true
Default: cont=true
Automatically fills missing time periods with null values
Essential for consistent time-series visualization
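One possible sketch of gap filling in SQL, assuming the list of expected buckets is generated from the query time range (hard-coded here only for illustration):
-- Left-join the full bucket list against the aggregated results so that
-- empty periods still produce a row, with a NULL count
WITH buckets (time_bucket) AS (
  VALUES (TIMESTAMP '2024-07-01 00:00:00'),
         (TIMESTAMP '2024-07-01 01:00:00'),
         (TIMESTAMP '2024-07-01 02:00:00')
),
agg AS (
  SELECT DATE_TRUNC('hour', timestamp) AS time_bucket, COUNT(*) AS cnt
  FROM events
  GROUP BY DATE_TRUNC('hour', timestamp)
)
SELECT b.time_bucket, a.cnt
FROM buckets b
LEFT JOIN agg a ON a.time_bucket = b.time_bucket
ORDER BY b.time_bucket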
Implementation Requirements
Core Components
- Grammar: Extend PPL lexer/parser with timechart command tokens and syntax rules
- AST: Create/update Timechart node to support all parameters
- Logic: Implement timechart algorithms, including automatic time bucketing with span/bins, OTHER field generation, and series limiting and filtering
- Integration: Full Calcite integration with proper relational algebra generation
Key Features
- Time alignment: Proper bucket boundary calculation (see the sketch after this list)
- Series management: Top-N selection with OTHER field aggregation
- Memory optimization: Efficient handling of large time ranges with gap filling
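For the time alignment point above, a minimal sketch of the boundary arithmetic (assuming an epoch-seconds conversion such as UNIX_TIMESTAMP is available; placeholder names):
-- For an arbitrary span such as 5 minutes (300 s), date truncation alone is not enough;
-- the bucket start is floor(epoch / span) * span
SELECT
  FLOOR(UNIX_TIMESTAMP(timestamp) / 300) * 300 AS bucket_start_epoch,
  COUNT(*) AS cnt
FROM events
GROUP BY FLOOR(UNIX_TIMESTAMP(timestamp) / 300) * 300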
Performance Considerations
- Memory usage: Gap filling can be memory-intensive for large time ranges
- Query optimization: Single-pass aggregation where possible
- Indexing: Leverage time-based indexes for efficient bucketing
Usage Examples
Basic Time Series
-- Grouped Time-Bucket Aggregation (Unlimited Series)
... | timechart span=1m limit=0 count() as errorCount by url
-- Time-based Throughput (No Grouping)
... | timechart count(_time) as tpm span=1m
-- Grouped Time-Bucket Aggregation (Default Series Filtering)
... | timechart span=1m count() as errorCount by url
Test Coverage Required
- Unit tests for timechart command parsing and AST generation
- Integration tests for end-to-end functionality
- Calcite tests for logical plan generation
- Performance tests with various timechart configurations
- Gap filling and series management edge cases
Documentation
docs/user/ppl/cmd/timechart.rst - Complete parameter documentation with examples
Implementation Discussion
- PPL does not have implicit _time or @timestamp fields. We must use the actual time field names from the OpenSearch index mapping, which makes PPL more flexible but requires explicit field configuration.
- Timechart supports both aggregation by field and eval function by field. However, PPL eval functions don't support conditional aggregation, so for this first draft of the timechart implementation I prioritized only the single aggregation by field, since this is the most popular and highest-priority use case. eval can be supported in follow-up PRs.
- Gap filling with cont and usenull sees < 1% usage; these, along with other timechart parameters (partial, fixedrange, aligntime, minspan, format, sep, tz, dedup), can be implemented in follow-up PRs. I included only the highest-priority parameters in this RFC.
Implementation Options for Dynamic Pivot
In Calcite, PIVOT appears to accept only static value lists (subqueries that generate the list dynamically are not allowed).
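For reference, Calcite-style PIVOT looks roughly like the sketch below (placeholder names); the key limitation is that the IN list must be written out literally:
SELECT *
FROM (
  SELECT DATE_TRUNC('hour', timestamp) AS time_bucket, host, 1 AS one
  FROM events
)
PIVOT (
  SUM(one) FOR host IN ('cache-01' AS "cache-01", 'cache-02' AS "cache-02")
)
-- the IN (...) values cannot come from a subquery such as SELECT DISTINCT host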
1) Two-Query Approach
- First Query (Discovery): Execute a query to retrieve distinct values of the split-by field
SELECT DISTINCT host
FROM events
WHERE timestamp >= '2024-07-01 00:00:00'
AND timestamp < '2024-07-01 01:00:00'
Output:
[
{"host": "cache-01"},
{"host": "cache-02"},
{"host": "db-01"},
{"host": "db-02"},
]
- Second Query (Pivot Aggregation)
Using the discovered values, the second query generates conditional aggregations:
SELECT
DATE_TRUNC('hour', timestamp) as time_bucket,
SUM(CASE WHEN host = 'cache-01' THEN 1 ELSE 0 END) as "cache-01",
SUM(CASE WHEN host = 'cache-02' THEN 1 ELSE 0 END) as "cache-02",
-- ... one CASE statement per discovered host value
FROM events
WHERE timestamp >= '2024-07-01 00:00:00'
AND timestamp < '2024-07-01 01:00:00'
GROUP BY DATE_TRUNC('hour', timestamp)
- Advantages: Simpler, leverages existing query engine capabilities, produces final pivoted result directly from database engine
- Drawbacks: Performance and latency overhead from requiring two round trips to OpenSearch; potential data consistency issues if the data changes between the two queries
2) Add Custom SQL Operator / Planner Extension
- Extend the SQL engine itself (Calcite planner and execution) to include a new custom operator that dynamically computes the distinct split-by values and pivots the data within a single query execution.
- The planner builds a query plan that handles discovering distinct values and generating pivot columns on-the-fly.
- Advantages: Achieves dynamic pivoting in one single query, no external orchestration.
- Drawbacks: Requires deep integration and significant changes to the SQL parser, planner, and executor
3) Post-processing Pivot in Formatter Layer
Single Query (Normal Aggregation): Execute timechart with standard grouping:
SELECT
DATE_TRUNC('hour', timestamp) as time_bucket,
host,
COUNT(*) as count_value
FROM events
WHERE timestamp >= '2024-07-01 00:00:00'
AND timestamp < '2024-07-01 01:00:00'
GROUP BY DATE_TRUNC('hour', timestamp), host
Query Output (Long Format):
{
"schema": [
{"name": "time_bucket", "type": "timestamp"},
{"name": "host", "type": "string"},
{"name": "count_value", "type": "bigint"}
],
"datarows": [
["2024-07-01 00:00:00", "cache-01", 1],
["2024-07-01 00:00:00", "cache-02", 1],
["2024-07-01 00:00:00", "web-01", 6],
// ... one row per time bucket + host combination
]
}
Post-Processing: The formatter layer transforms this "long" format into "wide" format by:
- Discovering distinct split-by values (cache-01, cache-02, etc.)
- Creating new schema with time column + one column per distinct value
- Pivoting rows by grouping on time_bucket and spreading host values across columns
- Filling missing values with 0 for hosts with no data in a time bucket
- Advantages: Does not require changes to the SQL engine or query planner, flexible, single query execution, easier to implement and test.
- Drawbacks: Extra processing overhead outside the query engine, memory intensive for large result sets, potential inefficiencies since cannot leverage query optimizations inside the engine.
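For clarity, the wide format the formatter would emit for the long-format sample above might look like the following (illustrative; remaining rows elided):
{
  "schema": [
    {"name": "time_bucket", "type": "timestamp"},
    {"name": "cache-01", "type": "bigint"},
    {"name": "cache-02", "type": "bigint"},
    {"name": "web-01", "type": "bigint"}
  ],
  "datarows": [
    ["2024-07-01 00:00:00", 1, 1, 6]
    // ... one row per time bucket
  ]
}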
4) Direct SQL Queries Without Pivot [CURRENT APPROACH]
Implement SQL Queries in CalciteRelNodeVisitor directly without pivot functionality. Final output is unpivoted with limit and useother applied.
| @timestamp | Age | Count |
|---|---|---|
| 2023-01-01T01:00:00 | 50 | 3 |
| 2023-01-01T01:00:00 | 40 | 2 |
| 2023-01-01T01:00:00 | OTHER | 0 |
| 2023-01-01T02:00:00 | 50 | 2 |
| 2023-01-01T02:00:00 | 40 | 3 |
| 2023-01-01T02:00:00 | OTHER | 0 |
| 2023-01-01T03:00:00 | 50 | 1 |
| 2023-01-01T03:00:00 | 40 | 1 |
| 2023-01-01T03:00:00 | OTHER | 2 |
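A rough SQL sketch of what this approach computes (illustrative only; events, age, and timestamp are placeholder names, and limit=2 with useother=true is assumed):
WITH top_values AS (
  -- limit: keep the 2 split-by values with the highest overall totals
  SELECT age
  FROM events
  GROUP BY age
  ORDER BY COUNT(*) DESC
  LIMIT 2
),
labeled AS (
  -- useother: map every value outside the top N to the literal 'OTHER'
  SELECT
    DATE_TRUNC('hour', e.timestamp) AS time_bucket,
    CASE WHEN t.age IS NOT NULL THEN CAST(e.age AS VARCHAR) ELSE 'OTHER' END AS age
  FROM events e
  LEFT JOIN top_values t ON e.age = t.age
)
SELECT time_bucket, age, COUNT(*) AS cnt
FROM labeled
GROUP BY time_bucket, age
ORDER BY time_bucket, age
Producing OTHER rows with a count of 0 (as in the table above) would additionally require padding each time bucket, which this sketch omits.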
Open Questions
- Which approach would you recommend for implementing dynamic pivot in the timechart command?
- Are there other strategies or best practices we should consider?
- What additional challenges / tradeoffs should we be aware of?