Skip to content

Common expression elimination should also re-find the correct expression, during re-write. #9870

@wiedld

Description

@wiedld

Describe the bug

The common-expression-elimination optimizer errors when there are two aggr_fn(<short_circuit>) projections. Specifically, we have a reproducer with a failure when running:

statement ok
CREATE TABLE t1(
    time TIMESTAMP,
    load1 DOUBLE,
    load2 DOUBLE,
    host VARCHAR
) AS VALUES
  (to_timestamp_nanos(1527018806000000000), 1.1, 101, 'host1'),
  (to_timestamp_nanos(1527018806000000000), 2.2, 202, 'host2'),
  (to_timestamp_nanos(1527018806000000000), 3.3, 303, 'host3'),
  (to_timestamp_nanos(1527018806000000000), 1.1, 101, NULL)
;

# cannot have 2 projections with aggr(short_circuited), with the same short-circuited expr (e.g. CASE WHEN)
statement error DataFusion error: Optimizer rule 'common_sub_expression_eliminate' failed
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c1']), sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select struct(time,load1,load2,host) from t1) t2 where t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;

We isolated this error to how the IdArray is generated and used. The IdArray is generated by insertion-at-index during traversal of a first visitor. This IdArray is then read during the second visitor, also by an index, but this index is incremented differently as this second visitor traverses.

As such, the second visitor finds the wrong expr, and then inserts the wrong expression. We could make small changes to fix our bug, but then broke other statements. We feel that the index-based lookup may be inherently fragile to slight changes in traversal patterns across the two visitors; as such, we have a proposed alternative which will be up in PR shortly.

To Reproduce

Full reproducer:

statement ok
CREATE TABLE t1(
    time TIMESTAMP,
    load1 DOUBLE,
    load2 DOUBLE,
    host VARCHAR
) AS VALUES
  (to_timestamp_nanos(1527018806000000000), 1.1, 101, 'host1'),
  (to_timestamp_nanos(1527018806000000000), 2.2, 202, 'host2'),
  (to_timestamp_nanos(1527018806000000000), 3.3, 303, 'host3'),
  (to_timestamp_nanos(1527018806000000000), 1.1, 101, NULL)
;

# struct scalar function with columns
query ?
select struct(time,load1,load2,host) from t1;
----
{c0: 2018-05-22T19:53:26, c1: 1.1, c2: 101.0, c3: host1}
{c0: 2018-05-22T19:53:26, c1: 2.2, c2: 202.0, c3: host2}
{c0: 2018-05-22T19:53:26, c1: 3.3, c2: 303.0, c3: host3}
{c0: 2018-05-22T19:53:26, c1: 1.1, c2: 101.0, c3: }

# can have an aggregate function with an inner coalesce
query TR
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']) from (select struct(time,load1,load2,host) from t1) t2 where t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
----
host1 1.1
host2 2.2
host3 3.3

# can have an aggregate function with an inner CASE WHEN
query TR
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select struct(time,load1,load2,host) from t1) t2 where t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
----
host1 101
host2 202
host3 303

# can have 2 projections with aggr(short_circuited), with different short-circuited expr
query TRR
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']), sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select struct(time,load1,load2,host) from t1) t2 where t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
----
host1 1.1 101
host2 2.2 202
host3 3.3 303

# can have 2 projections with aggr(short_circuited), with the same short-circuited expr (e.g. CASE WHEN)
statement error DataFusion error: Optimizer rule 'common_sub_expression_eliminate' failed
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c1']), sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select struct(time,load1,load2,host) from t1) t2 where t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;

# can have 2 projections with aggr(short_circuited), with the same short-circuited expr (e.g. coalesce)
statement error DataFusion error: Optimizer rule 'common_sub_expression_eliminate' failed
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']), sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c2']) from (select struct(time,load1,load2,host) from t1) t2 where t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;

Expected behavior

Should succeed on all of the provided test cases. Should not lookup with wrong expr, and insert the wrong expression, during the common-expression-elimination re-writer.

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingregressionSomething that used to work no longer does

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions