Skip to content

Feat: Revive to use upstream arrow coalesce #17105

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Aug 9, 2025

Which issue does this PR close?

Rationale for this change

And fix conflicts

What changes are included in this PR?

This PR refactors the BatchCoalescer in DataFusion to use the proposed upstream API to show that it

Can be used (api is complete enough)
Is not any slower

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Aug 9, 2025
@zhuqi-lucas zhuqi-lucas changed the title Revive to use upstream arrow coalesce Feat: Revive to use upstream arrow coalesce (https://github.com/apache/datafusion/pull/16249) Aug 9, 2025
@zhuqi-lucas zhuqi-lucas changed the title Feat: Revive to use upstream arrow coalesce (https://github.com/apache/datafusion/pull/16249) Feat: Revive to use upstream arrow coalesce [original PR](https://github.com/apache/datafusion/pull/16249) Aug 9, 2025
@zhuqi-lucas zhuqi-lucas changed the title Feat: Revive to use upstream arrow coalesce [original PR](https://github.com/apache/datafusion/pull/16249) Feat: Revive to use upstream arrow coalesce (original PR)[https://github.com/apache/datafusion/pull/16249] Aug 9, 2025
@zhuqi-lucas zhuqi-lucas changed the title Feat: Revive to use upstream arrow coalesce (original PR)[https://github.com/apache/datafusion/pull/16249] Feat: Revive to use upstream arrow coalesce Aug 9, 2025
@github-actions github-actions bot removed the core Core DataFusion crate label Aug 9, 2025
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Aug 9, 2025
@alamb
Copy link
Contributor

alamb commented Aug 9, 2025

I was just thinking about this PR last night -- thank you @zhuqi-lucas -- I'll kick off the benchmarks just to make sure

@alamb
Copy link
Contributor

alamb commented Aug 9, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing revive_to_use_upstream_arrow_coalesce (bac0197) to 9264bb8 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb
alamb previously approved these changes Aug 9, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas -- this looks great

As long as the benchmarks look good (I expect no substantial change) I think we should merge this

Poll::Ready(Some(Ok(batch)))
};
Some(Ok(batch)) => {
if self.coalescer.push_batch(batch)? {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this API to be somewhat confusing (the fact that a true return value means limit was reached)

Maybe returning an enum would be clearer here

I don't think this is a correctness issue, just a readability thing I noticed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for good suggestion, this is a better way i agree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed it in latest PR, it returns an enum now.

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Aug 9, 2025

Thank you @zhuqi-lucas -- this looks great

As long as the benchmarks look good (I expect no substantial change) I think we should merge this

Thank you @alamb for review, my follow-up plan, correct me if i am wrong:

  1. I will try to address this comments from @Dandandan Draft: Use upstream arrow coalesce kernel in DataFusion #16249 (comment) for this PR or follow-up to improve performance.
  2. Optimize the push_batch_with_filter performance, i can start from primitive type:
    [coalesce] Implement specialized push_batch_with_filter for primitive array arrow-rs#7762
  3. I can investigate try to change filter_exec with coalesce_batch_exec to coalesce_batch_exec(with filter) in single operator which can use upstream push_batch_with_filter, we may can remove coalesce_batch_exec operator in future?
  4. Apply to other combination with coalesce_batch_exec, such HashJoinExec, RepartitionExec, etc. But we also need to implement corresponding upstream logic before this.

@alamb
Copy link
Contributor

alamb commented Aug 9, 2025

🤖: Benchmark completed

Details

Comparing HEAD and revive_to_use_upstream_arrow_coalesce
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ revive_to_use_upstream_arrow_coalesce ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  1978.72 ms │                            1944.77 ms │     no change │
│ QQuery 1     │   771.03 ms │                             643.75 ms │ +1.20x faster │
│ QQuery 2     │  1504.59 ms │                            1283.66 ms │ +1.17x faster │
│ QQuery 3     │   618.98 ms │                             618.31 ms │     no change │
│ QQuery 4     │  1307.14 ms │                            1350.94 ms │     no change │
│ QQuery 5     │ 13852.75 ms │                           13996.89 ms │     no change │
│ QQuery 6     │  2285.06 ms │                            2155.42 ms │ +1.06x faster │
│ QQuery 7     │  1795.16 ms │                            1819.77 ms │     no change │
└──────────────┴─────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                    │ 24113.42ms │
│ Total Time (revive_to_use_upstream_arrow_coalesce)   │ 23813.51ms │
│ Average Time (HEAD)                                  │  3014.18ms │
│ Average Time (revive_to_use_upstream_arrow_coalesce) │  2976.69ms │
│ Queries Faster                                       │          3 │
│ Queries Slower                                       │          0 │
│ Queries with No Change                               │          5 │
│ Queries with Failure                                 │          0 │
└──────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ revive_to_use_upstream_arrow_coalesce ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.35 ms │                               2.64 ms │  1.12x slower │
│ QQuery 1     │    27.83 ms │                              29.15 ms │     no change │
│ QQuery 2     │    74.71 ms │                              73.36 ms │     no change │
│ QQuery 3     │    91.03 ms │                              89.84 ms │     no change │
│ QQuery 4     │   595.96 ms │                             602.12 ms │     no change │
│ QQuery 5     │   860.37 ms │                             835.65 ms │     no change │
│ QQuery 6     │     2.20 ms │                               2.20 ms │     no change │
│ QQuery 7     │    31.81 ms │                              32.72 ms │     no change │
│ QQuery 8     │   842.59 ms │                             846.27 ms │     no change │
│ QQuery 9     │  1160.15 ms │                            1133.52 ms │     no change │
│ QQuery 10    │   223.96 ms │                             213.82 ms │     no change │
│ QQuery 11    │   251.04 ms │                             242.97 ms │     no change │
│ QQuery 12    │   826.48 ms │                             815.01 ms │     no change │
│ QQuery 13    │  1193.58 ms │                            1188.55 ms │     no change │
│ QQuery 14    │   778.58 ms │                             771.01 ms │     no change │
│ QQuery 15    │   765.85 ms │                             772.44 ms │     no change │
│ QQuery 16    │  1576.56 ms │                            1578.18 ms │     no change │
│ QQuery 17    │  1580.58 ms │                            1578.56 ms │     no change │
│ QQuery 18    │  2820.92 ms │                            2842.70 ms │     no change │
│ QQuery 19    │    80.25 ms │                              79.64 ms │     no change │
│ QQuery 20    │  1152.98 ms │                            1100.78 ms │     no change │
│ QQuery 21    │  1302.14 ms │                            1259.29 ms │     no change │
│ QQuery 22    │  2136.74 ms │                            2064.58 ms │     no change │
│ QQuery 23    │  7499.03 ms │                            7311.46 ms │     no change │
│ QQuery 24    │   395.37 ms │                             376.85 ms │     no change │
│ QQuery 25    │   274.14 ms │                             254.31 ms │ +1.08x faster │
│ QQuery 26    │   395.58 ms │                             375.71 ms │ +1.05x faster │
│ QQuery 27    │  1519.29 ms │                            1518.21 ms │     no change │
│ QQuery 28    │ 11788.81 ms │                           12181.37 ms │     no change │
│ QQuery 29    │   526.17 ms │                             513.72 ms │     no change │
│ QQuery 30    │   753.70 ms │                             744.32 ms │     no change │
│ QQuery 31    │   768.23 ms │                             769.84 ms │     no change │
│ QQuery 32    │  2363.83 ms │                            2389.26 ms │     no change │
│ QQuery 33    │  3147.88 ms │                            3116.51 ms │     no change │
│ QQuery 34    │  3196.56 ms │                            3154.66 ms │     no change │
│ QQuery 35    │  1213.64 ms │                            1200.73 ms │     no change │
│ QQuery 36    │   123.40 ms │                             124.99 ms │     no change │
│ QQuery 37    │    54.61 ms │                              50.30 ms │ +1.09x faster │
│ QQuery 38    │   121.09 ms │                             118.44 ms │     no change │
│ QQuery 39    │   198.96 ms │                             192.82 ms │     no change │
│ QQuery 40    │    41.06 ms │                              40.60 ms │     no change │
│ QQuery 41    │    40.71 ms │                              40.11 ms │     no change │
│ QQuery 42    │    35.87 ms │                              36.53 ms │     no change │
└──────────────┴─────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                    │ 52836.59ms │
│ Total Time (revive_to_use_upstream_arrow_coalesce)   │ 52665.74ms │
│ Average Time (HEAD)                                  │  1228.76ms │
│ Average Time (revive_to_use_upstream_arrow_coalesce) │  1224.78ms │
│ Queries Faster                                       │          3 │
│ Queries Slower                                       │          1 │
│ Queries with No Change                               │         39 │
│ Queries with Failure                                 │          0 │
└──────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ revive_to_use_upstream_arrow_coalesce ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │  94.69 ms │                              93.93 ms │    no change │
│ QQuery 2     │  20.98 ms │                              22.54 ms │ 1.07x slower │
│ QQuery 3     │  32.02 ms │                              37.29 ms │ 1.16x slower │
│ QQuery 4     │  18.45 ms │                              20.03 ms │ 1.09x slower │
│ QQuery 5     │  48.49 ms │                              54.68 ms │ 1.13x slower │
│ QQuery 6     │  11.94 ms │                              11.63 ms │    no change │
│ QQuery 7     │  85.30 ms │                             100.53 ms │ 1.18x slower │
│ QQuery 8     │  26.25 ms │                              28.29 ms │ 1.08x slower │
│ QQuery 9     │  55.43 ms │                              58.77 ms │ 1.06x slower │
│ QQuery 10    │  39.79 ms │                              43.39 ms │ 1.09x slower │
│ QQuery 11    │  10.97 ms │                              11.70 ms │ 1.07x slower │
│ QQuery 12    │  29.20 ms │                              30.29 ms │    no change │
│ QQuery 13    │  25.57 ms │                              26.68 ms │    no change │
│ QQuery 14    │   9.50 ms │                              10.18 ms │ 1.07x slower │
│ QQuery 15    │  18.37 ms │                              18.72 ms │    no change │
│ QQuery 16    │  17.21 ms │                              18.79 ms │ 1.09x slower │
│ QQuery 17    │  94.62 ms │                              98.44 ms │    no change │
│ QQuery 18    │ 175.22 ms │                             174.39 ms │    no change │
│ QQuery 19    │  23.86 ms │                              25.42 ms │ 1.07x slower │
│ QQuery 20    │  30.88 ms │                              33.14 ms │ 1.07x slower │
│ QQuery 21    │ 140.56 ms │                             157.40 ms │ 1.12x slower │
│ QQuery 22    │  15.25 ms │                              16.53 ms │ 1.08x slower │
└──────────────┴───────────┴───────────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                    │ 1024.57ms │
│ Total Time (revive_to_use_upstream_arrow_coalesce)   │ 1092.74ms │
│ Average Time (HEAD)                                  │   46.57ms │
│ Average Time (revive_to_use_upstream_arrow_coalesce) │   49.67ms │
│ Queries Faster                                       │         0 │
│ Queries Slower                                       │        15 │
│ Queries with No Change                               │         7 │
│ Queries with Failure                                 │         0 │
└──────────────────────────────────────────────────────┴───────────┘

@zhuqi-lucas
Copy link
Contributor Author

🤖: Benchmark completed

Details

Comparing HEAD and revive_to_use_upstream_arrow_coalesce
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ revive_to_use_upstream_arrow_coalesce ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  1978.72 ms │                            1944.77 ms │     no change │
│ QQuery 1     │   771.03 ms │                             643.75 ms │ +1.20x faster │
│ QQuery 2     │  1504.59 ms │                            1283.66 ms │ +1.17x faster │
│ QQuery 3     │   618.98 ms │                             618.31 ms │     no change │
│ QQuery 4     │  1307.14 ms │                            1350.94 ms │     no change │
│ QQuery 5     │ 13852.75 ms │                           13996.89 ms │     no change │
│ QQuery 6     │  2285.06 ms │                            2155.42 ms │ +1.06x faster │
│ QQuery 7     │  1795.16 ms │                            1819.77 ms │     no change │
└──────────────┴─────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                    │ 24113.42ms │
│ Total Time (revive_to_use_upstream_arrow_coalesce)   │ 23813.51ms │
│ Average Time (HEAD)                                  │  3014.18ms │
│ Average Time (revive_to_use_upstream_arrow_coalesce) │  2976.69ms │
│ Queries Faster                                       │          3 │
│ Queries Slower                                       │          0 │
│ Queries with No Change                               │          5 │
│ Queries with Failure                                 │          0 │
└──────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ revive_to_use_upstream_arrow_coalesce ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.35 ms │                               2.64 ms │  1.12x slower │
│ QQuery 1     │    27.83 ms │                              29.15 ms │     no change │
│ QQuery 2     │    74.71 ms │                              73.36 ms │     no change │
│ QQuery 3     │    91.03 ms │                              89.84 ms │     no change │
│ QQuery 4     │   595.96 ms │                             602.12 ms │     no change │
│ QQuery 5     │   860.37 ms │                             835.65 ms │     no change │
│ QQuery 6     │     2.20 ms │                               2.20 ms │     no change │
│ QQuery 7     │    31.81 ms │                              32.72 ms │     no change │
│ QQuery 8     │   842.59 ms │                             846.27 ms │     no change │
│ QQuery 9     │  1160.15 ms │                            1133.52 ms │     no change │
│ QQuery 10    │   223.96 ms │                             213.82 ms │     no change │
│ QQuery 11    │   251.04 ms │                             242.97 ms │     no change │
│ QQuery 12    │   826.48 ms │                             815.01 ms │     no change │
│ QQuery 13    │  1193.58 ms │                            1188.55 ms │     no change │
│ QQuery 14    │   778.58 ms │                             771.01 ms │     no change │
│ QQuery 15    │   765.85 ms │                             772.44 ms │     no change │
│ QQuery 16    │  1576.56 ms │                            1578.18 ms │     no change │
│ QQuery 17    │  1580.58 ms │                            1578.56 ms │     no change │
│ QQuery 18    │  2820.92 ms │                            2842.70 ms │     no change │
│ QQuery 19    │    80.25 ms │                              79.64 ms │     no change │
│ QQuery 20    │  1152.98 ms │                            1100.78 ms │     no change │
│ QQuery 21    │  1302.14 ms │                            1259.29 ms │     no change │
│ QQuery 22    │  2136.74 ms │                            2064.58 ms │     no change │
│ QQuery 23    │  7499.03 ms │                            7311.46 ms │     no change │
│ QQuery 24    │   395.37 ms │                             376.85 ms │     no change │
│ QQuery 25    │   274.14 ms │                             254.31 ms │ +1.08x faster │
│ QQuery 26    │   395.58 ms │                             375.71 ms │ +1.05x faster │
│ QQuery 27    │  1519.29 ms │                            1518.21 ms │     no change │
│ QQuery 28    │ 11788.81 ms │                           12181.37 ms │     no change │
│ QQuery 29    │   526.17 ms │                             513.72 ms │     no change │
│ QQuery 30    │   753.70 ms │                             744.32 ms │     no change │
│ QQuery 31    │   768.23 ms │                             769.84 ms │     no change │
│ QQuery 32    │  2363.83 ms │                            2389.26 ms │     no change │
│ QQuery 33    │  3147.88 ms │                            3116.51 ms │     no change │
│ QQuery 34    │  3196.56 ms │                            3154.66 ms │     no change │
│ QQuery 35    │  1213.64 ms │                            1200.73 ms │     no change │
│ QQuery 36    │   123.40 ms │                             124.99 ms │     no change │
│ QQuery 37    │    54.61 ms │                              50.30 ms │ +1.09x faster │
│ QQuery 38    │   121.09 ms │                             118.44 ms │     no change │
│ QQuery 39    │   198.96 ms │                             192.82 ms │     no change │
│ QQuery 40    │    41.06 ms │                              40.60 ms │     no change │
│ QQuery 41    │    40.71 ms │                              40.11 ms │     no change │
│ QQuery 42    │    35.87 ms │                              36.53 ms │     no change │
└──────────────┴─────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                    │ 52836.59ms │
│ Total Time (revive_to_use_upstream_arrow_coalesce)   │ 52665.74ms │
│ Average Time (HEAD)                                  │  1228.76ms │
│ Average Time (revive_to_use_upstream_arrow_coalesce) │  1224.78ms │
│ Queries Faster                                       │          3 │
│ Queries Slower                                       │          1 │
│ Queries with No Change                               │         39 │
│ Queries with Failure                                 │          0 │
└──────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ revive_to_use_upstream_arrow_coalesce ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │  94.69 ms │                              93.93 ms │    no change │
│ QQuery 2     │  20.98 ms │                              22.54 ms │ 1.07x slower │
│ QQuery 3     │  32.02 ms │                              37.29 ms │ 1.16x slower │
│ QQuery 4     │  18.45 ms │                              20.03 ms │ 1.09x slower │
│ QQuery 5     │  48.49 ms │                              54.68 ms │ 1.13x slower │
│ QQuery 6     │  11.94 ms │                              11.63 ms │    no change │
│ QQuery 7     │  85.30 ms │                             100.53 ms │ 1.18x slower │
│ QQuery 8     │  26.25 ms │                              28.29 ms │ 1.08x slower │
│ QQuery 9     │  55.43 ms │                              58.77 ms │ 1.06x slower │
│ QQuery 10    │  39.79 ms │                              43.39 ms │ 1.09x slower │
│ QQuery 11    │  10.97 ms │                              11.70 ms │ 1.07x slower │
│ QQuery 12    │  29.20 ms │                              30.29 ms │    no change │
│ QQuery 13    │  25.57 ms │                              26.68 ms │    no change │
│ QQuery 14    │   9.50 ms │                              10.18 ms │ 1.07x slower │
│ QQuery 15    │  18.37 ms │                              18.72 ms │    no change │
│ QQuery 16    │  17.21 ms │                              18.79 ms │ 1.09x slower │
│ QQuery 17    │  94.62 ms │                              98.44 ms │    no change │
│ QQuery 18    │ 175.22 ms │                             174.39 ms │    no change │
│ QQuery 19    │  23.86 ms │                              25.42 ms │ 1.07x slower │
│ QQuery 20    │  30.88 ms │                              33.14 ms │ 1.07x slower │
│ QQuery 21    │ 140.56 ms │                             157.40 ms │ 1.12x slower │
│ QQuery 22    │  15.25 ms │                              16.53 ms │ 1.08x slower │
└──────────────┴───────────┴───────────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                    │ 1024.57ms │
│ Total Time (revive_to_use_upstream_arrow_coalesce)   │ 1092.74ms │
│ Average Time (HEAD)                                  │   46.57ms │
│ Average Time (revive_to_use_upstream_arrow_coalesce) │   49.67ms │
│ Queries Faster                                       │         0 │
│ Queries Slower                                       │        15 │
│ Queries with No Change                               │         7 │
│ Queries with Failure                                 │         0 │
└──────────────────────────────────────────────────────┴───────────┘

The clickbench result is good, but tpch_mem seems some regression from the benchmark result. 🤔

@alamb
Copy link
Contributor

alamb commented Aug 11, 2025

The clickbench result is good, but tpch_mem seems some regression from the benchmark result. 🤔

Weird, I will rerun and see if we can see it again

@alamb
Copy link
Contributor

alamb commented Aug 11, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing revive_to_use_upstream_arrow_coalesce (8bbadaf) to ab794d2 diff using: tpch_mem
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Aug 11, 2025

🤖: Benchmark completed

Details

Comparing HEAD and revive_to_use_upstream_arrow_coalesce
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ revive_to_use_upstream_arrow_coalesce ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │  95.61 ms │                              92.56 ms │    no change │
│ QQuery 2     │  20.85 ms │                              22.44 ms │ 1.08x slower │
│ QQuery 3     │  31.93 ms │                              36.28 ms │ 1.14x slower │
│ QQuery 4     │  18.15 ms │                              19.48 ms │ 1.07x slower │
│ QQuery 5     │  48.85 ms │                              53.88 ms │ 1.10x slower │
│ QQuery 6     │  12.03 ms │                              11.62 ms │    no change │
│ QQuery 7     │  88.27 ms │                              93.17 ms │ 1.06x slower │
│ QQuery 8     │  24.33 ms │                              24.10 ms │    no change │
│ QQuery 9     │  53.84 ms │                              55.02 ms │    no change │
│ QQuery 10    │  40.33 ms │                              41.29 ms │    no change │
│ QQuery 11    │  11.20 ms │                              11.65 ms │    no change │
│ QQuery 12    │  29.78 ms │                              29.46 ms │    no change │
│ QQuery 13    │  25.89 ms │                              26.05 ms │    no change │
│ QQuery 14    │   9.66 ms │                              10.02 ms │    no change │
│ QQuery 15    │  19.14 ms │                              18.82 ms │    no change │
│ QQuery 16    │  17.42 ms │                              17.74 ms │    no change │
│ QQuery 17    │  96.53 ms │                              95.86 ms │    no change │
│ QQuery 18    │ 179.04 ms │                             179.63 ms │    no change │
│ QQuery 19    │  24.01 ms │                              26.11 ms │ 1.09x slower │
│ QQuery 20    │  31.57 ms │                              33.97 ms │ 1.08x slower │
│ QQuery 21    │ 139.56 ms │                             151.41 ms │ 1.08x slower │
│ QQuery 22    │  13.78 ms │                              14.77 ms │ 1.07x slower │
└──────────────┴───────────┴───────────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                    │ 1031.76ms │
│ Total Time (revive_to_use_upstream_arrow_coalesce)   │ 1065.32ms │
│ Average Time (HEAD)                                  │   46.90ms │
│ Average Time (revive_to_use_upstream_arrow_coalesce) │   48.42ms │
│ Queries Faster                                       │         0 │
│ Queries Slower                                       │         9 │
│ Queries with No Change                               │        13 │
│ Queries with Failure                                 │         0 │
└──────────────────────────────────────────────────────┴───────────┘

@alamb
Copy link
Contributor

alamb commented Aug 11, 2025

🤔 the new kernel seems to slow down. I wonder if the overhead of precisely sized output batches is causing the issue

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Aug 11, 2025

🤔 the new kernel seems to slow down. I wonder if the overhead of precisely sized output batches is causing the issue

Good point @alamb , i agree this is the only difference. I can add a test PR to make upstream do not generate precisely sized output batches, but when we ensure capacity for the increment buffer size, it seems we need to make the size change since we do not keep the same target size for this change.

The latest benchmark seems a little better.

🤖: Benchmark completed

Details

Comparing HEAD and revive_to_use_upstream_arrow_coalesce
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ revive_to_use_upstream_arrow_coalesce ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │  95.61 ms │                              92.56 ms │    no change │
│ QQuery 2     │  20.85 ms │                              22.44 ms │ 1.08x slower │
│ QQuery 3     │  31.93 ms │                              36.28 ms │ 1.14x slower │
│ QQuery 4     │  18.15 ms │                              19.48 ms │ 1.07x slower │
│ QQuery 5     │  48.85 ms │                              53.88 ms │ 1.10x slower │
│ QQuery 6     │  12.03 ms │                              11.62 ms │    no change │
│ QQuery 7     │  88.27 ms │                              93.17 ms │ 1.06x slower │
│ QQuery 8     │  24.33 ms │                              24.10 ms │    no change │
│ QQuery 9     │  53.84 ms │                              55.02 ms │    no change │
│ QQuery 10    │  40.33 ms │                              41.29 ms │    no change │
│ QQuery 11    │  11.20 ms │                              11.65 ms │    no change │
│ QQuery 12    │  29.78 ms │                              29.46 ms │    no change │
│ QQuery 13    │  25.89 ms │                              26.05 ms │    no change │
│ QQuery 14    │   9.66 ms │                              10.02 ms │    no change │
│ QQuery 15    │  19.14 ms │                              18.82 ms │    no change │
│ QQuery 16    │  17.42 ms │                              17.74 ms │    no change │
│ QQuery 17    │  96.53 ms │                              95.86 ms │    no change │
│ QQuery 18    │ 179.04 ms │                             179.63 ms │    no change │
│ QQuery 19    │  24.01 ms │                              26.11 ms │ 1.09x slower │
│ QQuery 20    │  31.57 ms │                              33.97 ms │ 1.08x slower │
│ QQuery 21    │ 139.56 ms │                             151.41 ms │ 1.08x slower │
│ QQuery 22    │  13.78 ms │                              14.77 ms │ 1.07x slower │
└──────────────┴───────────┴───────────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                    │ 1031.76ms │
│ Total Time (revive_to_use_upstream_arrow_coalesce)   │ 1065.32ms │
│ Average Time (HEAD)                                  │   46.90ms │
│ Average Time (revive_to_use_upstream_arrow_coalesce) │   48.42ms │
│ Queries Faster                                       │         0 │
│ Queries Slower                                       │         9 │
│ Queries with No Change                               │        13 │
│ Queries with Failure                                 │         0 │
└──────────────────────────────────────────────────────┴───────────┘

@alamb
Copy link
Contributor

alamb commented Aug 11, 2025

🤔 the new kernel seems to slow down. I wonder if the overhead of precisely sized output batches is causing the issue

Good point @alamb , i agree this is the only difference. I can add a test PR to make upstream do not generate precisely sized output batches, but when we ensure capacity for the increment buffer size, it seems we need to make the size change since we do not keep the same target size for this change.

The latest benchmark seems a little better.

Thanks @zhuqi-lucas -- what I was thinking about was something like the following

let target_batch_size = 4;
let mut coalescer = BatchCoalescer::new(batch1.schema(), 4)
  .with_exact_size(false)

Before we spend a lot of time polishing / testing a PR for that it would probably be good to hack up a POC and verify it actually improves performance

Thank you for your willingness to help along with this project. It is something I have thought was important (but not critical) for a long time and so having someone to help really makes a big difference

@zhuqi-lucas
Copy link
Contributor Author

🤔 the new kernel seems to slow down. I wonder if the overhead of precisely sized output batches is causing the issue

Good point @alamb , i agree this is the only difference. I can add a test PR to make upstream do not generate precisely sized output batches, but when we ensure capacity for the increment buffer size, it seems we need to make the size change since we do not keep the same target size for this change.
The latest benchmark seems a little better.

Thanks @zhuqi-lucas -- what I was thinking about was something like the following

let target_batch_size = 4;
let mut coalescer = BatchCoalescer::new(batch1.schema(), 4)
  .with_exact_size(false)

Before we spend a lot of time polishing / testing a PR for that it would probably be good to hack up a POC and verify it actually improves performance

Thank you for your willingness to help along with this project. It is something I have thought was important (but not critical) for a long time and so having someone to help really makes a big difference

Thank you @alamb for good suggestion! It looks pretty cool to me, and a config for this is very clever idea.

let target_batch_size = 4;
let mut coalescer = BatchCoalescer::new(batch1.schema(), 4)
  .with_exact_size(false)

I will try to address this for upstream first, so we can easily testing it for datafusion.

@zhuqi-lucas
Copy link
Contributor Author

Updated @alamb , i created the PR for non-exact size now:

#17136

@alamb alamb dismissed their stale review August 12, 2025 20:14

still working on performance

@2010YOUY01
Copy link
Contributor

For the tpch_mem slowdown, another possible reason could be unnecessary copies for batches that are exactly batch_size.

For certain operators, there might already be an internal mechanism to ensure their output is exactly batch_size. From a quick look at the implementation, the old version could pass such batches through directly, whereas this PR forces them to be copied.

Another potential improvement: could we make this pass-through threshold more lenient? For example, if the coalescer receives a batch with size >= batch_size / 2, it could pass it through without coalescing. In such cases, the output size is already large enough to benefit from vectorization, so the extra concatenation might not add much value.

@zhuqi-lucas
Copy link
Contributor Author

For the tpch_mem slowdown, another possible reason could be unnecessary copies for batches that are exactly batch_size.

For certain operators, there might already be an internal mechanism to ensure their output is exactly batch_size. From a quick look at the implementation, the old version could pass such batches through directly, whereas this PR forces them to be copied.

Another potential improvement: could we make this pass-through threshold more lenient? For example, if the coalescer receives a batch with size >= batch_size / 2, it could pass it through without coalescing. In such cases, the output size is already large enough to benefit from vectorization, so the extra concatenation might not add much value.

Thank you @2010YOUY01 for review, good suggestion!
It looks like similar to this comments:
#16249 (comment)

I will try to address it, and we can get the new benchmark result!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-plan Changes to the physical-plan crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants