DataFusion 47.0.0 blog post #83
Co-authored-by: Yongting You <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Phillip LeBlanc <[email protected]>
|
Starting to check this out |
alamb
left a comment
Thank you @Omega359 -- this is great. I pushed some wording updates and formatting fixes
Was looking like this
Which now looks like this
Thanks again -- this is super great
|
Thanks. Odd that RustRover rendered it differently but the wording is definitely better :) |
Yeah, the pelicanasf renderer is pretty wonky and non-standard (it also doesn't like markdown tables for some reason 🤷 ) |
kevinjqliu
left a comment
LGTM! one nit
Co-authored-by: Kevin Liu <[email protected]>
|
|
||
| Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104) | ||
|
|
||
| ### Tracing context propagation in spawned tasks |
Thanks for mentioning this change! You can also maybe link to the related datafusion-contrib repo https://github.com/datafusion-contrib/datafusion-tracing which builds upon this, otherwise the description might be a bit too abstract :)
I am adding this now
It makes the post much stronger
| 2 row(s) fetched. | ||
| ``` | ||
|
|
||
| Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR [#14677](https://github.com/apache/datafusion/pull/14677) |
FYI @irenjj
| DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy | ||
| optimizations in this release: | ||
|
|
||
| - **`FIRST_VALUE` and `LAST_VALUE`:** `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality, such as data with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PRs [#15266](https://github.com/apache/datafusion/pull/15266) |
FYI @UBarney
| - **`FIRST_VALUE` and `LAST_VALUE`:** `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality, such as data with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PRs [#15266](https://github.com/apache/datafusion/pull/15266) | ||
| and [#15542](https://github.com/apache/datafusion/pull/15542) by [UBarney](https://github.com/UBarney)). | ||
|
|
||
| - **`MIN`, `MAX` and `AVG` for Durations:** DataFusion executes aggregate queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on `Duration` columns. |
FYI @shruti2522
alamb
left a comment
Thanks again @Omega359 -- let's get this published as we have a backlog of content ready to rock and I don't want to drop too many blogs the same day!
| - **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly skips the evaluation of | ||
| the right operand if the left is known to be false (`AND`) or true (`OR`) in certain cases. For complex predicates, such as those with many `LIKE` or `CASE` expressions, this optimization results in | ||
| [significant performance improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617) (up to 100x in extreme cases). | ||
| (PRs [#15462](https://github.com/apache/datafusion/pull/15462) and [#15694](https://github.com/apache/datafusion/pull/15694) |
FYI @acking-you
| optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. | ||
| (PR [#15563](https://github.com/apache/datafusion/pull/15563) by [geoffreyclaude](https://github.com/geoffreyclaude)). | ||
|
|
||
| - **Disable re-validation of spilled files:** DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive as the data is known to be valid when it was written out |
FYI @zebsme
|
|
||
| Previous versions of DataFusion used `Utf8View` when reading parquet files and it is faster in most cases. | ||
|
|
||
| Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104) |
FYI @zhuqi-lucas
|
The blog is live! https://datafusion.apache.org/blog/2025/07/11/datafusion-47.0.0/ |
First cut at a DF 47 blog post as 47.0.0 (April 2025) datafusion#15072.
Please let me know of anything you wish to add/modify.