DataFusion 47.0.0 blog post #83

Merged

alamb merged 32 commits into apache:main from Omega359:main

Jul 11, 2025

Contributor

Omega359 commented Jul 6, 2025 •

edited by alamb

Part of Blog post for the DataFusion 47, 48, and 49 releases datafusion#16347

First cut at a DF 47 blog post, as mentioned in Release DataFusion 47.0.0 (April 2025) datafusion#15072

Please let me know of anything you wish to add/modify

Omega359 and others added 24 commits

February 20, 2025 12:35


          DF 45 blog post

0d5866f


          Update content/blog/2025-02-20-datafusion-45.0.0.md

7417e4c

Co-authored-by: Yongting You <[email protected]>


          Update content/blog/2025-02-20-datafusion-45.0.0.md

Co-authored-by: Yongting You <[email protected]>


          Set author to PMC.

88d7b6b


          Set author to PMC, incorporated feedback.

5a13332


          Update content/blog/2025-02-20-datafusion-45.0.0.md

b8de014

Co-authored-by: Andrew Lamb <[email protected]>


          expanded GSOC as it may not be obvious what it is and linked it up.

8a46200


          Grammar fix.

5460e1a


          Typo fix

a2e3503


          Typo fix

e8e6734


          Adding spark functions to looking ahead section

7bb8713


          minor change

b330523


          Fixed Jonah Gao's handle.

3b9b11d


          Update content/blog/2025-02-20-datafusion-45.0.0.md

29af566

Co-authored-by: Phillip LeBlanc <[email protected]>


          WIP for DF 49 blog post.

d49a65c


          WIP for DF 49 blog post.


          Update topK dynamic filtering perf section, cleanup the upgrade and changelog section

6102ade


          Merge remote-tracking branch 'upstream/main' into origin_main

cbbad27


          DF 47.0.0 blog post

3ece66e


          Remove incomplete and accidentally added DF 49 blog post

ef46a35


          Fix header.

9e0f4e1


          Grammar fix

4f049f4


          Minor formatting

9780da0


          Adding disabling of re-validation of spill files to performance improvements

cdf50f8

Omega359 mentioned this pull request

Blog post for the DataFusion 47, 48, and 49 releases apache/datafusion#16347

Closed

Contributor

alamb commented Jul 9, 2025

Starting to check this out

alamb added 3 commits

July 9, 2025 07:18


          Merge remote-tracking branch 'apache/main' into Omega359/main

1478e3d


          Formatting and wordsmithing

3859c80


          tweaks

a149ac8

alamb approved these changes

Contributor

alamb left a comment

Thank you @Omega359 -- this is great. I pushed some wording updates and formatting fixes

Was looking like this

Screenshot 2025-07-09 at 7 26 25 AM

Which now looks like this

Screenshot 2025-07-09 at 8 17 51 AM

Thanks again -- this is super great

Contributor Author

Omega359 commented Jul 9, 2025

Thanks. Odd that RustRover rendered it differently but the wording is definitely better :)

Contributor

alamb commented Jul 9, 2025

> Thanks. Odd that RustRover rendered it differently but the wording is definitely better :)

Yeah, the pelicanasf renderer is pretty wonky and nonstandard (it also doesn't like markdown tables for some reason 🤷 )

kevinjqliu approved these changes

Contributor

kevinjqliu left a comment

LGTM! one nit

content/blog/2025-07-10-datafusion-47.0.0.md Outdated

alamb and others added 2 commits

July 9, 2025 16:18


          Update content/blog/2025-07-10-datafusion-47.0.0.md

d433e7d

Co-authored-by: Kevin Liu <[email protected]>


          Fixed link.

af1c645

geoffreyclaude reviewed

content/blog/2025-07-10-datafusion-47.0.0.md Outdated


		Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104)

		### Tracing context propagation in spawned tasks

geoffreyclaude Jul 10, 2025 •

edited

Thanks for mentioning this change! You can also maybe link to the related datafusion-contrib repo https://github.com/datafusion-contrib/datafusion-tracing which builds upon this, otherwise the description might be a bit too abstract :)

Contributor

alamb Jul 11, 2025

I am adding this now

Contributor

alamb Jul 11, 2025

It makes the post much stronger

alamb added 3 commits

July 11, 2025 06:55


          Add datafusion-tracing crate mention and logo, make text more concrete

88ad7c6


          Claude edits

5f17467


          Update publishing date

44aa48f

alamb reviewed

content/blog/2025-07-11-datafusion-47.0.0.md

+row(s) fetched.
+              ```
+              Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR [#14677](https://github.com/apache/datafusion/pull/14677)

Contributor

alamb Jul 11, 2025

alamb reviewed

content/blog/2025-07-11-datafusion-47.0.0.md

+              DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy
+              optimizations in this release:
+              - **`FIRST_VALUE` and `LAST_VALUE`:**  `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality such as those with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PR's [#15266](https://github.com/apache/datafusion/pull/15266)

Contributor

alamb Jul 11, 2025

alamb reviewed

content/blog/2025-07-11-datafusion-47.0.0.md

+              - **`FIRST_VALUE` and `LAST_VALUE`:**  `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality such as those with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PR's [#15266](https://github.com/apache/datafusion/pull/15266)
+                and [#15542](https://github.com/apache/datafusion/pull/15542) by [UBarney](https://github.com/UBarney)).
+              - **`MIN`, `MAX` and `AVG` for Durations:**  DataFusion executes aggregate queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on `Duration` columns.

Contributor

alamb Jul 11, 2025

FYI @shruti2522

alamb approved these changes

Contributor

alamb left a comment

Thanks again @Omega359 -- let's get this published as we have a backlog of content ready to rock and I don't want to drop too many blogs the same day!

content/blog/2025-07-11-datafusion-47.0.0.md

+              - **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly skips the evaluation of
+                the right operand if the left is known to be false (`AND`) or true (`OR`) in certain cases. For complex predicates, such as those with many `LIKE` or `CASE` expressions, this optimization results in
+                [significant performance improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617) (up to 100x in extreme cases).
+                (PRs [#15462](https://github.com/apache/datafusion/pull/15462) and [#15694](https://github.com/apache/datafusion/pull/15694)
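
The short-circuit behavior described in the quoted bullet can be sketched roughly as follows. This is a minimal Python illustration and not DataFusion's Rust implementation: `eval_and`, `eval_or`, the row-dictionary batch, and the `calls` counter are all hypothetical names, and SQL NULL semantics are ignored for simplicity.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: a "batch" is a list of row dicts, and a predicate
# maps one row to a bool. A real engine works on columnar arrays instead.
Row = Dict[str, object]
Predicate = Callable[[Row], bool]


def eval_and(left: Predicate, right: Predicate, batch: List[Row]) -> List[bool]:
    """Evaluate `left AND right` over a batch, skipping `right` when possible."""
    lhs = [left(row) for row in batch]
    # If the left operand is false for every row, the AND is false everywhere,
    # so the (possibly expensive) right operand is never evaluated.
    if not any(lhs):
        return lhs
    rhs = [right(row) for row in batch]
    return [l and r for l, r in zip(lhs, rhs)]


def eval_or(left: Predicate, right: Predicate, batch: List[Row]) -> List[bool]:
    """Evaluate `left OR right` over a batch, skipping `right` when possible."""
    lhs = [left(row) for row in batch]
    # If the left operand is true for every row, the OR is true everywhere.
    if all(lhs):
        return lhs
    rhs = [right(row) for row in batch]
    return [l or r for l, r in zip(lhs, rhs)]


calls = {"right": 0}  # counts how often the expensive predicate runs


def cheap(row: Row) -> bool:
    return row["id"] < 0


def expensive(row: Row) -> bool:
    calls["right"] += 1
    return "fusion" in row["name"]  # stands in for a costly LIKE or CASE


batch = [{"id": 1, "name": "datafusion"}, {"id": 2, "name": "arrow"}]
result = eval_and(cheap, expensive, batch)
```

In this sketch `expensive` is never called, because `cheap` is false for every row in the batch. The saving comes from skipping the right operand for a whole batch at once, which is presumably why the quoted text says the optimization applies "in certain cases".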

Contributor

alamb Jul 11, 2025

FYI @acking-you

content/blog/2025-07-11-datafusion-47.0.0.md

+                optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day.
+                (PR [#15563](https://github.com/apache/datafusion/pull/15563) by [geoffreyclaude](https://github.com/geoffreyclaude)).
+              - **Disable re-validation of spilled files:** DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive as the data is known to be valid when it was written out
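
The TopK extension to partially sorted data mentioned in this hunk can be illustrated with a small sketch. It assumes input rows of `(day, ts)` that are sorted by `day` only, and a query asking for the `k` smallest rows by `(day, ts)`. The function name `top_k_prefix_sorted` and the data are hypothetical, and this is one plausible early-exit strategy rather than the code from PR #15563.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k_prefix_sorted(
    rows: Iterable[Tuple[int, int]], k: int
) -> Tuple[List[Tuple[int, int]], int]:
    """Return the k smallest (day, ts) rows and how many input rows were read.

    Assumes `rows` is sorted by `day` (the sort-key prefix) but not by `ts`.
    """
    if k <= 0:
        return [], 0
    # heapq is a min-heap, so store negated keys: heap[0] is then the
    # largest (worst) row currently kept among the best k.
    heap: List[Tuple[int, int]] = []
    read = 0
    for day, ts in rows:
        if len(heap) == k:
            worst_day = -heap[0][0]
            # Input is sorted by day, so once the incoming day exceeds the
            # worst kept day, no later row can enter the top k: stop early.
            if day > worst_day:
                break
        read += 1
        item = (-day, -ts)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:  # strictly better than the current worst
            heapq.heapreplace(heap, item)
    return sorted((-d, -t) for d, t in heap), read


# Sorted by day, unsorted within each day.
rows = [(1, 5), (1, 2), (1, 9), (2, 1), (2, 0), (3, 7)]
top, rows_read = top_k_prefix_sorted(rows, 2)
```

Here the scan stops at the first row of day 2, after reading only the three day-1 rows, because both kept rows already belong to day 1.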

Contributor

alamb Jul 11, 2025

content/blog/2025-07-11-datafusion-47.0.0.md


		Previous versions of DataFusion used `Utf8View` when reading parquet files and it is faster in most cases.

		Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104)

Contributor

alamb Jul 11, 2025

FYI @zhuqi-lucas

alamb merged commit 0ba1f82 into apache:main

1 check passed

Contributor

alamb commented Jul 11, 2025

The blog is live! https://datafusion.apache.org/blog/2025/07/11/datafusion-47.0.0/

This was referenced Jul 12, 2025

Blog post for the DataFusion 48 release apache/datafusion#16757

Closed

WIP Blog post for Datafusion 47.0.0 #70

Closed
