Omega359 commented Jul 6, 2025

First cut at a DF 47 blog post as

Please let me know of anything you wish to add/modify

alamb commented Jul 9, 2025

Starting to check this out

alamb left a comment

Thank you @Omega359 -- this is great. I pushed some wording updates and formatting fixes

Was looking like this

[Image: Screenshot 2025-07-09 at 7 26 25 AM]

Which now looks like this

[Image: Screenshot 2025-07-09 at 8 17 51 AM]

Thanks again -- this is super great

Omega359 commented Jul 9, 2025

Thanks. Odd that RustRover rendered it differently but the wording is definitely better :)

alamb commented Jul 9, 2025

> Thanks. Odd that RustRover rendered it differently but the wording is definitely better :)

Yeah, the pelicanasf renderer is pretty wonky and non-standard (it also doesn't like markdown tables for some reason 🤷)

kevinjqliu left a comment

LGTM! one nit


Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104)

### Tracing context propagation in spawned tasks

geoffreyclaude commented Jul 10, 2025

Thanks for mentioning this change! You can also maybe link to the related datafusion-contrib repo https://github.com/datafusion-contrib/datafusion-tracing which builds upon this, otherwise the description might be a bit too abstract :)

I am adding this now

It makes the post much stronger

2 row(s) fetched.
```

Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR [#14677](https://github.com/apache/datafusion/pull/14677)

FYI @irenjj

DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy
optimizations in this release:

- **`FIRST_VALUE` and `LAST_VALUE`:** `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality, such as data with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PRs [#15266](https://github.com/apache/datafusion/pull/15266)

FYI @UBarney

- **`FIRST_VALUE` and `LAST_VALUE`:** `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality, such as data with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PRs [#15266](https://github.com/apache/datafusion/pull/15266)
and [#15542](https://github.com/apache/datafusion/pull/15542) by [UBarney](https://github.com/UBarney)).

- **`MIN`, `MAX` and `AVG` for Durations:** DataFusion executes aggregate queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on `Duration` columns.

alamb left a comment

Thanks again @Omega359 -- let's get this published as we have a backlog of content ready to rock and I don't want to drop too many blogs the same day!

- **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly skips the evaluation of
the right operand if the left is known to be false (`AND`) or true (`OR`) in certain cases. For complex predicates, such as those with many `LIKE` or `CASE` expressions, this optimization results in
[significant performance improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617) (up to 100x in extreme cases).
(PRs [#15462](https://github.com/apache/datafusion/pull/15462) and [#15694](https://github.com/apache/datafusion/pull/15694)
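
To make the short-circuit bullet above a bit more concrete, here is a minimal Rust sketch of the idea. It is not DataFusion's actual implementation: the function names `and_short_circuit` and `or_short_circuit` are hypothetical, operands are plain `bool` slices rather than Arrow arrays, and SQL `NULL` handling is ignored. The right operand is passed as a closure so the sketch can show when it is skipped.

```rust
/// Hypothetical helper, not part of DataFusion: evaluates `left AND right`
/// over one batch of rows. `right` is a closure standing in for an expensive
/// operand such as a `LIKE` or `CASE` expression.
/// Returns the per-row result and whether `right` was actually evaluated.
fn and_short_circuit<F>(left: &[bool], right: F) -> (Vec<bool>, bool)
where
    F: FnOnce() -> Vec<bool>,
{
    // If every left value is false, the AND is false for every row,
    // so the right operand never needs to run.
    if left.iter().all(|&l| !l) {
        return (vec![false; left.len()], false);
    }
    let right_values = right();
    assert_eq!(left.len(), right_values.len(), "operands must have equal length");
    let result: Vec<bool> = left
        .iter()
        .zip(&right_values)
        .map(|(&l, &r)| l && r)
        .collect();
    (result, true)
}

/// Same idea for `left OR right`: skip `right` when every left value is true.
fn or_short_circuit<F>(left: &[bool], right: F) -> (Vec<bool>, bool)
where
    F: FnOnce() -> Vec<bool>,
{
    if left.iter().all(|&l| l) {
        return (vec![true; left.len()], false);
    }
    let right_values = right();
    assert_eq!(left.len(), right_values.len(), "operands must have equal length");
    let result: Vec<bool> = left
        .iter()
        .zip(&right_values)
        .map(|(&l, &r)| l || r)
        .collect();
    (result, true)
}
```

In this sketch, a call such as `and_short_circuit(&[false, false], || expensive())` returns without ever invoking the closure, which is where the savings for complex predicates would come from.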

optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day.
(PR [#15563](https://github.com/apache/datafusion/pull/15563) by [geoffreyclaude](https://github.com/geoffreyclaude)).

- **Disable re-validation of spilled files:** DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive as the data is known to be valid when it was written out

FYI @zebsme


Previous versions of DataFusion used `Utf8View` when reading parquet files and it is faster in most cases.

Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104)

alamb merged commit 0ba1f82 into apache:main Jul 11, 2025
1 check passed

alamb commented Jul 11, 2025

The blog is live! https://datafusion.apache.org/blog/2025/07/11/datafusion-47.0.0/
