Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
0d5866f
DF 45 blog post
Omega359 Feb 20, 2025
7417e4c
Update content/blog/2025-02-20-datafusion-45.0.0.md
Omega359 Feb 21, 2025
3799609
Update content/blog/2025-02-20-datafusion-45.0.0.md
Omega359 Feb 21, 2025
88d7b6b
Set author to PMC.
Omega359 Feb 22, 2025
5a13332
Set author to PMC, incorporated feedback.
Omega359 Feb 22, 2025
b8de014
Update content/blog/2025-02-20-datafusion-45.0.0.md
Omega359 Feb 22, 2025
8a46200
expanded GSOC as it may not be obvious what it is and linked it up.
Omega359 Feb 22, 2025
5460e1a
Grammar fix.
Omega359 Feb 22, 2025
a2e3503
Typo fix
Omega359 Feb 22, 2025
e8e6734
Typo fix
Omega359 Feb 22, 2025
7bb8713
Adding spark functions to looking ahead section
Omega359 Feb 22, 2025
b330523
minor change
Omega359 Feb 22, 2025
3b9b11d
Fixed Jonah Gao's handle.
Omega359 Feb 24, 2025
29af566
Update content/blog/2025-02-20-datafusion-45.0.0.md
alamb Feb 25, 2025
d49a65c
WIP for DF 49 blog post.
Omega359 Jul 1, 2025
0167406
WIP for DF 49 blog post.
Omega359 Jul 1, 2025
6102ade
Update topK dynamic filtering perf section, cleanup the upgrade and c…
Omega359 Jul 1, 2025
cbbad27
Merge remote-tracking branch 'upstream/main' into origin_main
Omega359 Jul 6, 2025
3ece66e
DF 47.0.0 blog post
Omega359 Jul 6, 2025
ef46a35
Remove incomplete and accidentally added DF 49 blog post
Omega359 Jul 6, 2025
9e0f4e1
Fix header.
Omega359 Jul 6, 2025
4f049f4
Grammar fix
Omega359 Jul 6, 2025
9780da0
Minor formatting
Omega359 Jul 6, 2025
cdf50f8
Adding disabling of re-validation of spill files to performance impro…
Omega359 Jul 6, 2025
1478e3d
Merge remote-tracking branch 'apache/main' into Omega359/main
alamb Jul 9, 2025
3859c80
Formatting and wordsmithing
alamb Jul 9, 2025
a149ac8
tweaks
alamb Jul 9, 2025
d433e7d
Update content/blog/2025-07-10-datafusion-47.0.0.md
alamb Jul 9, 2025
af1c645
Fixed link.
Omega359 Jul 9, 2025
88ad7c6
Add datafusion-tracing crate mention and logo, make text more concrete
alamb Jul 11, 2025
5f17467
Claude edits
alamb Jul 11, 2025
44aa48f
Update publishing date
alamb Jul 11, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
272 changes: 272 additions & 0 deletions content/blog/2025-07-11-datafusion-47.0.0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,272 @@
---
layout: post
title: Apache DataFusion 47.0.0 Released
date: 2025-07-11
author: PMC
categories: [ release ]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

<!-- see https://github.com/apache/datafusion/issues/16347 for details -->

We’re excited to announce the release of **Apache DataFusion 47.0.0**! This new version represents a significant
milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the
full [changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md). We’ll highlight the most
important changes below and guide you through upgrading.

Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to
limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us
accelerate the next release and announcements
by [joining the community](https://datafusion.apache.org/contributor-guide/communication.html) 🎣.

## Breaking Changes

DataFusion 47.0.0 brings a few **breaking changes** that may require adjustments to your code as described in
the [Upgrade Guide](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0). Here are some notable ones:

- [Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0](https://github.com/apache/datafusion/pull/15466):
Several APIs changed in the underlying `arrow`, `parquet` and `object_store` libraries to use a `u64` instead of usize to better support
WASM. This requires converting from `usize` to `u64` occasionally as well as changes to ObjectStore implementations such as
```Rust
impl ObjectStore {
...

// The range is now a u64 instead of usize
async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
self.inner.get_range(location, range).await
}

...

// the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
// (this also applies to list_with_offset)
fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
self.inner.list(prefix)
}
}
```
- [DisplayFormatType::TreeRender](https://github.com/apache/datafusion/issues/14914):
Implementations of `ExecutionPlan` must also provide a description in the `DisplayFormatType::TreeRender` format to
provide support for the new [tree style explains](https://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default).
This can be the same as the existing `DisplayFormatType::Default`.

## Performance Improvements

DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy
optimizations in this release:

- **`FIRST_VALUE` and `LAST_VALUE`:** `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality such as those with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PR's [#15266](https://github.com/apache/datafusion/pull/15266)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @UBarney

and [#15542](https://github.com/apache/datafusion/pull/15542) by [UBarney](https://github.com/UBarney)).

- **`MIN`, `MAX` and `AVG` for Durations:** DataFusion executes aggregate queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on `Duration` columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(PRs [#15322]( https://github.com/apache/datafusion/pull/15322) and [#15748](https://github.com/apache/datafusion/pull/15748)
by [shruti2522](https://github.com/shruti2522)).

- **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly skips the evaluation of
the right operand if the left is known to be false (`AND`) or true (`OR`) in certain cases. For complex predicates, such as those with many `LIKE` or `CASE` expressions, this optimization results in
[significant performance improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617) (up to 100x in extreme cases).
(PRs [#15462](https://github.com/apache/datafusion/pull/15462) and [#15694](https://github.com/apache/datafusion/pull/15694)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by [acking-you](https://github.com/acking-you)).

- **TopK optimization for partially sorted input:** Previous versions of DataFusion implemented early termination
optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day.
(PR [#15563](https://github.com/apache/datafusion/pull/15563) by [geoffreyclaude](https://github.com/geoffreyclaude)).

- **Disable re-validation of spilled files:** DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive as the data is known to be valid when it was written out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @zebsme

(PR [#15454](https://github.com/apache/datafusion/pull/15454) by [zebsme](https://github.com/zebsme)).

## Highlighted New Features

### Tree style explains

In previous releases the [EXPLAIN statement] results in a formatted table
which is succinct and contains important details for implementers, but was often hard to read
especially with queries that included joins or unions having multiple children.

[EXPLAIN statement]: https://datafusion.apache.org/user-guide/sql/explain.html

DataFusion 47.0.0 includes the new `EXPLAIN FORMAT TREE` (default in
`datafusion-cli`) rendered in a visual tree style that is much easier to quickly
understand.

<!-- SQL setup
create table t1(ti int) as values (1), (2), (3);
create table t2(ti int) as values (1), (2), (3);
-->

Example of the new explain output:
```sql
> explain select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+------------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ CoalesceBatchesExec │ |
| | │ -------------------- │ |
| | │ target_batch_size: │ |
| | │ 8192 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ HashJoinExec │ |
| | │ -------------------- ├──────────────┐ |
| | │ on: (ti = ti) │ │ |
| | └─────────────┬─────────────┘ │ |
| | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
| | │ DataSourceExec ││ DataSourceExec │ |
| | │ -------------------- ││ -------------------- │ |
| | │ bytes: 112 ││ bytes: 112 │ |
| | │ format: memory ││ format: memory │ |
| | │ rows: 1 ││ rows: 1 │ |
| | └───────────────────────────┘└───────────────────────────┘ |
| | |
+---------------+------------------------------------------------------------+
```

Example of the `EXPLAIN FORMAT INDENT` output for the same query
```sql
> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+----------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------+
| logical_plan | Inner Join: t1.ti = t2.ti |
| | TableScan: t1 projection=[ti] |
| | TableScan: t2 projection=[ti] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+----------------------------------------------------------------------+
2 row(s) fetched.
```

Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR [#14677](https://github.com/apache/datafusion/pull/14677)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @irenjj

and many others for completing the [followup epic](https://github.com/apache/datafusion/issues/14914)

### SQL `VARCHAR` defaults to Utf8View

In previous releases when a column was created in SQL the column would be mapped to the [Utf8 Arrow data type]. In this release
the SQL `varchar` columns will be mapped to the [Utf8View arrow data type] by default, which is a more efficient representation of UTF-8 strings in Arrow.

[Utf8 Arrow data type]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8
[Utf8View arrow data type]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8View

```sql
create table foo(x varchar);
0 row(s) fetched.

> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x | Utf8View | YES |
+-------------+-----------+-------------+
```

Previous versions of DataFusion used `Utf8View` when reading parquet files and it is faster in most cases.

Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


### Context propagation in spawned tasks (for tracing, logging, etc.)

This release introduces an API for propagating user-defined context (such as tracing spans,
logging, or metrics) across thread boundaries without depending on any specific instrumentation library.
You can use the [JoinSetTracer] API to instrument DataFusion plans with your own tracing or logging libraries, or
use pre-integrated community crates such as the [datafusion-tracing] crate.

<div style="text-align: center;">
<a href="https://github.com/datafusion-contrib/datafusion-tracing">
<img
src="/blog/images/datafusion-47.0.0/datafusion-telemetry.png"
width="50%"
class="img-responsive"
alt="DataFusion telemetry project logo"
/>
</a>
</div>


[datafusion-tracing]: https://github.com/datafusion-contrib/datafusion-tracing

Previously, tasks spawned on new threads — such as those performing
repartitioning or Parquet file reads — could lose thread-local context, which is
often used in instrumentation libraries. A full example of how to use this new
API is available in the [DataFusion examples], and a simple example is shown below.


[JoinSetTracer]: https://docs.rs/datafusion/latest/datafusion/common/runtime/trait.JoinSetTracer.html
[DataFusion examples]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/tracing.rs

```Rust
/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state
/// for the current span and must be called at the start of each new task or thread.
struct SpanTracer;

/// Implements the `JoinSetTracer` trait so we can inject instrumentation
/// for both async futures and blocking closures.
impl JoinSetTracer for SpanTracer {
/// Instruments a boxed future to run in the current span. The future's
/// return type is erased to `Box<dyn Any + Send>`, which we simply
/// run inside the `Span::current()` context.
fn trace_future(
&self,
fut: BoxFuture<'static, Box<dyn Any + Send>>,
) -> BoxFuture<'static, Box<dyn Any + Send>> {
// Ensures any thread-local context is set in this future
fut.in_current_span().boxed()
}

/// Instruments a boxed blocking closure by running it inside the
/// `Span::current()` context.
fn trace_block(
&self,
f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>,
) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> {
let span = Span::current();
// Ensures any thread-local context is set for this closure
Box::new(move || span.in_scope(f))
}
}

...
set_join_set_tracer(&SpanTracer).expect("Failed to set tracer");
...
```

Thanks to [geoffreyclaude](https://github.com/geoffreyclaude) for PR [#14914](https://github.com/apache/datafusion/issues/14914)

## Upgrade Guide and Changelog

Upgrading to 47.0.0 should be straightforward for most users, but do review
the [Upgrade Guide for DataFusion 47.0.0](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0) for detailed
steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the
transition. For a comprehensive list of all changes, please refer to the [changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md) for 47.0.0. The changelog
enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.

## Get Involved

Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to
take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions.
You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our
community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to
contribute, we’d love to hear from you. A list of open issues suitable for beginners
is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) and you
can find how to reach us on the [communication doc](https://datafusion.apache.org/contributor-guide/communication.html).

Happy querying!
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.