Commit 0ba1f82
Omega359, 2010YOUY01, alamb, phillipleblanc, and kevinjqliu authored
DataFusion 47.0.0 blog post (#83)
* DF 45 blog post
* Update content/blog/2025-02-20-datafusion-45.0.0.md
  Co-authored-by: Yongting You
* Update content/blog/2025-02-20-datafusion-45.0.0.md
  Co-authored-by: Yongting You
* Set author to PMC.
* Set author to PMC, incorporated feedback.
* Update content/blog/2025-02-20-datafusion-45.0.0.md
  Co-authored-by: Andrew Lamb
* expanded GSOC as it may not be obvious what it is and linked it up.
* Grammar fix.
* Typo fix
* Typo fix
* Adding spark functions to looking ahead section
* minor change
* Fixed Jonah Gao's handle.
* Update content/blog/2025-02-20-datafusion-45.0.0.md
  Co-authored-by: Phillip LeBlanc
* WIP for DF 49 blog post.
* WIP for DF 49 blog post.
* Update topK dynamic filtering perf section, cleanup the upgrade and changelog section
* DF 47.0.0 blog post
* Remove incomplete and accidentally added DF 49 blog post
* Fix header.
* Grammar fix
* Minor formatting
* Adding disabling of re-validation of spill files to performance improvements
* Formatting and wordsmithing
* tweaks
* Update content/blog/2025-07-10-datafusion-47.0.0.md
  Co-authored-by: Kevin Liu
* Fixed link.
* Add datafusion-tracing crate mention and logo, make text more concrete
* Claude edits
* Update publishing date

---------

Co-authored-by: Yongting You
Co-authored-by: Andrew Lamb
Co-authored-by: Phillip LeBlanc
Co-authored-by: Kevin Liu
1 parent 4c7a5c5 commit 0ba1f82

2 files changed: +272 -0 lines changed
@@ -0,0 +1,272 @@
---
layout: post
title: Apache DataFusion 47.0.0 Released
date: 2025-07-11
author: PMC
categories: [ release ]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

<!-- see https://github.com/apache/datafusion/issues/16347 for details -->

We’re excited to announce the release of **Apache DataFusion 47.0.0**! This new version represents a significant
milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the
full [changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md). We’ll highlight the most
important changes below and guide you through upgrading.

Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to
limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us
accelerate the next release and announcements
by [joining the community](https://datafusion.apache.org/contributor-guide/communication.html) 🎣.

## Breaking Changes

DataFusion 47.0.0 brings a few **breaking changes** that may require adjustments to your code as described in
the [Upgrade Guide](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0). Here are some notable ones:

- [Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0](https://github.com/apache/datafusion/pull/15466):
Several APIs changed in the underlying `arrow`, `parquet` and `object_store` libraries to use `u64` instead of `usize` to better support
WASM. This requires converting from `usize` to `u64` occasionally, as well as changes to `ObjectStore` implementations such as:
```Rust
impl ObjectStore {
    ...

    // The range is now a u64 instead of usize
    async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
        self.inner.get_range(location, range).await
    }

    ...

    // the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
    // (this also applies to list_with_offset)
    fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
        self.inner.list(prefix)
    }
}
```
- [DisplayFormatType::TreeRender](https://github.com/apache/datafusion/issues/14914):
Implementations of `ExecutionPlan` must also provide a description in the `DisplayFormatType::TreeRender` format to
provide support for the new [tree style explains](https://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default).
This can be the same as the existing `DisplayFormatType::Default`.
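
The occasional `usize` to `u64` conversion mentioned in the first item can be sketched with the standard library alone. This is a minimal illustration, not code from DataFusion or `object_store`; the helper names `to_u64_range` and `to_usize_range` are hypothetical.

```rust
use std::ops::Range;

/// Hypothetical helper: widen an in-memory `usize` range to the `u64` range
/// that range-based APIs such as `get_range` now take.
fn to_u64_range(range: Range<usize>) -> Range<u64> {
    // `usize` is at most 64 bits wide on current Rust targets, so this does not truncate.
    range.start as u64..range.end as u64
}

/// Hypothetical helper: narrow a `u64` range back to `usize`, for example to
/// slice an in-memory buffer. Returns `None` when a bound does not fit, which
/// can happen on 32-bit targets such as WASM.
fn to_usize_range(range: Range<u64>) -> Option<Range<usize>> {
    let start = usize::try_from(range.start).ok()?;
    let end = usize::try_from(range.end).ok()?;
    Some(start..end)
}
```

Widening is a plain cast, while narrowing is checked: with `data = b"hello datafusion"`, `to_usize_range(to_u64_range(6..16))` yields a range that slices out `b"datafusion"`.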

## Performance Improvements

DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy
optimizations in this release:

- **`FIRST_VALUE` and `LAST_VALUE`:** `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality, such as data with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PRs [#15266](https://github.com/apache/datafusion/pull/15266)
and [#15542](https://github.com/apache/datafusion/pull/15542) by [UBarney](https://github.com/UBarney)).

- **`MIN`, `MAX` and `AVG` for Durations:** DataFusion executes aggregate queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on `Duration` columns.
(PRs [#15322](https://github.com/apache/datafusion/pull/15322) and [#15748](https://github.com/apache/datafusion/pull/15748)
by [shruti2522](https://github.com/shruti2522)).

- **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly skips the evaluation of
the right operand if the left is known to be false (`AND`) or true (`OR`) in certain cases. For complex predicates, such as those with many `LIKE` or `CASE` expressions, this optimization results in
[significant performance improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617) (up to 100x in extreme cases).
(PRs [#15462](https://github.com/apache/datafusion/pull/15462) and [#15694](https://github.com/apache/datafusion/pull/15694)
by [acking-you](https://github.com/acking-you)).

- **TopK optimization for partially sorted input:** Previous versions of DataFusion implemented early termination
optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization to partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day.
(PR [#15563](https://github.com/apache/datafusion/pull/15563) by [geoffreyclaude](https://github.com/geoffreyclaude)).

- **Disable re-validation of spilled files:** DataFusion no longer re-validates temporary spill files. The validation is unnecessary and expensive because the data is known to be valid when it is written out.
(PR [#15454](https://github.com/apache/datafusion/pull/15454) by [zebsme](https://github.com/zebsme)).
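
The short circuit idea for `AND` can be sketched on plain boolean vectors. This is an illustrative, standard-library-only sketch and not DataFusion's actual implementation; the function name `and_short_circuit` and its closure parameter are hypothetical, and nulls are ignored for simplicity.

```rust
/// Evaluates `left AND right` element-wise, where the right operand is
/// expensive to compute and is therefore passed as a closure.
/// If every value in `left` is false, the result is already known to be all
/// false, so the right operand is never evaluated.
fn and_short_circuit<F>(left: &[bool], eval_right: F) -> Vec<bool>
where
    F: FnOnce() -> Vec<bool>,
{
    if left.iter().all(|v| !*v) {
        // Short circuit: skip the right operand entirely.
        return vec![false; left.len()];
    }
    // Otherwise pay for the right operand and combine the two sides.
    let right = eval_right();
    assert_eq!(left.len(), right.len(), "operands must have equal length");
    left.iter().zip(right.iter()).map(|(l, r)| *l && *r).collect()
}
```

The `OR` case is symmetric: when every left value is true, the result is all true and the right side can be skipped.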

## Highlighted New Features

### Tree style explains

In previous releases, the [EXPLAIN statement] produced a formatted table
that is succinct and contains important details for implementers, but was often hard to read,
especially for queries with joins or unions that have multiple children.

[EXPLAIN statement]: https://datafusion.apache.org/user-guide/sql/explain.html

DataFusion 47.0.0 includes the new `EXPLAIN FORMAT TREE` (default in
`datafusion-cli`) rendered in a visual tree style that is much easier to quickly
understand.

<!-- SQL setup
create table t1(ti int) as values (1), (2), (3);
create table t2(ti int) as values (1), (2), (3);
-->

Example of the new explain output:
```sql
> explain select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| physical_plan | ┌───────────────────────────┐                              |
|               | │    CoalesceBatchesExec    │                              |
|               | │    --------------------   │                              |
|               | │     target_batch_size:    │                              |
|               | │            8192           │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │        HashJoinExec       │                              |
|               | │    --------------------   ├──────────────┐               |
|               | │       on: (ti = ti)       │              │               |
|               | └─────────────┬─────────────┘              │               |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      ││       DataSourceExec      │ |
|               | │    --------------------   ││    --------------------   │ |
|               | │         bytes: 112        ││         bytes: 112        │ |
|               | │       format: memory      ││       format: memory      │ |
|               | │          rows: 1          ││          rows: 1          │ |
|               | └───────────────────────────┘└───────────────────────────┘ |
|               |                                                            |
+---------------+------------------------------------------------------------+
```

Example of the `EXPLAIN FORMAT INDENT` output for the same query:
```sql
> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Inner Join: t1.ti = t2.ti                                            |
|               |   TableScan: t1 projection=[ti]                                      |
|               |   TableScan: t2 projection=[ti]                                      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                          |
|               |   HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+
2 row(s) fetched.
```

Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR [#14677](https://github.com/apache/datafusion/pull/14677)
and many others for completing the [followup epic](https://github.com/apache/datafusion/issues/14914).

### SQL `VARCHAR` defaults to Utf8View

In previous releases, when a `varchar` column was created in SQL, it was mapped to the [Utf8 Arrow data type]. In this release,
SQL `varchar` columns are mapped to the [Utf8View arrow data type] by default, which is a more efficient representation of UTF-8 strings in Arrow.

[Utf8 Arrow data type]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8
[Utf8View arrow data type]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8View

```sql
> create table foo(x varchar);
0 row(s) fetched.

> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8View  | YES         |
+-------------+-----------+-------------+
```

Previous versions of DataFusion already used `Utf8View` when reading Parquet files, and it is faster in most cases.
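
To give a feel for why the view representation can be faster, here is a simplified, standard-library-only model of the 16-byte view described in the Arrow columnar format (short strings stored inline, longer strings stored as a length, a 4-byte prefix, and a location in a data buffer). The type and function names (`StringView`, `make_view`, `definitely_different`) are hypothetical and are not the arrow-rs API; real arrays also handle nulls and multiple buffers.

```rust
/// A simplified model of one 16-byte Arrow string view.
#[derive(Debug, PartialEq)]
enum StringView {
    /// Strings of at most 12 bytes live entirely inside the view.
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix plus a location in a data buffer.
    Reference { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Builds a view for `s`, appending long strings to `buffer`
/// (treated here as data buffer 0).
fn make_view(s: &str, buffer: &mut Vec<u8>) -> StringView {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= 12 {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        StringView::Inline { len, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        StringView::Reference { len, prefix, buffer_index: 0, offset }
    }
}

/// Returns the length and first four bytes, both readable from the view alone.
fn len_and_prefix(view: &StringView) -> (u32, [u8; 4]) {
    match view {
        StringView::Inline { len, data } => (*len, [data[0], data[1], data[2], data[3]]),
        StringView::Reference { len, prefix, .. } => (*len, *prefix),
    }
}

/// Many unequal strings can be rejected without touching the data buffer.
fn definitely_different(a: &StringView, b: &StringView) -> bool {
    len_and_prefix(a) != len_and_prefix(b)
}
```

In this model, comparisons and filters can often be decided from the fixed-size views, and short strings never require a second memory access.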

Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104).

### Context propagation in spawned tasks (for tracing, logging, etc.)

This release introduces an API for propagating user-defined context (such as tracing spans,
logging, or metrics) across thread boundaries without depending on any specific instrumentation library.
You can use the [JoinSetTracer] API to instrument DataFusion plans with your own tracing or logging libraries, or
use pre-integrated community crates such as the [datafusion-tracing] crate.

<div style="text-align: center;">
<a href="https://github.com/datafusion-contrib/datafusion-tracing">
<img
  src="/blog/images/datafusion-47.0.0/datafusion-telemetry.png"
  width="50%"
  class="img-responsive"
  alt="DataFusion telemetry project logo"
/>
</a>
</div>


[datafusion-tracing]: https://github.com/datafusion-contrib/datafusion-tracing

Previously, tasks spawned on new threads (such as those performing
repartitioning or Parquet file reads) could lose thread-local context, which is
often used in instrumentation libraries. A full example of how to use this new
API is available in the [DataFusion examples], and a simple example is shown below.


[JoinSetTracer]: https://docs.rs/datafusion/latest/datafusion/common/runtime/trait.JoinSetTracer.html
[DataFusion examples]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/tracing.rs

```Rust
use std::any::Any;

use datafusion::common::runtime::{set_join_set_tracer, JoinSetTracer};
use futures::{future::BoxFuture, FutureExt};
use tracing::{Instrument, Span};

/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state
/// for the current span and must be called at the start of each new task or thread.
struct SpanTracer;

/// Implements the `JoinSetTracer` trait so we can inject instrumentation
/// for both async futures and blocking closures.
impl JoinSetTracer for SpanTracer {
    /// Instruments a boxed future to run in the current span. The future's
    /// return type is erased to `Box<dyn Any + Send>`, which we simply
    /// run inside the `Span::current()` context.
    fn trace_future(
        &self,
        fut: BoxFuture<'static, Box<dyn Any + Send>>,
    ) -> BoxFuture<'static, Box<dyn Any + Send>> {
        // Ensures any thread-local context is set in this future
        fut.in_current_span().boxed()
    }

    /// Instruments a boxed blocking closure by running it inside the
    /// `Span::current()` context.
    fn trace_block(
        &self,
        f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>,
    ) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> {
        let span = Span::current();
        // Ensures any thread-local context is set for this closure
        Box::new(move || span.in_scope(f))
    }
}

...
set_join_set_tracer(&SpanTracer).expect("Failed to set tracer");
...
```
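
The underlying problem, thread-local context being lost when work moves to another thread, can be reproduced with the standard library alone. The sketch below is only an illustration of the idea and does not use DataFusion or `tracing`; `CURRENT_SPAN`, `spawn_plain`, and `spawn_traced` are hypothetical names standing in for an instrumentation library's per-thread state and for a tracer-aware spawn.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    /// Stand-in for per-thread instrumentation state, such as the current span.
    static CURRENT_SPAN: RefCell<Option<String>> = RefCell::new(None);
}

/// Reads the span recorded for the calling thread, if any.
fn current_span() -> Option<String> {
    CURRENT_SPAN.with(|s| s.borrow().clone())
}

/// Records a span for the calling thread only.
fn set_span(name: Option<String>) {
    CURRENT_SPAN.with(|s| *s.borrow_mut() = name);
}

/// Runs `f` on a new thread without propagating context:
/// the new thread starts with empty thread-local state.
fn spawn_plain<T: Send + 'static>(f: impl FnOnce() -> T + Send + 'static) -> T {
    thread::spawn(f).join().expect("thread panicked")
}

/// Captures the caller's span and reinstalls it on the new thread before
/// running `f`, which is the kind of wrapping a tracer hook makes possible.
fn spawn_traced<T: Send + 'static>(f: impl FnOnce() -> T + Send + 'static) -> T {
    let span = current_span();
    thread::spawn(move || {
        set_span(span);
        f()
    })
    .join()
    .expect("thread panicked")
}
```

After `set_span(Some("query-1".to_string()))` on the calling thread, `spawn_plain(current_span)` sees no span, while `spawn_traced(current_span)` sees `query-1`.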

Thanks to [geoffreyclaude](https://github.com/geoffreyclaude) for PR [#14914](https://github.com/apache/datafusion/issues/14914).

## Upgrade Guide and Changelog

Upgrading to 47.0.0 should be straightforward for most users, but do review
the [Upgrade Guide for DataFusion 47.0.0](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0) for detailed
steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the
transition. For a comprehensive list of all changes, please refer to the [changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md) for 47.0.0. The changelog
enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.

## Get Involved

Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to
take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions.
You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our
community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to
contribute, we’d love to hear from you. A list of open issues suitable for beginners
is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22), and you
can find out how to reach us in the [communication doc](https://datafusion.apache.org/contributor-guide/communication.html).

Happy querying!