apache · alamb · Jul 11, 2025 · Feb 20, 2025 · Feb 21, 2025 · Feb 21, 2025
diff --git a/content/blog/2025-07-11-datafusion-47.0.0.md b/content/blog/2025-07-11-datafusion-47.0.0.md
@@ -0,0 +1,272 @@
+---
+layout: post
+title: Apache DataFusion 47.0.0 Released
+date: 2025-07-11
+author: PMC
+categories: [ release ]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+We’re excited to announce the release of **Apache DataFusion 47.0.0**! This new version represents a significant
+milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the
+full [changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md). We’ll highlight the most
+important changes below and guide you through upgrading.
+
+Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to 
+limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us
+accelerate the next release and announcements 
+by [joining the community](https://datafusion.apache.org/contributor-guide/communication.html)  🎣.
+
+## Breaking Changes
+
+DataFusion 47.0.0 brings a few **breaking changes** that may require adjustments to your code as described in
+the [Upgrade Guide](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0). Here are some notable ones:
+
+- [Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0](https://github.com/apache/datafusion/pull/15466):
+  Several APIs changed in the underlying `arrow`, `parquet` and `object_store` libraries to use a `u64` instead of usize to better support
+  WASM. This requires converting from `usize` to `u64` occasionally as well as changes to ObjectStore implementations such as
+```Rust
+impl ObjectStore {
+    ...
+
+    // The range is now a u64 instead of usize
+    async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
+        self.inner.get_range(location, range).await
+    }
+
+    ...
+
+    // the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
+    // (this also applies to list_with_offset)
+    fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
+        self.inner.list(prefix)
+    }
+}
+```
+- [DisplayFormatType::TreeRender](https://github.com/apache/datafusion/issues/14914):
+  Implementations of `ExecutionPlan` must also provide a description in the `DisplayFormatType::TreeRender` format to
+  provide support for the new [tree style explains](https://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default).
+  This can be the same as the existing `DisplayFormatType::Default`.
+
+## Performance Improvements
+
+DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy
+optimizations in this release:
+
+- **`FIRST_VALUE` and `LAST_VALUE`:**  `FIRST_VALUE` and `LAST_VALUE` functions execute much faster for data with high cardinality such as those with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PR's [#15266](https://github.com/apache/datafusion/pull/15266)
+  and [#15542](https://github.com/apache/datafusion/pull/15542) by [UBarney](https://github.com/UBarney)).
+
+- **`MIN`, `MAX` and `AVG` for Durations:**  DataFusion executes aggregate queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on `Duration` columns. 
+  (PRs [#15322](  https://github.com/apache/datafusion/pull/15322) and [#15748](https://github.com/apache/datafusion/pull/15748)
+  by [shruti2522](https://github.com/shruti2522)).
+
+- **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly skips the evaluation of
+  the right operand if the left is known to be false (`AND`) or true (`OR`) in certain cases. For complex predicates, such as those with many `LIKE` or `CASE` expressions, this optimization results in
+  [significant performance improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617) (up to 100x in extreme cases).
+  (PRs [#15462](https://github.com/apache/datafusion/pull/15462) and [#15694](https://github.com/apache/datafusion/pull/15694)
+  by [acking-you](https://github.com/acking-you)).
+
+- **TopK optimization for partially sorted input:** Previous versions of DataFusion implemented early termination
+  optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. 
+  (PR [#15563](https://github.com/apache/datafusion/pull/15563) by [geoffreyclaude](https://github.com/geoffreyclaude)).
+
+- **Disable re-validation of spilled files:** DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive as the data is known to be valid when it was written out
+  (PR [#15454](https://github.com/apache/datafusion/pull/15454) by [zebsme](https://github.com/zebsme)).
+
+## Highlighted New Features
+
+### Tree style explains
+
+In previous releases the [EXPLAIN statement] results in a formatted table
+which is succinct and contains important details for implementers, but was often hard to read
+especially with queries that included joins or unions having multiple children.
+
+[EXPLAIN statement]: https://datafusion.apache.org/user-guide/sql/explain.html
+
+DataFusion 47.0.0 includes the new `EXPLAIN FORMAT TREE` (default in
+`datafusion-cli`) rendered in a visual tree style that is much easier to quickly
+understand.
+
+<!-- SQL setup 
+create table t1(ti int) as values (1), (2), (3);
+create table t2(ti int) as values (1), (2), (3);
+-->
+
+Example of the new explain output:
+```sql
+> explain select * from t1 inner join t2 on t1.ti=t2.ti;
++---------------+------------------------------------------------------------+
+| plan_type     | plan                                                       |
++---------------+------------------------------------------------------------+
+| physical_plan | ┌───────────────────────────┐                              |
+|               | │    CoalesceBatchesExec    │                              |
+|               | │    --------------------   │                              |
+|               | │     target_batch_size:    │                              |
+|               | │            8192           │                              |
+|               | └─────────────┬─────────────┘                              |
+|               | ┌─────────────┴─────────────┐                              |
+|               | │        HashJoinExec       │                              |
+|               | │    --------------------   ├──────────────┐               |
+|               | │       on: (ti = ti)       │              │               |
+|               | └─────────────┬─────────────┘              │               |
+|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
+|               | │       DataSourceExec      ││       DataSourceExec      │ |
+|               | │    --------------------   ││    --------------------   │ |
+|               | │         bytes: 112        ││         bytes: 112        │ |
+|               | │       format: memory      ││       format: memory      │ |
+|               | │          rows: 1          ││          rows: 1          │ |
+|               | └───────────────────────────┘└───────────────────────────┘ |
+|               |                                                            |
++---------------+------------------------------------------------------------+
+```
+
+Example of the `EXPLAIN FORMAT INDENT` output for the same query
+```sql
+> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
++---------------+----------------------------------------------------------------------+
+| plan_type     | plan                                                                 |
++---------------+----------------------------------------------------------------------+
+| logical_plan  | Inner Join: t1.ti = t2.ti                                            |
+|               |   TableScan: t1 projection=[ti]                                      |
+|               |   TableScan: t2 projection=[ti]                                      |
+| physical_plan | CoalesceBatchesExec: target_batch_size=8192                          |
+|               |   HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] |
+|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
+|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
+|               |                                                                      |
++---------------+----------------------------------------------------------------------+
+2 row(s) fetched.
+```
+
+Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR [#14677](https://github.com/apache/datafusion/pull/14677)
+and many others for completing the [followup epic](https://github.com/apache/datafusion/issues/14914)
+
+### SQL `VARCHAR` defaults to Utf8View
+
+In previous releases when a column was created in SQL the column would be mapped to the [Utf8 Arrow data type]. In this release
+the SQL `varchar` columns will be mapped to the [Utf8View arrow data type] by default, which is a more efficient representation of UTF-8 strings in Arrow.
+
+[Utf8 Arrow data type]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8
+[Utf8View arrow data type]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8View
+
+```sql
+create table foo(x varchar);
+0 row(s) fetched.
+
+> describe foo;
++-------------+-----------+-------------+
+| column_name | data_type | is_nullable |
++-------------+-----------+-------------+
+| x           | Utf8View  | YES         |
++-------------+-----------+-------------+
+```
+
+Previous versions of DataFusion used `Utf8View` when reading parquet files and it is faster in most cases.
+
+Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104)
+
+### Context propagation in spawned tasks (for tracing, logging, etc.)
+
+This release introduces an API for propagating user-defined context (such as tracing spans,
+logging, or metrics) across thread boundaries without depending on any specific instrumentation library.
+You can use the [JoinSetTracer] API to instrument DataFusion plans with your own tracing or logging libraries, or
+use pre-integrated community crates such as the [datafusion-tracing] crate.
+
+<div style="text-align: center;">
+  <a href="https://github.com/datafusion-contrib/datafusion-tracing">
+    <img
+      src="/blog/images/datafusion-47.0.0/datafusion-telemetry.png"
+      width="50%"
+      class="img-responsive"
+      alt="DataFusion telemetry project logo"
+    />
+    </a>
+</div>
+
+
+[datafusion-tracing]: https://github.com/datafusion-contrib/datafusion-tracing
+
+Previously, tasks spawned on new threads — such as those performing
+repartitioning or Parquet file reads — could lose thread-local context, which is
+often used in instrumentation libraries. A full example of how to use this new
+API is available in the [DataFusion examples], and a simple example is shown below.
+
+
+[JoinSetTracer]: https://docs.rs/datafusion/latest/datafusion/common/runtime/trait.JoinSetTracer.html
+[DataFusion examples]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/tracing.rs
+
+```Rust
+/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state
+/// for the current span and must be called at the start of each new task or thread.
+struct SpanTracer;
+
+/// Implements the `JoinSetTracer` trait so we can inject instrumentation
+/// for both async futures and blocking closures.
+impl JoinSetTracer for SpanTracer {
+    /// Instruments a boxed future to run in the current span. The future's
+    /// return type is erased to `Box<dyn Any + Send>`, which we simply
+    /// run inside the `Span::current()` context.
+    fn trace_future(
+        &self,
+        fut: BoxFuture<'static, Box<dyn Any + Send>>,
+    ) -> BoxFuture<'static, Box<dyn Any + Send>> {
+        // Ensures any thread-local context is set in this future 
+        fut.in_current_span().boxed()
+    }
+
+    /// Instruments a boxed blocking closure by running it inside the
+    /// `Span::current()` context.
+    fn trace_block(
+        &self,
+        f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>,
+    ) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> {
+        let span = Span::current();
+        // Ensures any thread-local context is set for this closure
+        Box::new(move || span.in_scope(f))
+    }
+}
+
+...
+set_join_set_tracer(&SpanTracer).expect("Failed to set tracer");
+...
+```
+
+Thanks to [geoffreyclaude](https://github.com/geoffreyclaude) for PR [#14914](https://github.com/apache/datafusion/issues/14914)
+
+## Upgrade Guide and Changelog
+
+Upgrading to 47.0.0 should be straightforward for most users, but do review
+the [Upgrade Guide for DataFusion 47.0.0](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0) for detailed
+steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the
+transition. For a comprehensive list of all changes, please refer to the [changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md) for 47.0.0. The changelog
+enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.
+
+## Get Involved
+
+Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to
+take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions.
+You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our
+community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to
+contribute, we’d love to hear from you. A list of open issues suitable for beginners
+is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) and you
+can find how to reach us on the [communication doc](https://datafusion.apache.org/contributor-guide/communication.html).
+
+Happy querying!
diff --git a/content/images/datafusion-47.0.0/datafusion-telemetry.png b/content/images/datafusion-47.0.0/datafusion-telemetry.png