|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: Apache DataFusion 47.0.0 Released |
| 4 | +date: 2025-07-11 |
| 5 | +author: PMC |
| 6 | +categories: [ release ] |
| 7 | +--- |
| 8 | + |
| 9 | +<!-- |
| 10 | +{% comment %} |
| 11 | +Licensed to the Apache Software Foundation (ASF) under one or more |
| 12 | +contributor license agreements. See the NOTICE file distributed with |
| 13 | +this work for additional information regarding copyright ownership. |
| 14 | +The ASF licenses this file to you under the Apache License, Version 2.0 |
| 15 | +(the "License"); you may not use this file except in compliance with |
| 16 | +the License. You may obtain a copy of the License at |
| 17 | +http://www.apache.org/licenses/LICENSE-2.0 |
| 18 | +Unless required by applicable law or agreed to in writing, software |
| 19 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 20 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 21 | +See the License for the specific language governing permissions and |
| 22 | +limitations under the License. |
| 23 | +{% endcomment %} |
| 24 | +--> |
| 25 | + |
| 26 | +<!-- see https://github.com/apache/datafusion/issues/16347 for details --> |
| 27 | + |
| 28 | +We’re excited to announce the release of **Apache DataFusion 47.0.0**! This new version represents a significant |
| 29 | +milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the |
| 30 | +full [changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md). We’ll highlight the most |
| 31 | +important changes below and guide you through upgrading. |
| 32 | + |
| 33 | +Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to |
| 34 | +limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us |
| 35 | +accelerate the next release and announcements |
| 36 | +by [joining the community](https://datafusion.apache.org/contributor-guide/communication.html) 🎣. |
| 37 | + |
| 38 | +## Breaking Changes |
| 39 | + |
| 40 | +DataFusion 47.0.0 brings a few **breaking changes** that may require adjustments to your code as described in |
| 41 | +the [Upgrade Guide](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0). Here are some notable ones: |
| 42 | + |
| 43 | +- [Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0](https://github.com/apache/datafusion/pull/15466): |
| 44 | + Several APIs in the underlying `arrow`, `parquet` and `object_store` libraries now use `u64` instead of `usize` to better support |
| 45 | + WASM. This occasionally requires converting from `usize` to `u64`, as well as changes to `ObjectStore` implementations such as: |
| 46 | +```Rust |
| 47 | +impl ObjectStore { |
| 48 | + ... |
| 49 | + |
| 50 | + // The range is now a u64 instead of usize |
| 51 | + async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> { |
| 52 | + self.inner.get_range(location, range).await |
| 53 | + } |
| 54 | + |
| 55 | + ... |
| 56 | + |
| 57 | + // the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references) |
| 58 | + // (this also applies to list_with_offset) |
| 59 | + fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> { |
| 60 | + self.inner.list(prefix) |
| 61 | + } |
| 62 | +} |
| 63 | +``` |
| 64 | +- [DisplayFormatType::TreeRender](https://github.com/apache/datafusion/issues/14914): |
| 65 | + Implementations of `ExecutionPlan` must also provide a description in the `DisplayFormatType::TreeRender` format to |
| 66 | + provide support for the new [tree style explains](https://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default). |
| 67 | + This can be the same as the existing `DisplayFormatType::Default`. |
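
The `usize` to `u64` conversions mentioned in the first item above can be sketched with the standard library alone. This is a minimal illustration rather than code from DataFusion or `object_store`; the helper names `to_u64_range` and `to_usize_range` are hypothetical.

```rust
use std::num::TryFromIntError;
use std::ops::Range;

/// Hypothetical helper: widens a `usize` byte range to the `u64` range that
/// the updated APIs expect. Lossless wherever `usize` is at most 64 bits wide.
fn to_u64_range(range: Range<usize>) -> Range<u64> {
    (range.start as u64)..(range.end as u64)
}

/// Hypothetical helper: narrows a `u64` range back to `usize` (for example to
/// index an in-memory buffer). Returns an error instead of truncating when a
/// bound does not fit, which can happen on 32-bit targets such as WASM.
fn to_usize_range(range: Range<u64>) -> Result<Range<usize>, TryFromIntError> {
    let start = usize::try_from(range.start)?;
    let end = usize::try_from(range.end)?;
    Ok(start..end)
}
```

For example, `to_u64_range(4..12)` yields `4..12` as a `Range<u64>`, while `to_usize_range` reports an error rather than silently truncating a bound that does not fit in `usize`.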
| 68 | + |
| 69 | +## Performance Improvements |
| 70 | + |
| 71 | +DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy |
| 72 | +optimizations in this release: |
| 73 | + |
| 74 | +- **`FIRST_VALUE` and `LAST_VALUE`:** The `FIRST_VALUE` and `LAST_VALUE` functions execute much faster on high-cardinality data, such as data with many groups or partitions. DataFusion 47.0.0 executes the following query in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PRs [#15266](https://github.com/apache/datafusion/pull/15266) |
| 75 | + and [#15542](https://github.com/apache/datafusion/pull/15542) by [UBarney](https://github.com/UBarney)). |
| 76 | + |
| 77 | +- **`MIN`, `MAX` and `AVG` for Durations:** DataFusion executes aggregate queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on `Duration` columns. |
| 78 | + (PRs [#15322](https://github.com/apache/datafusion/pull/15322) and [#15748](https://github.com/apache/datafusion/pull/15748) |
| 79 | + by [shruti2522](https://github.com/shruti2522)). |
| 80 | + |
| 81 | +- **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly skips the evaluation of |
| 82 | + the right operand if the left is known to be false (`AND`) or true (`OR`) in certain cases. For complex predicates, such as those with many `LIKE` or `CASE` expressions, this optimization results in |
| 83 | + [significant performance improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617) (up to 100x in extreme cases). |
| 84 | + (PRs [#15462](https://github.com/apache/datafusion/pull/15462) and [#15694](https://github.com/apache/datafusion/pull/15694) |
| 85 | + by [acking-you](https://github.com/acking-you)). |
| 86 | + |
| 87 | +- **TopK optimization for partially sorted input:** Previous versions of DataFusion implemented an early termination |
| 88 | + optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization to partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. |
| 89 | + (PR [#15563](https://github.com/apache/datafusion/pull/15563) by [geoffreyclaude](https://github.com/geoffreyclaude)). |
| 90 | + |
| 91 | +- **Disable re-validation of spilled files:** DataFusion no longer re-validates temporary spill files. The re-validation was expensive and unnecessary because the data is known to be valid when it is written out |
| 92 | + (PR [#15454](https://github.com/apache/datafusion/pull/15454) by [zebsme](https://github.com/zebsme)). |
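
The short-circuit idea for `AND` described in the list above can be sketched in plain Rust. This is a simplified illustration under stated assumptions, not DataFusion's implementation: the function name `eval_and` is hypothetical, each operand is modeled as a closure that returns one boolean per row, and SQL `NULL` handling is ignored.

```rust
/// Hypothetical sketch: evaluates `left AND right` over a batch of rows.
/// Each operand is a closure standing in for an expression that yields one
/// boolean per row. Both operands are assumed to yield the same row count.
fn eval_and<L, R>(left: L, right: R) -> Vec<bool>
where
    L: FnOnce() -> Vec<bool>,
    R: FnOnce() -> Vec<bool>,
{
    let lhs = left();

    // Short circuit: if every row of the left operand is false, the result
    // is all false, so the (possibly expensive) right operand is never run.
    if lhs.iter().all(|v| !*v) {
        return lhs;
    }

    // Otherwise evaluate the right operand and combine row by row.
    let rhs = right();
    lhs.iter().zip(rhs.iter()).map(|(a, b)| *a && *b).collect()
}
```

The same shape applies to `OR` with the condition flipped: when every row of the left operand is true, the right operand can be skipped.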
| 93 | + |
| 94 | +## Highlighted New Features |
| 95 | + |
| 96 | +### Tree style explains |
| 97 | + |
| 98 | +In previous releases the [EXPLAIN statement] produced a formatted table |
| 99 | +that is succinct and contains important details for implementers, but is often hard to read, |
| 100 | +especially for queries with joins or unions, whose plan nodes have multiple children. |
| 101 | + |
| 102 | +[EXPLAIN statement]: https://datafusion.apache.org/user-guide/sql/explain.html |
| 103 | + |
| 104 | +DataFusion 47.0.0 includes the new `EXPLAIN FORMAT TREE` (default in |
| 105 | +`datafusion-cli`) rendered in a visual tree style that is much easier to quickly |
| 106 | +understand. |
| 107 | + |
| 108 | +<!-- SQL setup |
| 109 | +create table t1(ti int) as values (1), (2), (3); |
| 110 | +create table t2(ti int) as values (1), (2), (3); |
| 111 | +--> |
| 112 | + |
| 113 | +Example of the new explain output: |
| 114 | +```sql |
| 115 | +> explain select * from t1 inner join t2 on t1.ti=t2.ti; |
| 116 | ++---------------+------------------------------------------------------------+ |
| 117 | +| plan_type | plan | |
| 118 | ++---------------+------------------------------------------------------------+ |
| 119 | +| physical_plan | ┌───────────────────────────┐ | |
| 120 | +| | │ CoalesceBatchesExec │ | |
| 121 | +| | │ -------------------- │ | |
| 122 | +| | │ target_batch_size: │ | |
| 123 | +| | │ 8192 │ | |
| 124 | +| | └─────────────┬─────────────┘ | |
| 125 | +| | ┌─────────────┴─────────────┐ | |
| 126 | +| | │ HashJoinExec │ | |
| 127 | +| | │ -------------------- ├──────────────┐ | |
| 128 | +| | │ on: (ti = ti) │ │ | |
| 129 | +| | └─────────────┬─────────────┘ │ | |
| 130 | +| | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ | |
| 131 | +| | │ DataSourceExec ││ DataSourceExec │ | |
| 132 | +| | │ -------------------- ││ -------------------- │ | |
| 133 | +| | │ bytes: 112 ││ bytes: 112 │ | |
| 134 | +| | │ format: memory ││ format: memory │ | |
| 135 | +| | │ rows: 1 ││ rows: 1 │ | |
| 136 | +| | └───────────────────────────┘└───────────────────────────┘ | |
| 137 | +| | | |
| 138 | ++---------------+------------------------------------------------------------+ |
| 139 | +``` |
| 140 | + |
| 141 | +Example of the `EXPLAIN FORMAT INDENT` output for the same query: |
| 142 | +```sql |
| 143 | +> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti; |
| 144 | ++---------------+----------------------------------------------------------------------+ |
| 145 | +| plan_type | plan | |
| 146 | ++---------------+----------------------------------------------------------------------+ |
| 147 | +| logical_plan | Inner Join: t1.ti = t2.ti | |
| 148 | +| | TableScan: t1 projection=[ti] | |
| 149 | +| | TableScan: t2 projection=[ti] | |
| 150 | +| physical_plan | CoalesceBatchesExec: target_batch_size=8192 | |
| 151 | +| | HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] | |
| 152 | +| | DataSourceExec: partitions=1, partition_sizes=[1] | |
| 153 | +| | DataSourceExec: partitions=1, partition_sizes=[1] | |
| 154 | +| | | |
| 155 | ++---------------+----------------------------------------------------------------------+ |
| 156 | +2 row(s) fetched. |
| 157 | +``` |
| 158 | + |
| 159 | +Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR [#14677](https://github.com/apache/datafusion/pull/14677) |
| 160 | +and many others for completing the [followup epic](https://github.com/apache/datafusion/issues/14914). |
| 161 | + |
| 162 | +### SQL `VARCHAR` defaults to Utf8View |
| 163 | + |
| 164 | +In previous releases, a SQL `varchar` column was mapped to the [Utf8 Arrow data type]. In this release, |
| 165 | +SQL `varchar` columns are mapped to the [Utf8View arrow data type] by default, which is a more efficient representation of UTF-8 strings in Arrow. |
| 166 | + |
| 167 | +[Utf8 Arrow data type]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8 |
| 168 | +[Utf8View arrow data type]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8View |
| 169 | + |
| 170 | +```sql |
| 171 | +> create table foo(x varchar); |
| 172 | +0 row(s) fetched. |
| 173 | + |
| 174 | +> describe foo; |
| 175 | ++-------------+-----------+-------------+ |
| 176 | +| column_name | data_type | is_nullable | |
| 177 | ++-------------+-----------+-------------+ |
| 178 | +| x | Utf8View | YES | |
| 179 | ++-------------+-----------+-------------+ |
| 180 | +``` |
| 181 | + |
| 182 | +Previous versions of DataFusion already used `Utf8View` when reading Parquet files, and it is faster in most cases. |
| 183 | + |
| 184 | +Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104). |
| 185 | + |
| 186 | +### Context propagation in spawned tasks (for tracing, logging, etc.) |
| 187 | + |
| 188 | +This release introduces an API for propagating user-defined context (such as tracing spans, |
| 189 | +logging, or metrics) across thread boundaries without depending on any specific instrumentation library. |
| 190 | +You can use the [JoinSetTracer] API to instrument DataFusion plans with your own tracing or logging libraries, or |
| 191 | +use pre-integrated community crates such as the [datafusion-tracing] crate. |
| 192 | + |
| 193 | +<div style="text-align: center;"> |
| 194 | + <a href="https://github.com/datafusion-contrib/datafusion-tracing"> |
| 195 | + <img |
| 196 | + src="/blog/images/datafusion-47.0.0/datafusion-telemetry.png" |
| 197 | + width="50%" |
| 198 | + class="img-responsive" |
| 199 | + alt="DataFusion telemetry project logo" |
| 200 | + /> |
| 201 | + </a> |
| 202 | +</div> |
| 203 | + |
| 204 | + |
| 205 | +[datafusion-tracing]: https://github.com/datafusion-contrib/datafusion-tracing |
| 206 | + |
| 207 | +Previously, tasks spawned on new threads — such as those performing |
| 208 | +repartitioning or Parquet file reads — could lose thread-local context, which is |
| 209 | +often used in instrumentation libraries. A full example of how to use this new |
| 210 | +API is available in the [DataFusion examples], and a simple example is shown below. |
| 211 | + |
| 212 | + |
| 213 | +[JoinSetTracer]: https://docs.rs/datafusion/latest/datafusion/common/runtime/trait.JoinSetTracer.html |
| 214 | +[DataFusion examples]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/tracing.rs |
| 215 | + |
| 216 | +```Rust |
| 217 | +/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state |
| 218 | +/// for the current span and must be called at the start of each new task or thread. |
| 219 | +struct SpanTracer; |
| 220 | + |
| 221 | +/// Implements the `JoinSetTracer` trait so we can inject instrumentation |
| 222 | +/// for both async futures and blocking closures. |
| 223 | +impl JoinSetTracer for SpanTracer { |
| 224 | + /// Instruments a boxed future to run in the current span. The future's |
| 225 | + /// return type is erased to `Box<dyn Any + Send>`, which we simply |
| 226 | + /// run inside the `Span::current()` context. |
| 227 | + fn trace_future( |
| 228 | + &self, |
| 229 | + fut: BoxFuture<'static, Box<dyn Any + Send>>, |
| 230 | + ) -> BoxFuture<'static, Box<dyn Any + Send>> { |
| 231 | + // Ensures any thread-local context is set in this future |
| 232 | + fut.in_current_span().boxed() |
| 233 | + } |
| 234 | + |
| 235 | + /// Instruments a boxed blocking closure by running it inside the |
| 236 | + /// `Span::current()` context. |
| 237 | + fn trace_block( |
| 238 | + &self, |
| 239 | + f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>, |
| 240 | + ) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> { |
| 241 | + let span = Span::current(); |
| 242 | + // Ensures any thread-local context is set for this closure |
| 243 | + Box::new(move || span.in_scope(f)) |
| 244 | + } |
| 245 | +} |
| 246 | + |
| 247 | +... |
| 248 | +set_join_set_tracer(&SpanTracer).expect("Failed to set tracer"); |
| 249 | +... |
| 250 | +``` |
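
To illustrate why such a hook is needed, the sketch below reproduces the underlying problem with only the standard library. It is an illustration under stated assumptions and does not use the DataFusion API: `CURRENT_SPAN`, `current_span`, `set_span`, `spawn_plain` and `spawn_with_context` are hypothetical names, and a `String` stands in for a real tracing span.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    /// Stand-in for the per-thread state an instrumentation library keeps.
    static CURRENT_SPAN: RefCell<Option<String>> = RefCell::new(None);
}

/// Reads the span recorded for the calling thread, if any.
fn current_span() -> Option<String> {
    CURRENT_SPAN.with(|s| s.borrow().clone())
}

/// Records a span for the calling thread only.
fn set_span(name: Option<String>) {
    CURRENT_SPAN.with(|s| *s.borrow_mut() = name);
}

/// Spawns a thread without propagation: the new thread starts with its own,
/// empty thread-local, so the caller's span is not visible there.
fn spawn_plain() -> Option<String> {
    thread::spawn(current_span).join().expect("thread panicked")
}

/// Captures the caller's span before spawning and reinstalls it on the new
/// thread before any work runs, which is the role a tracer hook plays.
fn spawn_with_context() -> Option<String> {
    let captured = current_span();
    thread::spawn(move || {
        set_span(captured);
        current_span()
    })
    .join()
    .expect("thread panicked")
}
```

After calling `set_span` on the parent thread, `spawn_plain` observes no span on the child thread, while `spawn_with_context` observes the parent's span because it was captured and reinstalled explicitly.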
| 251 | + |
| 252 | +Thanks to [geoffreyclaude](https://github.com/geoffreyclaude) for PR [#14914](https://github.com/apache/datafusion/issues/14914) |
| 253 | + |
| 254 | +## Upgrade Guide and Changelog |
| 255 | + |
| 256 | +Upgrading to 47.0.0 should be straightforward for most users, but do review |
| 257 | +the [Upgrade Guide for DataFusion 47.0.0](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0) for detailed |
| 258 | +steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the |
| 259 | +transition. For a comprehensive list of all changes, please refer to the [changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md) for 47.0.0. The changelog |
| 260 | +enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post. |
| 261 | + |
| 262 | +## Get Involved |
| 263 | + |
| 264 | +Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to |
| 265 | +take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. |
| 266 | +You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our |
| 267 | +community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to |
| 268 | +contribute, we’d love to hear from you. A list of open issues suitable for beginners |
| 269 | +is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) and you |
| 270 | +can find how to reach us on the [communication doc](https://datafusion.apache.org/contributor-guide/communication.html). |
| 271 | + |
| 272 | +Happy querying! |