---
layout: post
title: Implementing User Defined Types and Custom Metadata in DataFusion
date: 2025-09-21
author: Tim Saucer (rerun.io), Dewey Dunnington (Wherobots), Andrew Lamb (InfluxData)
categories: [core]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

[TOC]

[Apache DataFusion] [version 48.0.0] significantly improves support for user
defined types and metadata. The user defined function APIs now give functions
access to metadata on their input columns and let them attach metadata to
their output.

## User defined types == extension types

DataFusion directly uses [Apache Arrow]'s [DataTypes] as its type system. This
has several benefits: it is simple to explain, supports a rich set of both
scalar and nested types, offers true zero-copy interoperability with other
Arrow implementations, and comes with world-class library support (via
[arrow-rs]). However, one challenge of directly using the Arrow type system is
that there is no distinction between logical types and physical types. For
example, the Arrow type system contains multiple types which can store
"String"s (sequences of UTF8 encoded bytes) such as `Utf8`, `LargeUtf8`,
`Dictionary(Utf8)`, and `Utf8View`.

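As a small illustration, here is a minimal sketch (using [arrow-rs]) of two
arrays that hold the same logical strings in two different physical
representations:

```rust
use arrow::array::{Array, StringArray, StringViewArray};

fn main() {
    // The same logical values, stored in two different physical layouts,
    // and therefore reported as two different DataTypes.
    let a = StringArray::from(vec!["datafusion", "arrow"]); // DataType::Utf8
    let b = StringViewArray::from(vec!["datafusion", "arrow"]); // DataType::Utf8View
    assert_ne!(a.data_type(), b.data_type());
}
```
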
However, Apache Arrow does provide [extension types], a form of logical type
information that describes how to interpret data stored in one of the existing
physical types. With the improved support for metadata in DataFusion 48.0.0, it
is now easier to implement user defined types using Arrow extension types.

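Concretely, an extension type is just a physical type plus reserved metadata
keys on the field. The following sketch tags a 16 byte fixed size binary field
with the canonical `arrow.uuid` extension name; the reserved
`ARROW:extension:name` key comes from the Arrow specification:

```rust
use std::collections::HashMap;
use arrow::datatypes::{DataType, Field};

fn main() {
    // The reserved "ARROW:extension:name" key marks this physical
    // FixedSizeBinary(16) column as logically containing UUIDs.
    let field = Field::new("id", DataType::FixedSizeBinary(16), false)
        .with_metadata(HashMap::from([(
            "ARROW:extension:name".to_string(),
            "arrow.uuid".to_string(),
        )]));

    assert_eq!(
        field.metadata().get("ARROW:extension:name").map(String::as_str),
        Some("arrow.uuid")
    );
}
```
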
## Metadata in Apache Arrow `Field`s

The [Arrow specification] defines metadata as a map of key-value pairs of
strings. This metadata is used to attach extension types and use case-specific
context to a column of values. The Rust implementation of Apache Arrow,
[arrow-rs], stores metadata on [Field]s, but prior to DataFusion 48.0.0, many of
DataFusion's internal APIs used [DataTypes] directly, and thus did not propagate
metadata through all operations.

[Arrow specification]: https://arrow.apache.org/docs/format/Columnar.html
[Field]: https://docs.rs/arrow/latest/arrow/datatypes/struct.Field.html
[DataTypes]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html

In previous versions of DataFusion, `Field` metadata was propagated through
certain operations (e.g., renaming or selecting a column) but not through
others (e.g., scalar, window, or aggregate function calls). In DataFusion
48.0.0 and later, all user defined functions are passed the full input `Field`
information and can return `Field` information to the caller.

While supporting extension types was a key motivation for adding metadata to
function processing, the same mechanism can store arbitrary metadata on the
input and output fields, which supports other interesting use cases, as we
describe later in this post.

[Apache DataFusion]: https://datafusion.apache.org
[Apache Arrow]: https://arrow.apache.org
[version 48.0.0]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
[arrow-rs]: https://crates.io/crates/arrow

## Metadata handling

Data in Arrow record batches carries a [Schema] in addition to the Arrow
arrays. Each [Field] in this `Schema` contains a name, data type, nullability,
and metadata. The metadata is specified as a map of key-value pairs of strings.
In the new implementation, we pass this full input field information to all
user defined functions during processing.

<figure>
  <img src="/blog/images/metadata-handling/arrow_record_batch.png" alt="Relationship between a Record Batch, its schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and each Array entry in the Columns." width="100%" class="img-responsive">
  <figcaption>
    <b>Figure 1:</b> Relationship between a Record Batch, its schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and each Array entry in the Columns.
  </figcaption>
</figure>

[Schema]: https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html

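As a sketch of the relationship in Figure 1, the following builds a one column
record batch whose `Field` carries custom metadata; the `unit` key here is our
own example convention, not a reserved Arrow key:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arrow::array::{ArrayRef, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // Each Field carries a name, data type, nullability, and a metadata map.
    let field = Field::new("temperature", DataType::Float64, true).with_metadata(
        HashMap::from([("unit".to_string(), "celsius".to_string())]),
    );
    let schema = Arc::new(Schema::new(vec![field]));

    // There is one Array per Field: column 0 is described by field 0.
    let values: ArrayRef = Arc::new(Float64Array::from(vec![Some(21.5), None]));
    let batch = RecordBatch::try_new(schema, vec![values])?;
    assert_eq!(batch.num_columns(), batch.schema().fields().len());
    Ok(())
}
```
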
It is often desirable to write a generic function for reuse. In prior versions,
user defined functions only had access to the `DataType` of the input columns.
This works well for features that rely only on the type of the data, but other
use cases may need additional information that describes the data.

For example, suppose we wish to write a function that takes in a UUID and
returns a string describing the [variant] of the input field. We would want
this function to handle all of the string types as well as a binary encoded
UUID. Since the Arrow specification does not contain an unsigned 128 bit type,
it is common to encode a UUID as a fixed size binary array where each element
is 16 bytes long. With the metadata handling in [DataFusion 48.0.0], we can
validate during planning that the input data not only has the correct
underlying data type, but that it also represents the right *kind* of data. The
UUID example is a common one, and it is included in the [canonical extension
types] that are now supported in DataFusion.

Another common application of metadata handling is understanding the encoding
of a blob of data. Suppose you have a column that contains image data, most
likely stored as an array of `u8` values. Without knowing a priori what the
encoding of that blob is, you cannot ensure you are using the correct methods
for decoding it. You may work around this by adding another column to your data
source indicating the encoding, but this can be wasteful for systems where the
encoding never changes. Instead, you could use metadata to specify the encoding
for the entire column.

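For instance, here is a sketch of how such a column's `Field` might be
declared. The `encoding_type` key is a convention we invent for this example
(it also appears in the `invoke_with_args` snippet later in this post):

```rust
use std::collections::HashMap;
use arrow::datatypes::{DataType, Field};

fn main() {
    // Record the encoding once, on the column's Field, rather than
    // repeating it in a second column on every row.
    let images = Field::new("image", DataType::Binary, false).with_metadata(
        HashMap::from([("encoding_type".to_string(), "png".to_string())]),
    );

    assert_eq!(
        images.metadata().get("encoding_type").map(String::as_str),
        Some("png")
    );
}
```
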
[variant]: https://www.ietf.org/rfc/rfc9562.html#section-4.1
[canonical extension types]: https://arrow.apache.org/docs/format/CanonicalExtensions.html

## How to use metadata in user defined functions

When working with metadata in [user defined scalar functions], there are
typically two places in the function definition that require implementation:

- Computing the return field from the arguments
- Invocation

During planning, DataFusion will attempt to call the function's
[return_field_from_args()]. This provides the list of input fields to the
function, which returns the output field. To evaluate metadata on the input
side, you can write a function similar to this example:

[user defined scalar functions]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
[return_field_from_args()]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.return_field_from_args

```rust
fn return_field_from_args(
    &self,
    args: ReturnFieldArgs,
) -> datafusion::common::Result<FieldRef> {
    if args.arg_fields.len() != 1 {
        return exec_err!("Incorrect number of arguments for uuid_version");
    }

    // If the input is a 16 byte fixed size binary, require that it is
    // tagged with the UUID canonical extension type
    let input_field = &args.arg_fields[0];
    if &DataType::FixedSizeBinary(16) == input_field.data_type() {
        let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type()
        else {
            return exec_err!("Input field must contain the UUID canonical extension type");
        };
    }

    // The output is nullable only if the input is nullable
    let is_nullable = args.arg_fields[0].is_nullable();

    Ok(Arc::new(Field::new(self.name(), DataType::UInt32, is_nullable)))
}
```

In this example, we take advantage of the fact that arrow-rs already provides
support for extension types, which evaluates this metadata for us. If you were
checking for metadata other than an extension type, you could instead write a
snippet such as:

```rust
    if &DataType::FixedSizeBinary(16) == input_field.data_type() {
        // The extension type name is stored in the field's metadata map
        // under the reserved "ARROW:extension:name" key
        let type_name = input_field.metadata().get("ARROW:extension:name");
        if type_name.map(String::as_str) != Some("arrow.uuid") {
            return exec_err!("Input field must contain the UUID canonical extension type");
        }
    }
```

If you are writing a user defined function that instead returns metadata on its
output, you can add it directly to the `Field` returned by
`return_field_from_args`. In the example above, we could change the return line
to:

```rust
    Ok(Arc::new(
        Field::new(self.name(), DataType::UInt32, is_nullable).with_metadata(
            [("my_key".to_string(), "my_value".to_string())]
                .into_iter()
                .collect(),
        ),
    ))
```

By checking the metadata during the planning process, we can identify errors
early in the query process. There are cases where we wish to have access to
this metadata during execution as well. The function [invoke_with_args] in the
user defined function takes the updated struct [ScalarFunctionArgs], which now
contains the input fields and can be used to check for metadata. For example,
you can do the following:

[invoke_with_args]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#tymethod.invoke_with_args
[ScalarFunctionArgs]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html

```rust
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    assert_eq!(args.arg_fields.len(), 1);
    // Look up our use case-specific key in the input field's metadata
    let my_value = args.arg_fields[0]
        .metadata()
        .get("encoding_type");
    ...
```

In this snippet we have extracted an `Option<&String>` from the input field
metadata, which we can use to determine which decoding functions to call when
evaluating the array in the arguments. Since `return_field_from_args` takes
`&self` rather than `&mut self`, we cannot store the result of this check
during planning, so we evaluate the metadata again during execution.

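To make that concrete, here is a hedged sketch of how the extracted value
might drive dispatch; `decode_png` and `decode_jpeg` are hypothetical helpers
for this example, not DataFusion APIs:

```rust
use datafusion::arrow::array::ArrayRef;
use datafusion::common::{exec_err, Result};
use datafusion::logical_expr::{ColumnarValue, ScalarFunctionArgs};

// Hypothetical decoders, stubbed out for illustration only
fn decode_png(_input: &ColumnarValue) -> Result<ArrayRef> {
    todo!()
}
fn decode_jpeg(_input: &ColumnarValue) -> Result<ArrayRef> {
    todo!()
}

fn dispatch(args: &ScalarFunctionArgs) -> Result<ArrayRef> {
    // Choose a decoder based on the column-level "encoding_type" metadata
    let encoding = args.arg_fields[0].metadata().get("encoding_type");
    match encoding.map(String::as_str) {
        Some("png") => decode_png(&args.args[0]),
        Some("jpeg") => decode_jpeg(&args.args[0]),
        other => exec_err!("Unsupported encoding: {other:?}"),
    }
}
```
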
The description in this section applies to scalar user defined functions, but
equivalent support exists for aggregate and window functions.

## Extension types

Extension types are one of the primary motivations for this enhancement in
[DataFusion 48.0.0]. The official Rust implementation of Apache Arrow,
[arrow-rs], already contains support for the [canonical extension types]. This
support includes helper functions such as the `try_canonical_extension_type()`
used in the earlier example.

For a concrete example of how extension types can be used in DataFusion
functions, there is an [example repository] that demonstrates working with
UUIDs. The UUID extension type specifies that the data are stored as a fixed
size binary of length 16. The DataFusion core functions can generate string
representations of UUIDs that match the version 4 specification. These are
helpful, but a user may wish to do additional work with UUIDs where having them
in the dense binary representation is preferable. Alternatively, the user may
already have data with the binary encoding and want to extract values such as
the version, timestamp, or string representation.

In the example repository we have created three user defined functions:
`UuidVersion`, `StringToUuid`, and `UuidToString`. Each of these implements
`ScalarUDFImpl` and can be used as follows:

```rust
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = create_context()?;

    // get a DataFrame from the context
    let mut df = ctx.table("t").await?;

    // Create the string UUIDs
    df = df.select(vec![uuid().alias("string_uuid")])?;

    // Convert string UUIDs to canonical extension UUIDs
    let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default());
    df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?;

    // Extract version number from canonical extension UUIDs
    let version = ScalarUDF::new_from_impl(UuidVersion::default());
    df = df.with_column("version", version.call(vec![col("uuid")]))?;

    // Convert back to a string
    let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default());
    df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?;

    df.show().await?;

    Ok(())
}
```

The [example repository] also contains a crate that demonstrates how to expose
these UDFs to [datafusion-python]. This requires `datafusion-python` version
48.0.0 or later.

[example repository]: https://github.com/timsaucer/datafusion_extension_type_examples
[datafusion-python]: https://datafusion.apache.org/python/

## Other use cases

The metadata attached to fields can be used to store *any* user data in
key/value pairs. Some of the other use cases that have been identified include:

- Creating output for downstream systems. One user of DataFusion produces
  [data visualizations] that are dependent upon metadata in record batch
  fields. By enabling metadata on the output of user defined functions, we can
  now produce batches that are directly consumable by these systems.
- Describing the relationships between columns of data. You can store data
  about how one column of data relates to another and use it during function
  evaluation. For example, in robotics it is common to use [transforms] to
  describe how to convert from one coordinate system to another. It can be
  convenient to send the function all the columns that contain transform
  information and then allow the function to determine which columns to use
  based on the metadata. This encapsulates the transform logic within the user
  function.
- Storing logical types of the data model. [InfluxDB] uses field metadata to
  specify which columns are used for tags, times, and fields.

Based on the experience of the authors, we recommend caution when using
metadata for use cases other than type extension. One issue that can arise is
that as columns are used to compute new fields, some functions may pass through
the metadata while the semantic meaning changes. For example, suppose you
decided to use metadata to store statistics for the entire stream of record
batches, and you then pass that column through a filter that removes many rows
of data. Your statistics metadata may now be invalid, even though it was passed
through the filter.

Similarly, if you use metadata to form relations between one column and
another, and the naming of the columns changes at some point in your workflow,
then the metadata may refer to the wrong column of data. This can be mitigated
by not relying on column naming, but rather adding additional metadata to all
columns of interest.

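As a sketch of that mitigation, the following tags related columns with a
shared key; the `transform_frame` key is an invented convention for this
example, not part of any Arrow or DataFusion API:

```rust
use std::collections::HashMap;
use arrow::datatypes::{DataType, Field};

fn main() {
    // Tag each related column with the same key/value pair instead of
    // encoding the relationship in the column names themselves.
    let meta = HashMap::from([("transform_frame".to_string(), "robot_base".to_string())]);
    let position = Field::new("position", DataType::Float64, false)
        .with_metadata(meta.clone());
    let velocity = Field::new("velocity", DataType::Float64, false)
        .with_metadata(meta);

    // A function can find related columns by metadata even after renames
    assert_eq!(
        position.metadata().get("transform_frame"),
        velocity.metadata().get("transform_frame")
    );
}
```
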
[data visualizations]: https://rerun.io/blog/column-chunks
[transforms]: https://wiki.ros.org/tf2
[InfluxDB]: https://docs.influxdata.com/influxdb/v1/concepts/schema_and_data_layout/

## Acknowledgements

We would like to thank [Rerun.io] for sponsoring the development of this work.
[Rerun.io] is building a data visualization system for Physical AI and uses
metadata to specify context about columns in Arrow record batches.

[Rerun.io]: https://rerun.io

## Conclusion

The enhanced metadata handling in [DataFusion 48.0.0] is a significant step
forward in DataFusion's ability to handle more interesting types of data. Users
can validate that input data matches the intent of the data to be processed,
perform complex operations on binary data because the encoding is known, and
use metadata to create new and interesting user defined data types.
We can't wait to see what you build with it!

## Get Involved

The DataFusion community is active and welcoming, and we would love to have you
join us and help the project.

Here are some ways to get involved:

* Learn more by visiting the [DataFusion] project page.
* Try out the project and provide feedback, file issues, and contribute code.
* Work on a [good first issue].
* Reach out to us via the [communication doc].

[DataFusion]: https://datafusion.apache.org/index.html
[good first issue]: https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html