---
layout: post
title: Implementing User Defined Types and Custom Metadata in DataFusion
date: 2025-09-21
author: Tim Saucer (rerun.io), Dewey Dunnington (Wherobots), Andrew Lamb (InfluxData)
categories: [core]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

[TOC]

[Apache DataFusion] [version 48.0.0] significantly improves support for user
defined types and metadata. The user defined function APIs now let users access
metadata on a function's input columns and produce metadata on its output.

## User defined types == extension types

DataFusion directly uses [Apache Arrow]'s [DataTypes] as its type system. This
has several benefits: it is simple to explain, supports a rich set of both
scalar and nested types, offers true zero-copy interoperability with other
Arrow implementations, and enjoys world-class library support (via [arrow-rs]).
However, one challenge of using the Arrow type system directly is that it draws
no distinction between logical types and physical types. For example, the Arrow
type system contains multiple types that can store "String"s (sequences of UTF8
encoded bytes), such as `Utf8`, `LargeUtf8`, `Dictionary(Utf8)`, and `Utf8View`.

However, Apache Arrow does provide [extension types], a form of logical type
information that describes how to interpret data stored in one of the existing
physical types. With the improved support for metadata in DataFusion 48.0.0, it
is now easier to implement user defined types using Arrow extension types.
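
For example, the canonical UUID extension type is simply a `FixedSizeBinary(16)`
field tagged with the spec-defined `ARROW:extension:name` metadata key. Here is
a minimal sketch in [arrow-rs] (the field name `id` is illustrative):

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

fn main() {
    // A UUID column is physically a FixedSizeBinary(16); the extension type
    // lives entirely in the field metadata under the spec-defined key.
    let field = Field::new("id", DataType::FixedSizeBinary(16), false).with_metadata(
        HashMap::from([(
            "ARROW:extension:name".to_string(),
            "arrow.uuid".to_string(),
        )]),
    );

    // Consumers that understand extension types see the logical type;
    // consumers that do not simply see a FixedSizeBinary(16) column.
    assert_eq!(
        field.metadata().get("ARROW:extension:name").map(String::as_str),
        Some("arrow.uuid")
    );
}
```
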
## Metadata in Apache Arrow `Field`s

The [Arrow specification] defines metadata as a map of key-value pairs of
strings. This metadata is used to attach extension types and use case-specific
context to a column of values. The Rust implementation of Apache Arrow,
[arrow-rs], stores metadata on [Field]s, but prior to DataFusion 48.0.0, many of
DataFusion's internal APIs used [DataTypes] directly, and thus did not propagate
metadata through all operations.

[Arrow specification]: https://arrow.apache.org/docs/format/Columnar.html
[Field]: https://docs.rs/arrow/latest/arrow/datatypes/struct.Field.html
[DataTypes]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html

In previous versions of DataFusion, `Field` metadata was propagated through
certain operations (e.g., renaming or selecting a column) but not through
others (e.g., scalar, window, or aggregate function calls). In DataFusion
48.0.0 and later, all user defined functions are passed the full input `Field`
information and can return `Field` information to the caller.

Supporting extension types was a key motivation for adding metadata to function
processing, but the same mechanism can store arbitrary metadata on the input
and output fields, which supports other interesting use cases as we describe
later in this post.

[Apache DataFusion]: https://datafusion.apache.org
[Apache Arrow]: https://arrow.apache.org
[version 48.0.0]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
[Arrow data types]: https://arrow.apache.org/docs/format/Columnar.html#data-types
[arrow-rs]: https://crates.io/crates/arrow

## Metadata handling

Arrow record batches carry a [Schema] in addition to the Arrow arrays. Each
[Field] in this `Schema` contains a name, data type, nullability, and metadata.
The metadata is specified as a map of key-value pairs of strings. In the new
implementation, we pass the input field information to all user defined
functions during processing.

<figure>
  <img src="/blog/images/metadata-handling/arrow_record_batch.png" alt="Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and each Array entry in the Columns." width="100%" class="img-responsive">
  <figcaption>
    <b>Figure 1:</b> Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and each Array entry in the Columns.
  </figcaption>
</figure>

[Schema]: https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html
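
To make Figure 1 concrete, here is a minimal sketch of building a record batch
whose schema carries field metadata alongside the column values (the column
name and the `units` key are illustrative):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Each Field pairs one-to-one with one column of values.
    let field = Field::new("reading", DataType::Int32, false)
        .with_metadata(HashMap::from([("units".to_string(), "celsius".to_string())]));
    let schema = Arc::new(Schema::new(vec![field]));

    // The array holds the data; the schema describes it.
    let column: ArrayRef = Arc::new(Int32Array::from(vec![20, 21, 19]));
    let batch = RecordBatch::try_new(schema, vec![column])?;

    // The metadata rides along with the batch's schema.
    assert_eq!(
        batch.schema().field(0).metadata().get("units").map(String::as_str),
        Some("celsius")
    );
    Ok(())
}
```
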
It is often desirable to write a generic function for reuse. Prior versions of
user defined functions only had access to the `DataType` of the input columns.
This works well for functions that rely only on the types of data, but other
use cases may need additional information that describes the data.

For example, suppose we wish to write a function that takes in a UUID and
returns a string containing the [variant] of the input field. We would want
this function to handle all of the string types as well as a binary encoded
UUID. Since the Arrow specification does not contain an unsigned 128 bit
integer type, it is common to encode a UUID as a fixed size binary array where
each element is 16 bytes long. With the metadata handling in
[DataFusion 48.0.0], we can validate during planning that the input data not
only has the correct underlying data type, but also represents the right
*kind* of data. The UUID example is a common one, and it is included in the
[canonical extension types] that are now supported in DataFusion.

Another common application of metadata handling is understanding the encoding
of a blob of data. Suppose you have a column that contains image data, most
likely stored as an array of `u8` values. Without knowing a priori what the
encoding of that blob is, you cannot ensure you are using the correct methods
for decoding it. You could work around this by adding another column to your
data source indicating the encoding, but this is wasteful for systems where
the encoding never changes. Instead, you could use metadata to specify the
encoding for the entire column, as sketched below.
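
Here is a minimal sketch of that idea; the `encoding_type` key and its value
are illustrative conventions, not a standard:

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

fn main() {
    // Record the encoding once for the whole column in the field metadata,
    // rather than storing it per-row in a second column.
    let image_field = Field::new("image", DataType::Binary, true).with_metadata(
        HashMap::from([("encoding_type".to_string(), "png".to_string())]),
    );

    assert_eq!(
        image_field.metadata().get("encoding_type").map(String::as_str),
        Some("png")
    );
}
```
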
[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
[variant]: https://www.ietf.org/rfc/rfc9562.html#section-4.1
[canonical extension types]: https://arrow.apache.org/docs/format/CanonicalExtensions.html

## How to use metadata in user defined functions

When working with metadata in [user defined scalar functions], there are
typically two places in the function definition that require implementation:

- Computing the return field from the arguments
- Invocation

During planning, we attempt to call the function's [return_field_from_args()].
This call provides the function with a list of input fields and returns the
output field. To evaluate metadata on the input side, you can write a function
similar to this example:

[user defined scalar functions]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
[return_field_from_args()]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.return_field_from_args

```rust
fn return_field_from_args(
    &self,
    args: ReturnFieldArgs,
) -> datafusion::common::Result<FieldRef> {
    // Validate the argument count during planning rather than at execution
    if args.arg_fields.len() != 1 {
        return exec_err!("Incorrect number of arguments for uuid_version");
    }

    // If the input is a FixedSizeBinary(16), require the UUID canonical
    // extension type in the field metadata
    let input_field = &args.arg_fields[0];
    if &DataType::FixedSizeBinary(16) == input_field.data_type() {
        let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type()
        else {
            return exec_err!("Input field must contain the UUID canonical extension type");
        };
    }

    // Propagate the input's nullability to the output field
    let is_nullable = args.arg_fields[0].is_nullable();

    Ok(Arc::new(Field::new(self.name(), DataType::UInt32, is_nullable)))
}
```

In this example, we take advantage of the fact that arrow-rs already supports
extension types that evaluate metadata. If we were checking for metadata other
than an extension type, we could instead have written a snippet such as:

```rust
if &DataType::FixedSizeBinary(16) == input_field.data_type() {
    // Inspect the raw metadata map directly instead of using the
    // extension type helpers
    let _ = input_field
        .metadata()
        .get("ARROW:extension:name")
        .ok_or(exec_datafusion_err!(
            "Input field must contain the UUID canonical extension type"
        ))?;
}
```

If you are writing a user defined function that instead returns metadata on its
output, you can add this directly to the `Field` returned from the
`return_field_from_args` call. In our above example, we could change the return
line to:

```rust
Ok(Arc::new(
    Field::new(self.name(), DataType::UInt32, is_nullable).with_metadata(
        [("my_key".to_string(), "my_value".to_string())]
            .into_iter()
            .collect(),
    ),
))
```

By checking the metadata during the planning process, we can identify errors
early in the query. There are cases where we wish to have access to this
metadata during execution as well. The function [invoke_with_args] in the user
defined function takes the updated struct [ScalarFunctionArgs], which now
contains the input fields and can be used to check for metadata. For example,
you can do the following:

[invoke_with_args]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#tymethod.invoke_with_args
[ScalarFunctionArgs]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html

```rust
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    assert_eq!(args.arg_fields.len(), 1);

    // Extract the encoding, if specified, from the input field metadata
    let my_value = args.arg_fields[0]
        .metadata()
        .get("encoding_type");
    // ...
}
```

In this snippet we have extracted an optional string from the input field
metadata, which we can then use to determine which functions to call. We could
then parse the returned value to determine what type of encoding to use when
evaluating the array in the arguments. Since `return_field_from_args` takes
`&self` rather than `&mut self`, this result cannot be stored on the function
during planning, so the check is performed at execution time, as sketched
below.
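
For illustration, a hypothetical continuation of the snippet above might branch
on the extracted value to pick a decoding path (the `decode_png_array` and
`decode_jpeg_array` helpers are made up for this sketch):

```rust
// Hypothetical: branch on the column-level encoding recorded in the
// field metadata to choose the right decoder for the input column.
let result = match my_value.map(String::as_str) {
    Some("png") => decode_png_array(&args.args[0])?,   // hypothetical helper
    Some("jpeg") => decode_jpeg_array(&args.args[0])?, // hypothetical helper
    Some(other) => return exec_err!("Unsupported encoding_type: {other}"),
    None => return exec_err!("Input field is missing encoding_type metadata"),
};
Ok(result)
```
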
The description in this section applies to scalar user defined functions, but
equivalent support exists for aggregate and window functions.

## Extension types

Extension types are one of the primary motivations for this enhancement in
[DataFusion 48.0.0]. The official Rust implementation of Apache Arrow,
[arrow-rs], already contains support for the [canonical extension types]. This
support includes helper functions such as `try_canonical_extension_type()` in
the earlier example.

For a concrete example of how extension types can be used in DataFusion
functions, there is an [example repository] that demonstrates working with
UUIDs. The UUID extension type specifies that the data are stored as a fixed
size binary of length 16. The DataFusion core functions can generate string
representations of UUIDs that match the version 4 specification. These are
helpful, but a user may wish to do additional work with UUIDs where having them
in the dense binary representation is preferable. Alternatively, the user may
already have data with the binary encoding and wish to extract values such as
the version, timestamp, or string representation.

In the example repository we have created three user defined functions:
`UuidVersion`, `StringToUuid`, and `UuidToString`. Each of these implements
`ScalarUDFImpl` and can be used as follows:

```rust
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = create_context()?;

    // get a DataFrame from the context
    let mut df = ctx.table("t").await?;

    // Create the string UUIDs
    df = df.select(vec![uuid().alias("string_uuid")])?;

    // Convert string UUIDs to canonical extension UUIDs
    let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default());
    df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?;

    // Extract version number from canonical extension UUIDs
    let version = ScalarUDF::new_from_impl(UuidVersion::default());
    df = df.with_column("version", version.call(vec![col("uuid")]))?;

    // Convert back to a string
    let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default());
    df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?;

    df.show().await?;

    Ok(())
}
```
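
The `create_context` helper above is defined in the [example repository]; a
minimal sketch of what such a helper could look like, assuming a single
in-memory table named `t` with one illustrative column, is:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// A minimal sketch: register a small in-memory table named "t" so the
// DataFrame calls above have something to read.
fn create_context() -> Result<SessionContext> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    let ctx = SessionContext::new();
    ctx.register_batch("t", batch)?;
    Ok(ctx)
}
```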

The [example repository] also contains a crate that demonstrates how to expose
these UDFs to [datafusion-python]. This requires version 48.0.0 or later.

[example repository]: https://github.com/timsaucer/datafusion_extension_type_examples
[arrow-rs]: https://github.com/apache/arrow-rs
[datafusion-python]: https://datafusion.apache.org/python/

## Other use cases

The metadata attached to fields can be used to store *any* user data in
key/value pairs. Some of the other use cases that have been identified include:

- Creating output for downstream systems. One user of DataFusion produces
  [data visualizations] that are dependent upon metadata in record batch fields.
  By enabling metadata on the output of user defined functions, we can now
  produce batches that are directly consumable by these systems.
- Describing the relationships between columns of data. You can store data
  about how one column of data relates to another and use it during function
  evaluation. For example, in robotics it is common to use [transforms] to
  describe how to convert from one coordinate system to another. It can be
  convenient to send the function all the columns that contain transform
  information and then allow the function to determine which columns to use
  based on the metadata, as sketched after this list. This allows for
  encapsulation of the transform logic within the user function.
- Storing logical types of the data model. [InfluxDB] uses field metadata to
  specify which columns are used for tags, times, and fields.
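
As a sketch of the column selection idea from the second bullet above, a
function receiving several columns could filter its inputs by metadata rather
than by name (the `transform_kind` key is an illustrative convention, and this
fragment assumes it runs inside `invoke_with_args`):

```rust
// Hypothetical: pick out which of the input columns carry transform data
// by inspecting each input field's metadata rather than its name.
let transform_indices: Vec<usize> = args
    .arg_fields
    .iter()
    .enumerate()
    .filter(|(_, field)| field.metadata().contains_key("transform_kind"))
    .map(|(i, _)| i)
    .collect();
```
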
Based on the experience of the authors, we recommend caution when using
metadata for use cases other than type extension. One issue that can arise is
that as columns are used to compute new fields, some functions may pass the
metadata through while its semantic meaning changes. For example, suppose you
decided to use metadata to store some kind of statistics for the entire stream
of record batches. If you then pass that column through a filter that removes
many rows of data, your statistics metadata may now be invalid, even though it
was passed through the filter.

Similarly, if you use metadata to form relations between one column and another
and the naming of the columns changes at some point in your workflow, then the
metadata may refer to the wrong column of data. This can be mitigated by not
relying on column naming but instead adding additional metadata to all columns
of interest.

[data visualizations]: https://rerun.io/blog/column-chunks
[transforms]: https://wiki.ros.org/tf2
[InfluxDB]: https://docs.influxdata.com/influxdb/v1/concepts/schema_and_data_layout/

## Acknowledgements

We would like to thank [Rerun.io] for sponsoring the development of this work.
[Rerun.io] is building a data visualization system for Physical AI and uses
metadata to specify context about columns in Arrow record batches.

[Rerun.io]: https://rerun.io

## Conclusion

The enhanced metadata handling in [DataFusion 48.0.0] is a significant step
forward in the ability to handle more interesting types of data. Users can
validate that input data matches the intent of the data to be processed,
perform complex operations on binary data because the encoding is understood,
and use metadata to create new and interesting user defined data types.
We can't wait to see what you build with it!

## Get Involved

The DataFusion team is an active and welcoming community, and we would love to
have you join us and help the project.

Here are some ways to get involved:

* Learn more by visiting the [DataFusion] project page.
* Try out the project and provide feedback, file issues, and contribute code.
* Work on a [good first issue].
* Reach out to us via the [communication doc].

[DataFusion]: https://datafusion.apache.org/index.html
[good first issue]: https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html