Description
Is your feature request related to a problem or challenge?
I've noticed several open issues about adding extension types to DataFusion:
- Spatial data support #7859
- Any plan to support JSON or JSONB? #7845
- Add support for Arrow Extension Types arrow-rs#4472
An interface for adding extension types to DataFusion would be valuable: it would let applications built on DataFusion incorporate business-specific data types easily.
I hope this proposal helps move the UDT feature forward.
Describe the solution you'd like
User-Defined Types (UDT)
UDT stands for User-Defined Type. It is a feature in database systems that allows users to define their own custom data types based on existing data types provided by the database. This feature enables users to create data structures tailored to their specific needs, providing a higher level of abstraction and organization for complex data.
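To make the definition concrete, a UDT built on existing types could be modeled roughly as below. This is only a sketch: the names `CreateType`, `Representation`, and `Member` are hypothetical and do not exist in DataFusion, and data types are kept as plain strings for brevity. The two variants correspond to aliasing a predefined type and composing named members.

```rust
use std::fmt;

/// One named attribute of a composite UDT (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Member {
    pub attribute_name: String,
    pub data_type: String,
}

/// How a UDT is represented: an existing type, or a list of members.
#[derive(Debug, Clone, PartialEq)]
pub enum Representation {
    Predefined(String),
    Members(Vec<Member>),
}

/// A user-defined type definition (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct CreateType {
    pub name: String,
    pub representation: Representation,
}

impl fmt::Display for CreateType {
    /// Renders the definition back into a `CREATE TYPE` statement.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "CREATE TYPE {} AS ", self.name)?;
        match &self.representation {
            Representation::Predefined(ty) => write!(f, "{}", ty),
            Representation::Members(members) => {
                let rendered: Vec<String> = members
                    .iter()
                    .map(|m| format!("{} {}", m.attribute_name, m.data_type))
                    .collect();
                write!(f, "({})", rendered.join(", "))
            }
        }
    }
}
```

Keeping the representation as an enum means later planning steps can match on it to decide whether a UDT maps to a single physical type or to a struct-like layout.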
Syntax
<user-defined type definition> ::=
    CREATE TYPE <user-defined type name> AS <representation>
<representation> ::=
    <predefined type>
  | <member list>
<member list> ::=
    <left paren> <member> [ { <comma> <member> }... ] <right paren>
<member> ::=
    <attribute name> <data type>
<attribute name> ::=
    <identifier>
Behaviors
Behaviors of Data Types
- Type matching assessment.
- Computation of the common super type for two types.
Behaviors of Data
- Inference of data type from literal value.
- Casting literal value to other type.
- Casting variable value to other type.
- Import and export of data. (Sensitive to logical data types)
- Operations like data comparison, etc.
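The two type-level behaviors above (type matching and common super type) might look like the following. This is a sketch under assumptions: `PhysicalType` is a stand-in for a few arrow types, `LogicalType` is a hypothetical wrapper, and the rule that two differently tagged types fall back to the untagged physical super type is my own assumption, not something the proposal fixes.

```rust
/// Stand-in for a few arrow physical types (hypothetical, not arrow-rs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Int32,
    Int64,
    Utf8,
}

/// A logical type: a physical type plus an optional UDT name.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogicalType {
    pub udt_name: Option<String>,
    pub physical: PhysicalType,
}

impl LogicalType {
    /// Type matching: same physical type and same UDT tag.
    pub fn matches(&self, other: &LogicalType) -> bool {
        self == other
    }
}

/// Toy coercion table for physical types.
fn physical_super_type(a: PhysicalType, b: PhysicalType) -> Option<PhysicalType> {
    use PhysicalType::*;
    match (a, b) {
        _ if a == b => Some(a),
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        _ => None,
    }
}

/// Common super type of two logical types.
pub fn common_super_type(a: &LogicalType, b: &LogicalType) -> Option<LogicalType> {
    if a.matches(b) {
        return Some(a.clone());
    }
    // Assumption: differing tags fall back to the untagged physical super type.
    physical_super_type(a.physical, b.physical).map(|physical| LogicalType {
        udt_name: None,
        physical,
    })
}
```

Whether the result should keep a UDT tag when only one side is tagged is an open design choice; dropping it is simply the most conservative option for a sketch.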
Role of Data Types in the SQL Lifecycle
SQL Statement String -> AST
None
AST -> Logical Plan
- Create Type
  - Parsing, constructing, and storing the description of the UDT.
- Create Table (Using Type)
  - Parsing data types
    - Built-in types
    - User-defined types
  - Constructing DFField (using the metadata field to tag extended types), storing metadata.
- Query
  - How to construct extended data types?
    - Use the STRUCT function.
    - Use UDF.
  - How to perform relational (comparison), logical, and arithmetic operations with other data types? How to perform type conversion?
    - Constant to UDT
      - Use arrow conversion rules.
    - Variable to UDT
      - Check whether the cast is allowed under arrow rules, and add a cast expression as needed.
    - UDT to other data types
      - Check whether the cast is allowed under arrow rules, and add a cast expression as needed.
      e.g. casting arbitrary binary to UUID (DataType::FixedSizeBinary(16)): if the data layout is the same but the content format differs, the conversion is not possible. As I understand it, though, a UDT concerns only the data type, not the data content, so this is not a problem.
  - Hashing, sorting?
    - Use arrow DataType.
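The Create Table step above (resolve a type name, then tag the field) could be sketched as follows. Everything here is illustrative: `DataType` and `Field` are small stand-ins rather than the arrow or DataFusion types, `UdtRegistry`, `builtin_type`, and `resolve_field` are hypothetical names, and using the UDT name as the value of `ARROW:extension:name` is an assumption.

```rust
use std::collections::HashMap;

/// Stand-in for the arrow `DataType` (hypothetical subset).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Int64,
    Utf8,
}

/// Stand-in for a planner field carrying metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub metadata: HashMap<String, String>,
}

/// Registered UDTs: type name -> physical type.
pub type UdtRegistry = HashMap<String, DataType>;

/// Resolves a built-in SQL type name (toy subset).
fn builtin_type(sql_name: &str) -> Option<DataType> {
    match sql_name.to_ascii_uppercase().as_str() {
        "INTEGER" | "INT" => Some(DataType::Int32),
        "BIGINT" => Some(DataType::Int64),
        "STRING" | "VARCHAR" => Some(DataType::Utf8),
        _ => None,
    }
}

/// Built-in types win; otherwise look up a UDT and tag the field.
pub fn resolve_field(column: &str, sql_type: &str, udts: &UdtRegistry) -> Result<Field, String> {
    let mut metadata = HashMap::new();
    let data_type = if let Some(dt) = builtin_type(sql_type) {
        dt
    } else if let Some(dt) = udts.get(sql_type) {
        // Assumption: the UDT name doubles as the extension name.
        metadata.insert("ARROW:extension:name".to_string(), sql_type.to_string());
        dt.clone()
    } else {
        return Err(format!("unknown data type: {}", sql_type));
    };
    Ok(Field { name: column.to_string(), data_type, metadata })
}
```

Because the field keeps the plain physical type, casts, comparisons, hashing, and sorting can continue to follow arrow's rules, which matches the approach described in the list.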
Logical Plan -> Execution Plan
None
Execution Plan -> ResultSet
- Cast
  - Execute according to arrow DataType's cast logic.
- Comparison, operations, etc.
  - Execute according to arrow DataType's logic.
- TableScan/TableWrite
  - Identify extended types through Field metadata, thus performing special serialization or deserialization.
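For the TableScan/TableWrite case, the dispatch on Field metadata might look like this. It is a minimal sketch: `Codec` and `codec_for` are hypothetical names, and the only thing taken from Arrow is the `ARROW:extension:name` metadata key.

```rust
use std::collections::HashMap;

/// Metadata key Arrow uses to name an extension type.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Which (de)serialization path a column should take (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Codec<'a> {
    /// No tag: handle the column by its arrow DataType alone.
    Plain,
    /// Tagged: use the special path registered for this extension name.
    Extension(&'a str),
}

/// Picks the codec from a field's metadata map.
pub fn codec_for(field_metadata: &HashMap<String, String>) -> Codec<'_> {
    match field_metadata.get(EXTENSION_NAME_KEY) {
        Some(name) => Codec::Extension(name.as_str()),
        None => Codec::Plain,
    }
}
```

Only the scan and write operators need this lookup; cast and comparison keep using the physical type, as stated above.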
Core Structures
use std::borrow::Cow;
use std::collections::HashMap;
use std::sync::Arc;

use arrow::datatypes::DataType;
use datafusion_common::Result;

/// UDT signature:
/// <udt_name>[ (<param>[ {,<param>}... ]) ]
pub struct TypeSignature<'a> {
    name: Cow<'a, str>,
    params: Vec<Cow<'a, str>>,
}

/// UDT entity
pub struct UserDefinedType {
    // Owned signature, so the entity does not borrow from the SQL text.
    signature: TypeSignature<'static>,
    physical_type: DataType,
}

impl UserDefinedType {
    /// Physical data type
    pub fn arrow_type(&self) -> DataType;
    /// Metadata used to tag extended data types
    pub fn metadata(&self) -> HashMap<String, String>;
}

pub trait ContextProvider {
    /// Get UDT description by signature
    fn udt(&self, type_signature: TypeSignature) -> Result<Arc<UserDefinedType>>;
    ......
}
Examples
Create UDT
CREATE TYPE user_id_t AS BIGINT;
CREATE TYPE email_t AS String;
CREATE TYPE person_t AS (
    user_id user_id_t,
    first_name String,
    last_name String,
    age INTEGER,
    email email_t);
DROP TYPE person_t;
DROP TYPE email_t;
DROP TYPE user_id_t;
geoarrow
https://github.com/geoarrow/geoarrow/blob/main/extension-types.md
Point
type_signature: Geometry(Point)
arrow_type: DataType::FixedSizeList(xy, 2)
metadata: { "ARROW:extension:name": "geoarrow.point" }
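Putting the Core Structures and the Point description together, a self-contained sketch could look like the following. Several parts are assumptions rather than part of the proposal: `DataType` is a stand-in enum (the real one comes from arrow), the list element is assumed to be `Float64`, the `extension_name` field and `InMemoryTypes` registry are hypothetical additions, and errors are plain strings instead of DataFusion's error type.

```rust
use std::borrow::Cow;
use std::collections::HashMap;
use std::fmt;
use std::sync::Arc;

/// Stand-in for the arrow `DataType` (only what this example needs).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Float64,
    FixedSizeList(Box<DataType>, i32),
}

/// `<udt_name>[ (<param>[ {,<param>}... ]) ]`
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TypeSignature<'a> {
    name: Cow<'a, str>,
    params: Vec<Cow<'a, str>>,
}

impl<'a> TypeSignature<'a> {
    pub fn new(name: &'a str, params: &[&'a str]) -> Self {
        Self {
            name: Cow::Borrowed(name),
            params: params.iter().map(|p| Cow::Borrowed(*p)).collect(),
        }
    }

    /// Detaches the signature from the borrowed input.
    pub fn into_owned(self) -> TypeSignature<'static> {
        TypeSignature {
            name: Cow::Owned(self.name.into_owned()),
            params: self.params.into_iter().map(|p| Cow::Owned(p.into_owned())).collect(),
        }
    }
}

impl fmt::Display for TypeSignature<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.name)?;
        if !self.params.is_empty() {
            write!(f, "({})", self.params.join(","))?;
        }
        Ok(())
    }
}

#[derive(Debug)]
pub struct UserDefinedType {
    signature: TypeSignature<'static>,
    physical_type: DataType,
    /// Assumption: the extension name is stored explicitly.
    extension_name: String,
}

impl UserDefinedType {
    pub fn new(signature: TypeSignature<'_>, physical_type: DataType, extension_name: &str) -> Self {
        Self {
            signature: signature.into_owned(),
            physical_type,
            extension_name: extension_name.to_string(),
        }
    }

    /// Physical data type.
    pub fn arrow_type(&self) -> DataType {
        self.physical_type.clone()
    }

    /// Metadata used to tag extended data types.
    pub fn metadata(&self) -> HashMap<String, String> {
        HashMap::from([("ARROW:extension:name".to_string(), self.extension_name.clone())])
    }
}

/// Hypothetical in-memory lookup playing the role of `ContextProvider::udt`.
#[derive(Default)]
pub struct InMemoryTypes {
    types: HashMap<TypeSignature<'static>, Arc<UserDefinedType>>,
}

impl InMemoryTypes {
    pub fn register(&mut self, udt: UserDefinedType) {
        self.types.insert(udt.signature.clone(), Arc::new(udt));
    }

    pub fn udt(&self, type_signature: TypeSignature<'_>) -> Result<Arc<UserDefinedType>, String> {
        let key = type_signature.into_owned();
        self.types.get(&key).cloned().ok_or_else(|| format!("unknown UDT: {}", key))
    }
}

/// The geoarrow Point described above, expressed as a UDT.
pub fn geoarrow_point() -> UserDefinedType {
    UserDefinedType::new(
        TypeSignature::new("Geometry", &["Point"]),
        DataType::FixedSizeList(Box::new(DataType::Float64), 2),
        "geoarrow.point",
    )
}
```

Registering `geoarrow_point()` and then calling `udt` with `TypeSignature::new("Geometry", &["Point"])` returns the entry, while an unregistered signature such as `Geometry(Polygon)` yields an error.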
Questions
- Are UDFs sensitive to extended types? For example, extended-type data may be encoded as binary, with the type tag present only in Field metadata, which cannot be obtained at UDF runtime.
Describe alternatives you've considered
No response
Additional context
@alamb I am particularly eager to receive your feedback or suggestions on this proposal. Additionally, I highly encourage individuals who are familiar with or interested in this feature to contribute their improvement ideas.