Description
Is your feature request related to a problem or challenge?
We're working on running some pipelines that used to run on Spark through DataFusion. One gap we've noticed is comparing nested types. [Spark allows](https://github.com/apache/spark/blame/d9394eee5ebbeb695baaec6122da2ed970842dfd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L1025) comparing (==, !=, <, >, <=, >=, ..) columns of structs and lists, while in DataFusion those comparisons seem to fail with an error:
For structs, from our internal testing:
ArrowError(InvalidArgumentError("Invalid comparison operation: Struct([Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"b\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]) <= Struct([Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"b\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }])"), None)
For lists, this is shown in DataFusion's tests:
query error DataFusion error: Arrow error: Invalid argument error: Invalid comparison operation: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) == List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)
Maybe this would need to be addressed in Arrow directly, since the error comes from https://github.com/apache/arrow-rs/blob/087f34b70e97ee85e1a54b3c45c5ed814f500b0a/arrow-ord/src/cmp.rs#L219?
Describe the solution you'd like
Binary comparison predicates should be allowed for structs and lists, preferably following the same semantics as Spark. As far as I can tell, Spark mostly does a depth-first comparison over all the fields: https://github.com/apache/spark/blob/d9394eee5ebbeb695baaec6122da2ed970842dfd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala#L285
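To make the intended semantics concrete, here is a rough sketch of a depth-first, lexicographic comparison over nested values. It is only an illustration: the `Value` enum and the `compare`, `compare_seq`, and `lt_eq` functions are hypothetical names, not DataFusion or arrow-rs APIs, and the null handling (nulls sort first, two nulls compare equal) is an assumption that would need to be checked against what Spark actually does.

```rust
use std::cmp::Ordering;

/// Hypothetical value model for illustration only; not an Arrow or DataFusion type.
#[derive(Debug, Clone)]
enum Value {
    Null,
    Int(i64),
    Struct(Vec<Value>), // field values in schema order
    List(Vec<Value>),
}

/// Depth-first comparison. Returns `None` when the two values do not have
/// comparable shapes (e.g. an Int against a List, or structs of different arity).
fn compare(a: &Value, b: &Value) -> Option<Ordering> {
    use Value::*;
    match (a, b) {
        // Assumption: nulls sort first and two nulls compare equal.
        (Null, Null) => Some(Ordering::Equal),
        (Null, _) => Some(Ordering::Less),
        (_, Null) => Some(Ordering::Greater),
        (Int(x), Int(y)) => Some(x.cmp(y)),
        // Structs: compare field by field, in declaration order.
        (Struct(xs), Struct(ys)) if xs.len() == ys.len() => compare_seq(xs, ys),
        // Lists: compare element by element; a shorter prefix sorts first.
        (List(xs), List(ys)) => compare_seq(xs, ys),
        _ => None,
    }
}

/// Lexicographic comparison: the first non-equal pair decides,
/// and length breaks ties when one sequence is a prefix of the other.
fn compare_seq(xs: &[Value], ys: &[Value]) -> Option<Ordering> {
    for (x, y) in xs.iter().zip(ys) {
        match compare(x, y)? {
            Ordering::Equal => continue,
            other => return Some(other),
        }
    }
    Some(xs.len().cmp(&ys.len()))
}

/// Example of deriving a binary predicate (`<=`) from the ordering.
fn lt_eq(a: &Value, b: &Value) -> Option<bool> {
    compare(a, b).map(|o| o != Ordering::Greater)
}
```

With this sketch, a struct `{a: 1, b: 2}` compares as less than `{a: 1, b: 3}` because the first field ties and the second decides, and the list `[1, 2]` compares as less than `[1, 2, 0]` because it is a strict prefix. The other predicates (==, !=, <, >, >=) would fall out of the same ordering in the same way as `lt_eq`.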
Describe alternatives you've considered
No response
Additional context
Related to #2326