@stuartcarnie
Contributor

Which issue does this PR close?

N/A

Rationale for this change

Adds the ability for the SortMergeJoin physical node to join on binary types:

  • Binary,
  • FixedSizeBinary
  • BinaryView
  • LargeBinary

What changes are included in this PR?

  • An update to the comparator functions to support the listed binary types
  • Tests to verify binary types are supported via the ON clause
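
As a rough illustration of what a comparator for binary join keys has to do, here is a minimal std-only sketch. It is not the code from this PR: `BinaryKey`, `compare_binary_keys`, and the `nulls_first` flag are hypothetical names chosen for the example, and the real comparator operates on Arrow arrays rather than on `Option<&[u8]>`. The sketch relies only on the fact that Rust byte slices compare lexicographically.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for one nullable binary join key.
/// (Assumption for illustration; not a DataFusion or Arrow type.)
type BinaryKey<'a> = Option<&'a [u8]>;

/// Compare two nullable binary keys.
/// Non-null values compare byte by byte (lexicographically), so a key that
/// is a strict prefix of another sorts before it.
/// `nulls_first` is an assumed knob controlling where nulls sort.
fn compare_binary_keys(left: BinaryKey<'_>, right: BinaryKey<'_>, nulls_first: bool) -> Ordering {
    match (left, right) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        // `[u8]` implements `Ord` with lexicographic ordering.
        (Some(l), Some(r)) => l.cmp(r),
    }
}
```

A merge join advances whichever side compares `Less` and emits matches when the result is `Equal`, which is why every key type used in the ON clause needs an ordering like this one.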

Are these changes tested?

Are there any user-facing changes?

The documentation does not list the subset of types supported by the ON clause of SortMergeJoin.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 5, 2025
Contributor

@jonathanc-n jonathanc-n left a comment

Thank you @stuartcarnie, this looks good to me. I'm surprised binary data isn't part of the join fuzz testing; this could be put in a follow-up issue.

@stuartcarnie
Contributor Author

I'm surprised binary data isn't part of the join fuzz testing; this could be put in a follow-up issue.

Sounds good. Where is the fuzz testing? I was looking for a place to write some test SQL to verify this at a higher level via some integration tests.

}

#[tokio::test]
async fn join_fixed_size_binary() -> Result<()> {
Contributor

Maybe we should include a LargeBinary test as well? 🤔 I noticed the number of unit tests is getting out of hand; maybe I'll look into splitting the tests across the other SMJ files.

Contributor

That would be great if you could do so

@jonathanc-n
Contributor

jonathanc-n commented Sep 5, 2025

I'm surprised binary data isn't part of the join fuzz testing; this could be put in a follow-up issue.

Sounds good. Where is the fuzz testing? I was looking for a place to write some test SQL to verify this at a higher level via some integration tests.

The fuzz testing is in datafusion/core/tests/fuzz_cases/join_fuzz.rs; I forget whether it contains the data generation as well.

Integration tests might be overkill, but it would be nice to confirm that the binary values work as we would expect. You could write those in the .slt tests we have.
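
For readers unfamiliar with the idea, join fuzz testing can be sketched roughly as follows: generate random inputs, run two independent join implementations, and check that they return the same rows. The code below is a std-only illustration with binary keys, not the contents of join_fuzz.rs. Every name in it (`Lcg`, `Row`, `random_rows`, `nested_loop_join`, `sort_merge_join`, `joins_agree`) is hypothetical, and the nested loop join stands in for whatever reference implementation the real fuzz test compares against.

```rust
use std::cmp::Ordering;

/// One input row: (binary join key, row id). Hypothetical shape for this sketch.
type Row = (Vec<u8>, usize);

/// Tiny deterministic generator (an LCG) so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Generate `n` rows with 1- or 2-byte keys over a 3-symbol alphabet,
/// which keeps the key space small enough to produce many duplicates.
fn random_rows(rng: &mut Lcg, n: usize) -> Vec<Row> {
    (0..n)
        .map(|id| {
            let len = 1 + (rng.next_u64() % 2) as usize;
            let key: Vec<u8> = (0..len).map(|_| (rng.next_u64() % 3) as u8).collect();
            (key, id)
        })
        .collect()
}

/// Reference implementation: compare every left row with every right row.
fn nested_loop_join(left: &[Row], right: &[Row]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for (lk, li) in left {
        for (rk, ri) in right {
            if lk == rk {
                out.push((*li, *ri));
            }
        }
    }
    out.sort();
    out
}

/// Inner sort-merge join: sort both sides by key, then walk them in step.
fn sort_merge_join(left: &[Row], right: &[Row]) -> Vec<(usize, usize)> {
    let (mut l, mut r) = (left.to_vec(), right.to_vec());
    l.sort();
    r.sort();
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < l.len() && j < r.len() {
        match l[i].0.cmp(&r[j].0) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                // Find the run of rows sharing this key on each side.
                let key = &l[i].0;
                let i_end = i + l[i..].iter().take_while(|row| row.0 == *key).count();
                let j_end = j + r[j..].iter().take_while(|row| row.0 == *key).count();
                // Emit the cross product of the two runs.
                for (_, li) in &l[i..i_end] {
                    for (_, ri) in &r[j..j_end] {
                        out.push((*li, *ri));
                    }
                }
                i = i_end;
                j = j_end;
            }
        }
    }
    out.sort();
    out
}

/// Run `rounds` randomized comparisons; true if both joins always agree.
fn joins_agree(seed: u64, rounds: usize) -> bool {
    let mut rng = Lcg(seed);
    (0..rounds).all(|_| {
        let left = random_rows(&mut rng, 20);
        let right = random_rows(&mut rng, 20);
        sort_merge_join(&left, &right) == nested_loop_join(&left, &right)
    })
}
```

Calling `joins_agree(42, 100)` would run one hundred randomized comparisons from a fixed seed, so any disagreement is reproducible. Sorting both outputs before comparing makes the check independent of the order in which each join emits rows.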

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 5, 2025
Contributor

@alamb alamb left a comment

Thank you @stuartcarnie and @jonathanc-n

I added some additional sqllogictests as @jonathanc-n suggested and verified they fail without the code changes in this PR

Details

Completed 1 test files in 0 seconds
External error: 4 errors in file /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/sort_merge_join.slt

1. query failed: DataFusion error: This feature is not implemented: Unsupported data type in sort merge join comparator: Binary
[SQL] with t1 as (select arrow_cast(x, 'Binary') as x, id1 from t1),
     t2 as (select arrow_cast(y, 'Binary') as y, id2 from t2)
select * from t1 join t2 on t1.x = t2.y order by id1, id2
at /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/sort_merge_join.slt:850


2. query failed: DataFusion error: This feature is not implemented: Unsupported data type in sort merge join comparator: LargeBinary
[SQL] with t1 as (select arrow_cast(x, 'LargeBinary') as x, id1 from t1),
     t2 as (select arrow_cast(y, 'LargeBinary') as y, id2 from t2)
select * from t1 join t2 on t1.x = t2.y order by id1, id2
at /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/sort_merge_join.slt:859


3. query failed: DataFusion error: This feature is not implemented: Unsupported data type in sort merge join comparator: BinaryView
[SQL] with t1 as (select arrow_cast(x, 'BinaryView') as x, id1 from t1),
     t2 as (select arrow_cast(y, 'BinaryView') as y, id2 from t2)
select * from t1 join t2 on t1.x = t2.y order by id1, id2
at /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/sort_merge_join.slt:868


4. query failed: DataFusion error: This feature is not implemented: Unsupported data type in sort merge join comparator: FixedSizeBinary(2)
[SQL] with t1 as (select arrow_cast(arrow_cast(x, 'Binary'), 'FixedSizeBinary(2)') as x, id1 from t1),
     t2 as (select arrow_cast(arrow_cast(y, 'Binary'), 'FixedSizeBinary(2)') as y, id2 from t2)
select * from t1 join t2 on t1.x = t2.y order by id1, id2
at /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/sort_merge_join.slt:877

@alamb
Contributor

alamb commented Sep 5, 2025

I'm surprised binary data isn't part of the join fuzz testing; this could be put in a follow-up issue.

Sounds good. Where is the fuzz testing? I was looking for a place to write some test SQL to verify this at a higher level via some integration tests.

I filed a ticket to track adding support to fuzz testing:

@stuartcarnie
Contributor Author

I added some additional sqllogictests as @jonathanc-n suggested and verified they fail without the code changes in this PR

That's great, thanks @alamb!

Contributor

@comphead comphead left a comment

Thanks @stuartcarnie looks like a nice improvement 💪

@comphead comphead merged commit 7b65c5b into apache:main Sep 6, 2025
28 checks passed
alamb added a commit to alamb/datafusion that referenced this pull request Sep 8, 2025
…he#17431)

* feat: Support binary data types for `SortMergeJoin` `on` clause

* Add sql level tests for merge join on binary keys

---------

Co-authored-by: Andrew Lamb <[email protected]>
@stuartcarnie stuartcarnie deleted the smj_binary_types branch September 8, 2025 22:29
crepererum pushed a commit to influxdata/arrow-datafusion that referenced this pull request Sep 9, 2025
erratic-pattern pushed a commit to influxdata/arrow-datafusion that referenced this pull request Oct 6, 2025
erratic-pattern pushed a commit to influxdata/arrow-datafusion that referenced this pull request Oct 21, 2025
Labels

physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

4 participants