Contributor

@sdf-jkl sdf-jkl commented Oct 29, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Added a sqllogictest case for flatten() on List(LargeList) input

Added support for array flatten() on List(LargeList(_)) types

Are these changes tested?

The sqllogictest passes, but I still need to implement a test where the offsets cannot be downcast from i64 to i32

Are there any user-facing changes?

Users will be able to use flatten on List(LargeList) types

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 29, 2025
Contributor Author

@sdf-jkl sdf-jkl left a comment

This currently is not a finished solution.

I still need to figure out how to create a test case where the offsets cannot be downcast to i32.

Should it be a sqllogictest test using an array with i32::MAX + 1 values or something else?

Comment on lines 100 to 101
List(field) | LargeList(field) | FixedSizeList(field, _) => {
    List(Arc::clone(field))
Contributor Author

Currently this only supports arrays that can be converted from LargeList to List.

Comment on lines 161 to 184
let (inner_field, inner_offsets, inner_values, _) =
    as_large_list_array(&values)?.clone().into_parts();
// Try to downcast the inner offsets to i32
match downcast_i64_inner_to_i32(&inner_offsets, &offsets) {
    Ok(i32offsets) => {
        let flattened_array = GenericListArray::<i32>::new(
            inner_field,
            i32offsets,
            inner_values,
            nulls,
        );
        Ok(Arc::new(flattened_array) as ArrayRef)
    }
    // If downcast fails we keep the offsets as is
    Err(_) => {
        // Fallback: keep i64 offsets → LargeList<i64>
        let i64offsets = keep_offsets_i64(inner_offsets, offsets);
        let flattened_array = GenericListArray::<i64>::new(
            inner_field,
            i64offsets,
            inner_values,
            nulls,
        );
        Ok(Arc::new(flattened_array) as ArrayRef)
Contributor Author

Here we try to downcast the inner array offsets from i64 to i32; if that fails, we fall back to i64 offsets.
The fallback is not yet supported by return_type.

Contributor

This doesn't make sense to me, as it means the return value of the function might not conform to the stated return_type.

If return_type says the returned array will be a ListArray for a LargeListArray input, then the invoke function must return that type (or an error).

So I think we need to either update the implementation and the return type to return LargeListArray for LargeListArray inputs, or else return an error if we can't cast the offsets correctly.

Contributor Author

I was looking at @Jefffrey's original comment suggesting that we should try to downcast the offsets to i32 and return a ListArray, and otherwise return a LargeListArray if the downcast is not possible.

He also mentioned that upcasting the parent offsets to LargeListArray blindly would be inefficient and undesirable.

I wanted to see if it's possible to infer the possibility of downcasting the inner array offsets within the return_type function. Then return the matching output type.

We could do it by checking if: inner_offsets.last() <= i32::MAX
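The check above can be sketched on a plain slice of offsets. This is only an illustration under stated assumptions: `fits_in_i32` is a hypothetical helper name that does not appear in the PR, and it relies on list offsets being non-decreasing, so the last offset is also the largest. It inspects the offset values themselves, so it can only run once the array data exists.

```rust
/// Hypothetical helper (not part of the PR): returns true when every
/// i64 offset fits in an i32.
///
/// Assumes `offsets` is a valid list offsets slice, i.e. non-decreasing,
/// so checking the last (largest) value is enough.
fn fits_in_i32(offsets: &[i64]) -> bool {
    match offsets.last() {
        Some(&last) => last <= i64::from(i32::MAX),
        // An empty slice has nothing that could overflow.
        None => true,
    }
}
```

With this sketch, `fits_in_i32(&[0, 2, 5])` is true, while a slice whose last offset is `i32::MAX + 1` is rejected.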

Contributor

Yep I did mention that, however I did make a note that:

though this might be tricky considering return_type() wouldn't know this until execution

I think we need to make a choice: either always upcast to large if one of the inner lists is a large list, or just error.

I guess the former is more favourable/robust than the latter.

Contributor Author

I see, if the consensus is to always upcast to LargeListArray I'll proceed with it.
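For reference, the "always upcast" approach can be sketched on plain slices rather than Arrow buffers. This is a simplified model, not the code in this PR: `compose_offsets` is a hypothetical name, and the inputs stand in for the outer i32 offsets and the inner i64 offsets. Each outer offset is an index into the inner offsets, so looking it up yields the flattened offset directly as i64, with no downcast and therefore no failure case tied to i32::MAX.

```rust
/// Hypothetical helper (not part of the PR): builds the offsets of the
/// flattened list by looking up each outer offset in the inner offsets.
///
/// Returns `None` if an outer offset is negative or points past the end
/// of `inner`, which would indicate malformed input.
fn compose_offsets(outer: &[i32], inner: &[i64]) -> Option<Vec<i64>> {
    outer
        .iter()
        .map(|&o| {
            // A negative offset cannot be converted to an index.
            let idx = usize::try_from(o).ok()?;
            inner.get(idx).copied()
        })
        .collect()
}
```

For example, with outer offsets `[0, 2, 3]` (two rows holding two and one inner lists) and inner offsets `[0, 1, 3, 6]`, the flattened offsets are `[0, 3, 6]`.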

}
LargeList(_) => {
-    let (inner_field, inner_offsets, inner_values, nulls) =
+    let (inner_field, inner_offsets, inner_values, _) = // _ instead of nulls?
Contributor Author

Here flattened_array was generated using the inner nulls instead of the outer ones. I figured it should follow the same format and use nulls from the outer array.

Contributor

This is interesting; so this might be a bug in current behaviour? Do you think you can mock up a test where it fails on main without this fix?
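To illustrate why the outer validity is the one that matters, here is a toy model that uses `Option` in place of Arrow's validity buffers. It is a sketch, not DataFusion's implementation: `flatten_rows` and the type aliases are hypothetical, and it assumes a null inner list contributes no elements. The flattened result has one row per outer row, so a row is null exactly when the outer row is null; the inner validity has one entry per inner list and generally a different length.

```rust
/// Toy model (not Arrow's memory layout): `None` stands for a null entry.
type InnerList = Option<Vec<i32>>;
type OuterRow = Option<Vec<InnerList>>;

/// Hypothetical helper: flattens one nesting level.
/// The null-ness of each output row is taken from the outer row only.
fn flatten_rows(rows: Vec<OuterRow>) -> Vec<Option<Vec<i32>>> {
    rows.into_iter()
        .map(|row| {
            // `row.map` keeps a null outer row as a null output row.
            row.map(|inner_lists| {
                inner_lists
                    .into_iter()
                    .flatten() // skip null inner lists (assumption: they hold no elements)
                    .flatten() // concatenate the remaining inner lists
                    .collect::<Vec<i32>>()
            })
        })
        .collect()
}
```

In this model, a row containing `[1]`, a null inner list, and `[2, 3]` flattens to `[1, 2, 3]`, while a null outer row stays null.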

Comment on lines 235 to 236
inner_offsets: OffsetBuffer<O>,
outer_offsets: OffsetBuffer<O>,
Contributor Author

A little naming change because the previous one was confusing.

Comment on lines 259 to 274
// Function for converting LargeList offsets into List offsets
fn downcast_i64_inner_to_i32(
    inner_offsets: &OffsetBuffer<i64>,
    outer_offsets: &OffsetBuffer<i32>,
) -> Result<OffsetBuffer<i32>, ArrowError> {
    let buffer = inner_offsets.clone().into_inner();
    let offsets: Result<Vec<i32>, _> = outer_offsets
        .iter()
        .map(|i| buffer[i.to_usize().unwrap()])
        .map(|i| {
            i32::try_from(i)
                .map_err(|_| ArrowError::CastError(format!("Cannot downcast offset {i}")))
        })
        .collect();
    Ok(OffsetBuffer::new(offsets?.into()))
}
Contributor Author

This function is trying to downcast offsets to i32.

  • On success it returns an OffsetBuffer<i32>
  • If it fails it errors out.

}

// In case the conversion fails we convert the outer offsets into i64
fn keep_offsets_i64(
Contributor Author

This function takes the outer i32 and inner i64 offsets and keeps the inner i64

Contributor

I think this function should be renamed (and docstring adjusted) now? keep_offsets_i64 is a bit confusing to read for me

Contributor

@alamb alamb left a comment

thanks for this @sdf-jkl

@sdf-jkl sdf-jkl requested a review from alamb October 30, 2025 20:03

Comment on lines 7741 to 7744
select flatten(arrow_cast(make_array([1], [2, 3], [null], make_array(4, null, 5)), 'FixedSizeList(4, LargeList(Int64))')),
flatten(arrow_cast(make_array([[1.1], [2.2]], [[3.3], [4.4]]), 'List(LargeList(FixedSizeList(1, Float64)))'));
----
[1, 2, 3, NULL, 4, NULL, 5] [[1.1], [2.2], [3.3], [4.4]]
Contributor

I think we should also add a check for the output type via arrow_typeof() to ensure we are indeed getting a large list.

Contributor Author

sdf-jkl commented Oct 31, 2025

Thanks for your review @Jefffrey, I've addressed the issues you pointed out.

I'll check the potential bug in main

Contributor Author

sdf-jkl commented Oct 31, 2025

I wrote a test on a different branch that exposes the bug. Should I create a separate issue and PR for it?

Contributor Author

sdf-jkl commented Nov 1, 2025

@Jefffrey @alamb Should be good now.

Contributor

Jefffrey commented Nov 2, 2025

Looks like some tests are failing in CI

Contributor Author

sdf-jkl commented Nov 2, 2025

@Jefffrey Fixed.

@sdf-jkl sdf-jkl requested a review from Jefffrey November 2, 2025 21:37
Contributor

@Jefffrey Jefffrey left a comment

LGTM

@alamb alamb added this pull request to the merge queue Nov 3, 2025
Contributor

alamb commented Nov 3, 2025

Thank you @sdf-jkl and @Jefffrey

Merged via the queue into apache:main with commit 0a0ccb1 Nov 3, 2025
28 checks passed
@sdf-jkl sdf-jkl deleted the flatten-listLargeList-support branch November 3, 2025 21:41
Labels

sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Flatten using the wrong validity buffer for LargeList(LargeList)
Support array flatten() on List(LargeList(_)) types

3 participants