Skip to content

Conversation

@wjones127
Copy link
Contributor

@wjones127 wjones127 commented Jul 12, 2024

BREAKING CHANGE:

  • Removed access to MultipartId, cleanup_partial_writes. Uploads are now automatically cleaned up.
    • FragmentWriteProgress no longer has any multipart_id argument.
  • CommitHandler and ManifestWriter APIs now take a Lance ObjectStore rather than object_store::ObjectStore.

New features:

  • For S3 and Azure (these features were already implemented for GCS):
    • Max object size is now 2.5 TB, up from 50GB.
    • Retries are now performed for Connection reset network errors.
    • Reduced number of IOPS for writing small objects (<5MB)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this was ported from the now deleted gcs_wrapper.rs

Comment on lines +277 to 290
impl Drop for ObjectWriter {
fn drop(&mut self) {
// If there is a multipart upload started but not finished, we should abort it.
if matches!(self.state, UploadState::InProgress { .. }) {
// Take ownership of the state.
let state = std::mem::replace(&mut self.state, UploadState::Done);
if let UploadState::InProgress { mut upload, .. } = state {
tokio::task::spawn(async move {
let _ = upload.abort().await;
});
}
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what ensures partial uploads are cleaned up now. MultipartId is no longer exposed in the object-store API.

@wjones127 wjones127 force-pushed the feat/upgrade-datafusion branch from 8d805cf to 66dec81 Compare July 13, 2024 03:12
@codecov-commenter
Copy link

codecov-commenter commented Jul 13, 2024

Codecov Report

Attention: Patch coverage is 53.81679% with 242 lines in your changes missing coverage. Please review.

Project coverage is 80.22%. Comparing base (8cab512) to head (895d2ee).
Report is 1 commits behind head on main.

Files Patch % Lines
rust/lance-io/src/object_writer.rs 67.93% 71 Missing and 13 partials ⚠️
rust/lance-datagen/src/generator.rs 5.12% 37 Missing ⚠️
rust/lance-index/src/scalar/btree.rs 4.00% 23 Missing and 1 partial ⚠️
rust/lance/src/io/exec/planner.rs 34.48% 18 Missing and 1 partial ⚠️
rust/lance/src/io/exec/scalar_index.rs 15.38% 11 Missing ⚠️
rust/lance/src/io/exec/knn.rs 43.75% 9 Missing ⚠️
rust/lance-index/src/scalar/expression.rs 0.00% 6 Missing ⚠️
rust/lance-io/src/object_store.rs 72.22% 3 Missing and 2 partials ⚠️
rust/lance/src/io/exec/utils.rs 0.00% 5 Missing ⚠️
rust/lance-datafusion/src/exec.rs 0.00% 4 Missing ⚠️
... and 15 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2594      +/-   ##
==========================================
+ Coverage   79.97%   80.22%   +0.24%     
==========================================
  Files         212      211       -1     
  Lines       61556    61418     -138     
  Branches    61556    61418     -138     
==========================================
+ Hits        49228    49270      +42     
+ Misses       9402     9217     -185     
- Partials     2926     2931       +5     
Flag Coverage Δ
unittests 80.22% <53.81%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wjones127 wjones127 marked this pull request as ready for review July 13, 2024 04:25
// Native types, which don't support Distribution.
// IntervalUnit::DayTime => rand::<IntervalDayTimeType>(),
// IntervalUnit::MonthDayNano => rand::<IntervalMonthDayNanoType>(),
IntervalUnit::DayTime | IntervalUnit::MonthDayNano => todo!(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@wjones127 wjones127 requested a review from BubbleCal July 15, 2024 20:22
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite exhausting. Thanks for the cleanup

Comment on lines -1666 to +1720
DataType::Interval(unit) => match unit {
IntervalUnit::DayTime => rand::<IntervalDayTimeType>(),
IntervalUnit::MonthDayNano => rand::<IntervalMonthDayNanoType>(),
IntervalUnit::YearMonth => rand::<IntervalYearMonthType>(),
},
DataType::Interval(unit) => rand_interval(*unit),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they changed the underlying native type to a struct rather than i128/i64. Thus Distribution was no longer implemented.

apache/arrow-rs#5769

Comment on lines -406 to -410
pub async fn cleanup_partial_writes(
store: &ObjectStore,
base_path: &Path,
objects: impl IntoIterator<Item = (&Path, &String)>,
) -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why get rid of this? Was it no longer working? Or no longer needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object_store no longer exposes the multipart ids. So now users can't collect them and pass them into this function.

Instead, we automatically cleanup on Drop for ObjectWriter. In case of crashes, users should rely on setting a lifecycle rule to delete old incomplete writes. This is already recommended in our docs.

@wjones127 wjones127 merged commit 30af1d8 into lance-format:main Jul 16, 2024
@wjones127 wjones127 deleted the feat/upgrade-datafusion branch July 16, 2024 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants