17 changes: 16 additions & 1 deletion downstairs/src/extent.rs
@@ -251,7 +251,22 @@ pub fn extent_dir<P: AsRef<Path>>(dir: P, number: ExtentId) -> PathBuf {
* anchored under "dir".
*/
pub fn extent_path<P: AsRef<Path>>(dir: P, number: ExtentId) -> PathBuf {
- extent_dir(dir, number).join(extent_file_name(number, ExtentType::Data))
let e = extent_file_name(number, ExtentType::Data);

// XXX terrible hack: if someone has already provided a full directory tree
Contributor

We need this so we can verify the incoming copy of an extent?

Contributor Author

Yeah, opening an extent takes the extent's root directory and number, then builds the extent file path internally. This is annoying if we want to open a specific raw file!

Since we're building a test image directly from this PR (and probably not merging it), I didn't bother to clean this up further.

// ending in `.copy`, then just append the extent file name. This lets us
// open individual extent files during live-repair.
if dir
.as_ref()
.iter()
.next_back()
.and_then(|s| s.to_str())
.is_some_and(|s| s.ends_with(".copy"))
{
dir.as_ref().join(e)
} else {
extent_dir(dir, number).join(e)
}
}
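
The `.copy` special case above can be tried in isolation. The following is a minimal sketch, not the PR's code: `extent_file_name` and `extent_dir` are hypothetical stand-ins (the real naming scheme and directory layout are not shown in this diff), and extent numbers are plain `u32` values rather than `ExtentId`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the PR's `extent_file_name`; the real naming
/// scheme is not shown in this diff.
fn extent_file_name(number: u32) -> String {
    format!("{number:03X}")
}

/// Hypothetical stand-in for the PR's `extent_dir`; the real directory
/// layout is not shown in this diff.
fn extent_dir(dir: &Path, number: u32) -> PathBuf {
    dir.join(format!("bucket-{}", number / 1000))
}

/// Same branching as the patched `extent_path`: if the last path component
/// ends in `.copy`, treat `dir` as the directory that already holds the file.
fn extent_path(dir: &Path, number: u32) -> PathBuf {
    let file = extent_file_name(number);
    // `Path::iter` yields the components as `OsStr`; `next_back` is the last.
    let is_copy_dir = dir
        .iter()
        .next_back()
        .and_then(|s| s.to_str())
        .is_some_and(|s| s.ends_with(".copy"));
    if is_copy_dir {
        dir.join(file)
    } else {
        extent_dir(dir, number).join(file)
    }
}

fn main() {
    // Normal region root: the file lands under the derived extent directory.
    println!("{}", extent_path(Path::new("/region"), 5).display());
    // Live-repair copy directory: the file name is appended directly.
    println!("{}", extent_path(Path::new("/region/005.copy"), 5).display());
}
```

Because the check looks only at the final path component, any region root whose own name happens to end in `.copy` would also take the shortcut, which is part of why the change is flagged as a hack.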

/**
5 changes: 2 additions & 3 deletions downstairs/src/extent_inner_sqlite.rs
@@ -104,9 +104,8 @@ impl ExtentInner for SqliteInner {
}

fn validate(&self) -> Result<(), CrucibleError> {
- Err(CrucibleError::GenericError(
- "`validate` is not implemented for Sqlite extent".to_owned(),
- ))
// SQLite databases are always perfect and have no problems
Contributor

😭

Ok(())
}

#[cfg(test)]
46 changes: 46 additions & 0 deletions downstairs/src/region.rs
@@ -400,6 +400,30 @@ impl Region {
assert_eq!(self.get_opened_extent(eid).number, eid);
}
assert_eq!(self.def.extent_count() as usize, self.extents.len());
use rayon::prelude::*;
let errors: Vec<_> = self
.extents
.par_iter()
.filter_map(|e| {
let ExtentState::Opened(extent) = e else {
unreachable!("got closed extent");
};
if let Err(err) = extent.validate() {
Some((extent.number, err))
} else {
None
}
})
.collect();
if !errors.is_empty() {
for (number, err) in &errors {
warn!(
self.log,
"validation failed for extent {number}: {err:?}"
);
}
panic!("validation failed");
Contributor

I worry about this panic because we can't recover from it the way we can from the "check during repair" panic. Eventually the downstairs service will be in maintenance and the Upstairs can't do anything to fix this.

Contributor

Given we are "hunting" with this change, and not living with it forever, I think I like that we can't recover. I want things to stop as soon as we have detected a problem.

Looking for downstairs in maintenance (as well as core files) would help determine when we saw a problem.

I worry that perhaps we have a bad extent, but it gets "repaired" before we find it and we miss our window.

}
}
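
The "collect every failure, log each one, then stop" shape of this hunk can be sketched without the surrounding types. This is an illustration under stated assumptions: `Extent` and its `healthy` flag are hypothetical stand-ins, errors are plain `String`s rather than `CrucibleError`, and the iteration is sequential with the standard library where the PR uses rayon's `par_iter`.

```rust
/// Hypothetical stand-in for an opened extent; only what the sketch needs.
struct Extent {
    number: u32,
    healthy: bool,
}

impl Extent {
    /// Placeholder check. The PR's real `validate` inspects on-disk state,
    /// which is not shown in this diff.
    fn validate(&self) -> Result<(), String> {
        if self.healthy {
            Ok(())
        } else {
            Err(format!("extent {} is inconsistent", self.number))
        }
    }
}

/// Validate every extent and return all failures instead of stopping at the
/// first one, so that each bad extent can be reported.
fn validate_all(extents: &[Extent]) -> Vec<(u32, String)> {
    extents
        .iter()
        .filter_map(|e| e.validate().err().map(|err| (e.number, err)))
        .collect()
}

fn main() {
    let extents = vec![
        Extent { number: 0, healthy: true },
        Extent { number: 1, healthy: false },
    ];
    let errors = validate_all(&extents);
    for (number, err) in &errors {
        eprintln!("validation failed for extent {number}: {err}");
    }
    // The PR panics at this point; this sketch only reports a count.
    if !errors.is_empty() {
        eprintln!("{} extent(s) failed validation", errors.len());
    }
}
```

Returning the list leaves the stop-or-continue decision to the caller, which is the trade-off the two reviewers discuss above: panicking halts immediately, while a returned error would leave room for recovery.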

/// Walk the list of extents and close each one.
@@ -682,6 +706,28 @@ impl Region {
);
}

// Validate the extent that we just received, before copying it over
{
let new_extent = match Extent::open(
&copy_dir,
&self.def(),
eid,
true, // read-only
&self.log.clone(),
) {
Ok(e) => e,
Err(e) => {
panic!(
"Failed to open live-repair extent {eid} in \
{copy_dir:?}: {e:?}"
);
}
};
if let Err(e) = new_extent.validate() {
panic!("Failed to validate live-repair extent {eid}: {e:?}");
Contributor

👍

Contributor

I believe that when we fail here, it will leave behind a copy_dir.
That should be handled "properly" when the downstairs restarts, as it should discard it and make a new one. We would lose the "bad" file, but the downstairs log should tell us what we need to know.

}
}
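
The ordering used here (validate the received copy first, move it into place only afterwards) can be sketched with the standard library alone. Everything below is an assumption for illustration: `validate_staged` is a hypothetical check that only rejects empty files, the sketch handles a single file rather than a copy directory, and it returns an error where the PR panics.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical validation: a staged file counts as valid if it is non-empty.
/// The PR's real check is `Extent::validate`, whose internals are not shown.
fn validate_staged(path: &Path) -> io::Result<()> {
    if fs::metadata(path)?.len() == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "staged file is empty",
        ));
    }
    Ok(())
}

/// Validate `staged` first; rename it over `dest` only if validation passes.
/// On failure `dest` is left untouched and `staged` stays where it is.
fn install_if_valid(staged: &Path, dest: &Path) -> io::Result<()> {
    validate_staged(staged)?;
    fs::rename(staged, dest)
}

fn main() -> io::Result<()> {
    // Scratch directory under the system temp dir, unique per process.
    let dir = std::env::temp_dir()
        .join(format!("staging-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let staged = dir.join("000.copy");
    let dest = dir.join("000");
    fs::write(&dest, b"old")?;
    fs::write(&staged, b"new")?;

    install_if_valid(&staged, &dest)?;
    assert_eq!(fs::read(&dest)?, b"new");

    fs::remove_dir_all(&dir)
}
```

As in the review comment above, a failed validation leaves the staged copy behind, so cleanup has to happen elsewhere (in the PR's case, when the downstairs restarts).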

// After we have all files: move the repair dir.
info!(
self.log,