-
Notifications
You must be signed in to change notification settings - Fork 26
Check hashes during repair and upon startup #1791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
4f84cf7
969f538
4d6d992
d32e0d4
b7a514d
0327ad5
df7bc52
253483d
1c43cd8
df77cda
61913aa
8a3d258
5903e65
44acc00
9063dde
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -104,9 +104,8 @@ impl ExtentInner for SqliteInner { | |
| } | ||
|
|
||
| fn validate(&self) -> Result<(), CrucibleError> { | ||
| Err(CrucibleError::GenericError( | ||
| "`validate` is not implemented for Sqlite extent".to_owned(), | ||
| )) | ||
| // SQLite databases are always perfect and have no problems | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 😭 |
||
| Ok(()) | ||
| } | ||
|
|
||
| #[cfg(test)] | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -400,6 +400,30 @@ impl Region { | |
| assert_eq!(self.get_opened_extent(eid).number, eid); | ||
| } | ||
| assert_eq!(self.def.extent_count() as usize, self.extents.len()); | ||
| use rayon::prelude::*; | ||
| let errors: Vec<_> = self | ||
| .extents | ||
| .par_iter() | ||
| .filter_map(|e| { | ||
| let ExtentState::Opened(extent) = e else { | ||
| unreachable!("got closed extent"); | ||
| }; | ||
| if let Err(err) = extent.validate() { | ||
| Some((extent.number, err)) | ||
| } else { | ||
| None | ||
| } | ||
| }) | ||
| .collect(); | ||
| if !errors.is_empty() { | ||
| for (number, err) in &errors { | ||
| warn!( | ||
| self.log, | ||
| "validation falied for extent {number}: {err:?}" | ||
| ); | ||
| } | ||
| panic!("validation failed"); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I worry about this panic because we can't recover from it like we can the "check during repair" panic. Eventually the downstairs service will be in maintenance and the Upstairs can't do anything to fix this.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given we are "hunting" with this change, and not living with it forever, I think I like that we can't recover. I want things to stop as soon as we have detected a problem. Looking for downstairs in maintenance (as well as core files) would help determine when we saw a problem. I worry that perhaps we have a bad extent, but it gets "repaired" before we find it and we miss our window. |
||
| } | ||
| } | ||
|
|
||
| /// Walk the list of extents and close each one. | ||
|
|
@@ -682,6 +706,28 @@ impl Region { | |
| ); | ||
| } | ||
|
|
||
| // Validate the extent that we just received, before copying it over | ||
| { | ||
| let new_extent = match Extent::open( | ||
| ©_dir, | ||
| &self.def(), | ||
| eid, | ||
| true, // read-only | ||
| &self.log.clone(), | ||
| ) { | ||
| Ok(e) => e, | ||
| Err(e) => { | ||
| panic!( | ||
| "Failed to open live-repair extent {eid} in \ | ||
| {copy_dir:?}: {e:?}" | ||
| ); | ||
| } | ||
| }; | ||
| if let Err(e) = new_extent.validate() { | ||
| panic!("Failed to validate live-repair extent {eid}: {e:?}"); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe when we fail here, it will leave behind a copy_dir. |
||
| } | ||
| } | ||
|
|
||
| // After we have all files: move the repair dir. | ||
| info!( | ||
| self.log, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this so we can verify the incoming copy of an extent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, opening an extent takes the extent's root directory and number, then builds the extent file path internally. This is annoying if we want to open a specific raw file!
Since we're building a test image directly from this PR (and probably not merging it), I didn't bother to clean this up further.