-
Notifications
You must be signed in to change notification settings - Fork 62
Account for Crucible Agent reservation overhead #7885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Account for Crucible Agent reservation overhead #7885
Conversation
When creating a region's dataset, the Crucible Agent will include a reservation 25% larger than the region's size to account for on-disk overhead (storing encryption contexts and other metadata). Nexus does not take this overhead into account when computing `size_used` for crucible_dataset rows, or when allocating regions. This leads to the scenario where Nexus thinks there's enough room for a region but the Agent will fail to create the dataset due to not having enough space for the reservation to succeed. Fix this: add a reservation factor column to the Region model, and account for this when performing region allocation and when computing the `size_used` column for crucible datasets. This commit also adds an upgrader that will set all currently allocated Region's reservation factor to 1.25, and recompute all the `size_used` values for all non-deleted crucible datasets. This may lead to `size_used` being greater than the pool's total_size - a follow up commit will add an omdb command to identify these cases, and identify candidate regions to request replacement for in order to remedy this. The `regions_hard_delete` function now uses this upgrader's CTE to set `size_used` for all crucible datasets at once, instead of in a for loop during an interactive transaction.
| let requested_size: u64 = | ||
| params.block_size * params.blocks_per_extent * params.extent_count; | ||
| let size_delta: u64 = | ||
| (requested_size as f64 * RESERVATION_FACTOR).round() as u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to ceil() here, and elsewhere we call round()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, actually, reading this and the code below, it weirds me out a little bit that we're calculating this information twice, as far as I can tell? See below, in proposed_dataset_changes, seems like we're doing this same calculation again...
maybe this is fine, but it just raised a flag to me -- if we're doing floating point math on accounting, it kinda seems like we should do it "exactly once".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
95696ef removes the floating point stuff, and also uses the size_delta variable in the spot you identified. nice catch :)
smklein
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple small questions, but the structure here seems reasonable to me. Thanks for adding the data migration test.
| ) | ||
| .execute_async(&conn).await?; | ||
| } | ||
| // XXX put this file somewhere else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you say, we should probably fix this before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I thought about this, and ended up embedding it as a string: 4834bd9. it's going to have to change independent of that update CTE anyway in the future.
schema/crdb/dbinit.sql
Outdated
| deleting BOOL NOT NULL | ||
| deleting BOOL NOT NULL, | ||
|
|
||
| reservation_factor FLOAT NOT NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely could benefit from docs, and it would probably be worth updating the "size_used" field in this file too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
13f2a07 adds some notes here, lmk what you think.
bnaecker
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by review, but just a few suggestions on the floating-point math, which almost always sucks.
schema/crdb/dbinit.sql
Outdated
| deleting BOOL NOT NULL | ||
| deleting BOOL NOT NULL, | ||
|
|
||
| reservation_factor FLOAT NOT NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we might want a CHECK reservation_factor >= 1.0 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this changed to an enum, the value is no longer unrestricted
nexus/db-model/src/region.rs
Outdated
| /// which is some factor higher than the requested region size to account | ||
| /// for on-disk overhead. | ||
| pub fn reserved_size(&self) -> u64 { | ||
| (self.requested_size() as f64 * self.reservation_factor).round() as u64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like ceil() might be safer than round()ing. I'm also not sure how to handle it, but checking for overflow in the conversion back to a u64 seems helpful too. One way might be adding checks in the Region::new() constructor, say that the factor is >= 1.0 and that the reserved size won't overflow a u64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
95696ef removes the floating point math, and restricts the factor via an enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, much simpler.
reservation right now is 25%, which means that the requested size of a region can be divided by 4. avoid floating point math where possible. change the reservation percentage stored with the region to an enum, where the only value is 25%. this limits what can be done with manual database edits, and restricts what the Region::reserved_size function has to guard against. it'd be nice if Region::new was a test-only function but the crate doesn't have the same idea of a integration test feature.
it will have to change independent of the schema update anyway
leftwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get a 500 if I'm out of space:
18:10:26.013Z INFO 4b3e083b-f9b8-41d9-a48d-a22683a7853e (dropshot_external): request completed
error_message_external = Internal Server Error
error_message_internal = saga ACTION error at node "regions_ensure": Failed to create region, unexpected state: Failed
file = /home/alan/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.0/src/server.rs:855
latency_us = 939684
local_addr = 172.30.2.5:80
method = POST
remote_addr = 192.168.1.199:57148
req_id = 1e314f60-df29-4c7e-8ae3-130ccc33883c
response_code = 500
uri = /v1/disks?project=alan
nexus/db-model/src/region.rs
Outdated
| #[diesel(sql_type = RegionReservationPercentEnum)] | ||
| pub enum RegionReservationPercent; | ||
|
|
||
| // Enum values | ||
| TwentyFive => b"25" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really 1.25%, or it could be called RegionAdditionalReservationPercent (which name I hate, too long)
leftwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more nit-picks for you
| type AllocationQuery = | ||
| TypedSqlQuery<(SelectableSql<CrucibleDataset>, SelectableSql<Region>)>; | ||
|
|
||
| impl std::fmt::Debug for AllocationQuery { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How load-bearing is this Debug impl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed at all, removed in dd6ae81
| } | ||
|
|
||
| // After the above check, unconditionally cast from u64 to i64. The value is | ||
| // low enough that this shouldn't truncate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I don't think it would truncate either way, I think it would fail with an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after discussing it a bit, went with returning an Err instead of unwrapping in 7d9fed6
| // The Crucible Agent's current reservation factor is 25%, so add that here. | ||
| // Check first that the requested region size is divisible by this. This | ||
| // should basically never fail because all block sizes are divisible by 4. | ||
| if requested_size % 4 != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is contingent on the value of reservation_percent, and the enum RegionReservationPercent only having a single value?
If we add another value to that enum, this code will happily still compile (incorrectly), right?
I know this is pedantic, but could we:
- Move up the
let reservation_percent =
crate::db::model::RegionReservationPercent::TwentyFive;
from below
- Run these "mod by four, set factor = 4" checks in a match arm, based on
reservation_percent?
Then, if we add another option for RegionReservationPercent, the code will complain about a missing match arm, which would help identify this spot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this pattern is always a good idea, done in 1e47888
| logctx.cleanup_successful(); | ||
| } | ||
|
|
||
| #[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tests!
As a nitpick, neither of these tests needs to be async (they could just be #[test]) if you want 'em to be marginally cheaper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 be6599b
| * otherwise ignores this field. It's updated by Nexus as region allocations | ||
| * and deletions are performed using this dataset. | ||
| * | ||
| * Note that the value in this column is _not_ the sum of requested region |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🫡
When creating a region's dataset, the Crucible Agent will include a reservation 25% larger than the region's size to account for on-disk overhead (storing encryption contexts and other metadata). Nexus does not take this overhead into account when computing
size_usedfor crucible_dataset rows, or when allocating regions. This leads to the scenario where Nexus thinks there's enough room for a region but the Agent will fail to create the dataset due to not having enough space for the reservation to succeed.Fix this: add a reservation factor column to the Region model, and account for this when performing region allocation and when computing the
size_usedcolumn for crucible datasets.This commit also adds an upgrader that will set all currently allocated Region's reservation factor to 1.25, and recompute all the
size_usedvalues for all non-deleted crucible datasets. This may lead tosize_usedbeing greater than the pool's total_size - a follow up commit will add an omdb command to identify these cases, and identify candidate regions to request replacement for in order to remedy this.The
regions_hard_deletefunction now uses this upgrader's CTE to setsize_usedfor all crucible datasets at once, instead of in a for loop during an interactive transaction.