Contributor

@jmpesp jmpesp commented Mar 27, 2025

When creating a region's dataset, the Crucible Agent will include a reservation 25% larger than the region's size to account for on-disk overhead (storing encryption contexts and other metadata). Nexus does not take this overhead into account when computing size_used for crucible_dataset rows, or when allocating regions. This leads to the scenario where Nexus thinks there's enough room for a region but the Agent will fail to create the dataset due to not having enough space for the reservation to succeed.

Fix this: add a reservation factor column to the Region model, and account for this when performing region allocation and when computing the size_used column for crucible datasets.

This commit also adds an upgrader that will set the reservation factor of all currently allocated Regions to 1.25, and recompute the size_used values for all non-deleted crucible datasets. This may lead to size_used being greater than the pool's total_size - a follow-up commit will add an omdb command to identify these cases, and to identify candidate regions to request replacement for in order to remedy this.

The regions_hard_delete function now uses this upgrader's CTE to set size_used for all crucible datasets at once, instead of in a for loop during an interactive transaction.
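
To illustrate the accounting described above, here is a minimal in-memory sketch of recomputing size_used from reserved (rather than requested) region sizes, for every dataset in one pass. It is not the CTE from this PR: `RegionRow`, `recompute_size_used`, and `overcommitted` are hypothetical names, and the 25% overhead is taken from the description.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a region row; not the real Nexus model.
#[derive(Debug, Clone)]
pub struct RegionRow {
    pub dataset_id: u32,
    pub requested_size: u64,
}

/// Reserved size for one region, assuming a 25% overhead.
/// Integer math only: requested + requested / 4, rounding the overhead up.
pub fn reserved_size(requested: u64) -> u64 {
    let overhead = requested / 4 + u64::from(requested % 4 != 0);
    requested + overhead
}

/// Recompute `size_used` for every dataset in a single pass over the regions,
/// instead of updating one dataset at a time.
pub fn recompute_size_used(regions: &[RegionRow]) -> HashMap<u32, u64> {
    let mut used: HashMap<u32, u64> = HashMap::new();
    for region in regions {
        *used.entry(region.dataset_id).or_insert(0) +=
            reserved_size(region.requested_size);
    }
    used
}

/// Datasets whose recomputed `size_used` exceeds the pool's total size.
/// Datasets missing from `total_size` are skipped.
pub fn overcommitted(
    used: &HashMap<u32, u64>,
    total_size: &HashMap<u32, u64>,
) -> Vec<u32> {
    let mut ids: Vec<u32> = used
        .iter()
        .filter(|(id, size)| total_size.get(*id).map_or(false, |t| *size > t))
        .map(|(id, _)| *id)
        .collect();
    ids.sort_unstable();
    ids
}
```

With two 40-byte regions on a dataset whose pool holds 90 bytes, the requested sizes sum to 80 and appear to fit, but the reserved sizes sum to 100, which is the overcommitted case the upgrader can surface.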

let requested_size: u64 =
    params.block_size * params.blocks_per_extent * params.extent_count;
let size_delta: u64 =
    (requested_size as f64 * RESERVATION_FACTOR).round() as u64;
Collaborator

Do we want to ceil() here, and elsewhere we call round()?

Collaborator

Hrm, actually, reading this and the code below, it weirds me out a little bit that we're calculating this information twice, as far as I can tell? See below, in proposed_dataset_changes, seems like we're doing this same calculation again...

maybe this is fine, but it just raised a flag to me -- if we're doing floating point math on accounting, it kinda seems like we should do it "exactly once".

Contributor Author

95696ef removes the floating point stuff, and also uses the size_delta variable in the spot you identified. nice catch :)
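
The round() versus ceil() question can be made concrete with a small sketch. It assumes `RESERVATION_FACTOR` is 1.25 (from the PR description); the function names are hypothetical, and the integer variant is only one possible shape for the float-free version, not necessarily what 95696ef does.

```rust
/// Assumed value, taken from the 25% overhead in the PR description.
pub const RESERVATION_FACTOR: f64 = 1.25;

/// The original approach: round to nearest, which can under-reserve.
pub fn size_delta_round(requested: u64) -> u64 {
    (requested as f64 * RESERVATION_FACTOR).round() as u64
}

/// Rounding up never reserves less than the exact product.
pub fn size_delta_ceil(requested: u64) -> u64 {
    (requested as f64 * RESERVATION_FACTOR).ceil() as u64
}

/// Float-free alternative: only accept sizes divisible by 4, so the 25%
/// overhead is exact, and report overflow instead of hiding it.
pub fn size_delta_integer(requested: u64) -> Option<u64> {
    if requested % 4 != 0 {
        return None;
    }
    requested.checked_add(requested / 4)
}
```

For a requested size of 5, the exact product is 6.25: round() yields 6 (less than the reservation needs), ceil() yields 7, and the integer variant rejects the input because it is not divisible by 4.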

Collaborator

@smklein smklein left a comment

Couple small questions, but the structure here seems reasonable to me. Thanks for adding the data migration test.

)
.execute_async(&conn).await?;
}
// XXX put this file somewhere else
Collaborator

As you say, we should probably fix this before merging

Contributor Author

yeah, I thought about this, and ended up embedding it as a string: 4834bd9. it's going to have to change independent of that update CTE anyway in the future.

-    deleting BOOL NOT NULL
+    deleting BOOL NOT NULL,
+
+    reservation_factor FLOAT NOT NULL
Collaborator

This definitely could benefit from docs, and it would probably be worth updating the "size_used" field in this file too.

Contributor Author

13f2a07 adds some notes here, lmk what you think.

Collaborator

@bnaecker bnaecker left a comment

Drive-by review, but just a few suggestions on the floating-point math, which almost always sucks.

-    deleting BOOL NOT NULL
+    deleting BOOL NOT NULL,
+
+    reservation_factor FLOAT NOT NULL
Collaborator

It seems like we might want a CHECK reservation_factor >= 1.0 here.

Contributor Author

this changed to an enum, the value is no longer unrestricted

/// which is some factor higher than the requested region size to account
/// for on-disk overhead.
pub fn reserved_size(&self) -> u64 {
    (self.requested_size() as f64 * self.reservation_factor).round() as u64
Collaborator

It seems like ceil() might be safer than round()ing. I'm also not sure how to handle it, but checking for overflow in the conversion back to a u64 seems helpful too. One way might be adding checks in the Region::new() constructor, say that the factor is >= 1.0 and that the reserved size won't overflow a u64.

Contributor Author

95696ef removes the floating point math, and restricts the factor via an enum.

Collaborator

I like it, much simpler.
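
The overflow concern raised in this thread can be sketched as follows. The function names are hypothetical and the code is an illustration of the pitfall, not the code from 95696ef: a float-to-integer `as` cast in Rust saturates silently, and f64 cannot represent every u64 exactly, whereas checked integer arithmetic reports the problem.

```rust
/// Float-based version, as in the original snippet. The `as u64` cast
/// saturates at u64::MAX instead of signalling overflow.
pub fn reserved_size_float(requested: u64, factor: f64) -> u64 {
    (requested as f64 * factor).round() as u64
}

/// Integer version for a 25% overhead. Returns None on overflow, so the
/// caller has to decide what to do rather than getting a clamped value.
pub fn reserved_size_checked(requested: u64) -> Option<u64> {
    requested.checked_add(requested / 4)
}
```

For `u64::MAX`, the float version quietly returns `u64::MAX` while the checked version returns `None`. For sizes above 2^53, the float version can also differ from the exact integer result because the intermediate f64 loses low-order bits.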

@morlandi7 morlandi7 added this to the 14 milestone Mar 28, 2025
jmpesp added 4 commits March 28, 2025 20:34
reservation right now is 25%, which means that the requested size of a
region can be divided by 4. avoid floating point math where possible.

change the reservation percentage stored with the region to an enum,
where the only value is 25%. this limits what can be done with manual
database edits, and restricts what the Region::reserved_size function
has to guard against.

it'd be nice if Region::new was a test-only function but the crate
doesn't have the same idea of an integration test feature.
it will have to change independent of the schema update anyway

Contributor Author

jmpesp commented Mar 28, 2025

@smklein @bnaecker thanks for the reviews - I've pushed updates so that there's no more floating point math in the db queries or the related rust code, let me know what you think :)

Contributor

@leftwo leftwo left a comment

I get a 500 if I'm out of space:

18:10:26.013Z INFO 4b3e083b-f9b8-41d9-a48d-a22683a7853e (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = saga ACTION error at node "regions_ensure": Failed to create region, unexpected state: Failed
    file = /home/alan/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.0/src/server.rs:855
    latency_us = 939684
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:57148
    req_id = 1e314f60-df29-4c7e-8ae3-130ccc33883c
    response_code = 500
    uri = /v1/disks?project=alan

Comment on lines 25 to 29
#[diesel(sql_type = RegionReservationPercentEnum)]
pub enum RegionReservationPercent;

// Enum values
TwentyFive => b"25"
Contributor

This is really 125% (a factor of 1.25), or it could be called RegionAdditionalReservationPercent (a name I hate, too long)

Contributor

@leftwo leftwo left a comment

A few more nit-picks for you

Contributor Author

jmpesp commented Apr 2, 2025

@leftwo I think the 500 you're seeing is a result of #7902 - the follow-up PR that creates a storage buffer should prevent this 500 from happening, and instead return the much more appropriate 507.

type AllocationQuery =
TypedSqlQuery<(SelectableSql<CrucibleDataset>, SelectableSql<Region>)>;

impl std::fmt::Debug for AllocationQuery {
Collaborator

How load-bearing is this Debug impl?

Contributor Author

not needed at all, removed in dd6ae81

}

// After the above check, unconditionally cast from u64 to i64. The value is
// low enough that this shouldn't truncate.
Collaborator

FWIW I don't think it would truncate either way, I think it would fail with an error?

Contributor Author

after discussing it a bit, went with returning an Err instead of unwrapping in 7d9fed6
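
For context on the truncate-versus-error question, here is a minimal sketch of the two ways a u64 can be converted to an i64 in Rust. Which one the original code used is not shown in this thread; the function names and the `String` error type are assumptions for illustration only.

```rust
/// An `as` cast never fails: values above i64::MAX wrap around to
/// negative numbers (two's complement reinterpretation).
pub fn size_to_i64_cast(size: u64) -> i64 {
    size as i64
}

/// A checked conversion returns an error for out-of-range values, which the
/// caller can propagate instead of unwrapping.
pub fn size_to_i64_checked(size: u64) -> Result<i64, String> {
    i64::try_from(size)
        .map_err(|e| format!("size {size} does not fit in an i64: {e}"))
}
```

Returning the `Err` keeps an out-of-range size from turning into either a panic (with `unwrap`) or a silently negative value (with `as`).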

// The Crucible Agent's current reservation factor is 25%, so add that here.
// Check first that the requested region size is divisible by this. This
// should basically never fail because all block sizes are divisible by 4.
if requested_size % 4 != 0 {
Collaborator

This is contingent on the value of reservation_percent, and the enum RegionReservationPercent only having a single value?

If we add another value to that enum, this code will happily still compile (incorrectly), right?

I know this is pedantic, but could we:

  1. Move up the
         let reservation_percent =
             crate::db::model::RegionReservationPercent::TwentyFive;
     from below
  2. Run these "mod by four, set factor = 4" checks in a match arm, based on reservation_percent?

Then, if we add another option for RegionReservationPercent, the code will complain about a missing match arm, which would help identify this spot.

Contributor Author

I think this pattern is always a good idea, done in 1e47888
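
The pattern suggested in this thread can be sketched roughly as below. This is an assumption-laden illustration rather than the code from 1e47888: the enum here is a plain Rust enum (not the diesel-backed one), and `reserved_size` with its `String` error is a hypothetical helper.

```rust
/// Plain stand-in for the database-backed enum; only one value exists today.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegionReservationPercent {
    TwentyFive,
}

/// Compute the reserved size using integer math. The divisibility check and
/// the divisor live inside the match arm, so adding a new enum variant
/// produces a compile error here until a matching arm is written.
pub fn reserved_size(
    requested_size: u64,
    percent: RegionReservationPercent,
) -> Result<u64, String> {
    let overhead = match percent {
        RegionReservationPercent::TwentyFive => {
            if requested_size % 4 != 0 {
                return Err(format!(
                    "requested size {requested_size} is not divisible by 4"
                ));
            }
            requested_size / 4
        }
    };

    requested_size
        .checked_add(overhead)
        .ok_or_else(|| format!("reserved size for {requested_size} overflows u64"))
}
```

Because the match has no wildcard arm, the compiler rejects the function as soon as `RegionReservationPercent` gains a variant, which is what flags this spot for review.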

logctx.cleanup_successful();
}

#[tokio::test]
Collaborator

Nice tests!

As a nitpick, neither of these tests needs to be async (they could just be #[test]) if you want 'em to be marginally cheaper

Contributor Author

👍 be6599b

* otherwise ignores this field. It's updated by Nexus as region allocations
* and deletions are performed using this dataset.
*
* Note that the value in this column is _not_ the sum of requested region
Collaborator

Thank you for adding this!!!

Contributor Author

🫡

@jmpesp jmpesp enabled auto-merge (squash) April 3, 2025 19:04
@jmpesp jmpesp merged commit e6b3fea into oxidecomputer:main Apr 3, 2025
16 checks passed
@jmpesp jmpesp deleted the crucible_agent_region_overhead branch April 3, 2025 20:41

5 participants