Rework VMM reservoir sizing to scale better with memory configurations #7837
Conversation
The core observation of this change is that some uses of memory are relatively fixed regardless of a sled's hardware configuration. By subtracting these more constrained uses of memory before calculating a VMM reservoir size, the remaining memory will be used mostly for services that scale either with the amount of physical memory or the amount of storage installed.

The new `control_plane_memory_earmark_mb` setting for sled-agent describes the sum of this fixed allocation, and existing sled-agent config.toml files are updated so that actual VMM reservoir sizes for Gimlets with 1 TiB of installed memory are about the same:

Before: `1012 * 0.8 => 809.6 GiB` of VMM reservoir
After: `(1012 - 30 - 44) * 0.863 => 809.494 GiB` of VMM reservoir

A Gimlet with 2 TiB of DRAM sees a larger VMM reservoir:

Before: `2048 * 0.8 => 1638.4 GiB` of VMM reservoir
After: `(2048 - 60 - 44) * 0.863 => 1677.672 GiB` of VMM reservoir

A Gimlet with less than 1 TiB of DRAM would see a smaller VMM reservoir, but this is in some sense correct: we would otherwise "overprovision" the VMM reservoir and eat into what is currently effectively a slush fund of memory for Oxide services supporting the rack's operation, risking overall system stability, going by observation and testing on systems with 1 TiB Gimlets.

A useful additional step in the direction of "config that is workable across SKUs" would be to measure Crucible overhead in the context of number of disks or total installed storage. Then we could calculate the VMM reservoir after subtracting the maximum memory expected to be used by Crucible if all storage was allocated, and have a presumably-higher VMM reservoir percentage for the yet-smaller slice of system memory that is not otherwise accounted for.

Fixes #7448.
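To make the before/after comparison concrete, here is a rough sketch of the reworked sizing in code. This is illustrative only: the function and parameter names are made up for this example and are not sled-agent's actual API; the 30 GiB and 44 GiB inputs are the page_t and earmark figures from the formula above.

```rust
/// Illustrative sketch of the reworked reservoir sizing described above; the
/// names here are hypothetical, not sled-agent's actual code.
fn vmm_reservoir_size_bytes(
    hardware_physical_ram_bytes: u64,
    max_page_t_space_bytes: u64,      // OS page_t overhead, scales with DRAM
    control_plane_earmark_bytes: u64, // fixed earmark from config.toml
    vmm_reservoir_percentage: f64,    // e.g. 86.3
) -> u64 {
    // Subtract the relatively fixed uses of memory first, then take the
    // configured percentage of whatever remains.
    let remaining = hardware_physical_ram_bytes
        .saturating_sub(max_page_t_space_bytes)
        .saturating_sub(control_plane_earmark_bytes);
    (remaining as f64 * (vmm_reservoir_percentage / 100.0)) as u64
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    // A 1 TiB Gimlet, for which illumos reports ~1012 GiB of usable memory.
    let size = vmm_reservoir_size_bytes(1012 * GIB, 30 * GIB, 44 * GIB, 86.3);
    println!("reservoir: {:.3} GiB", size as f64 / GIB as f64); // ~809.494 GiB
}
```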
```toml
vmm_reservoir_percentage = 86.3

# The amount of memory held back for services which exist between zero and one
# on this Gimlet. This currently includes some additional terms reflecting
# OS memory use under load.
#
# As of writing, this is the sum of the following items from RFD 413:
# * Network buffer slush: 18 GiB
# * Other kernel heap: 20 GiB
# * ZFS ARC minimum: 5 GiB
# * Sled agent: 0.5 GiB
# * Maghemite: 0.25 GiB
# * NTP: 0.25 GiB
control_plane_memory_earmark_mb = 45056
```
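As a quick sanity check of the figure above: the itemized list sums to 44 GiB, and 44 GiB is 45056 MiB. A trivial sketch of that arithmetic (not sled-agent code):

```rust
fn main() {
    // GiB figures from the comment block above.
    let items_gib = [18.0, 20.0, 5.0, 0.5, 0.25, 0.25];
    let total_gib: f64 = items_gib.iter().sum(); // 44.0
    assert_eq!(total_gib * 1024.0, 45056.0); // matches control_plane_memory_earmark_mb
    println!("{total_gib} GiB = {} MiB", total_gib * 1024.0);
}
```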
one obvious way this is not quite right: ClickHouse, Cockroach, DNS, Oximeter are all missing here, so this misses the premise of "budget enough memory that if we have to move a control plane service here, we don't have to evict a VM to do it". so are dendrite and wicket. i think the "earmark" amount should be closer to 76 GiB given earlier measurements, and the VMM reservoir percentage updated to around 89%
from talking with @faithanalog earlier, it looks like Crucible's kb-per-extent as i see in https://github.com/oxidecomputer/crucible/runs/39057809960 (~91KiB/extent) is a lower bound, whereas she sees as much as 225KiB/extent. that's around 58 GiB of variance all-told.
so, trying to avoid swapping with everything running on a sled here would have us wanting as much as 139 GiB set aside for control plane (95 GiB of Crucibles, 20 GiB of other kernel heap, 18 GiB for expected NIC buffers, the ARC minimum size and then one-per-sled services), with another up-to-40 GiB of services that are only sometimes present like databases, DNS, etc. that in turn would have us sizing the VMM reservoir at around 95% of what's left to keep the actual reservoir size the same, which should be fine as long as no one is making hundreds of 512 MiB instances...
my inclination at this point is we could really dial things in as they are today but we'd end up more brittle if anything changes in the future. we'd be better off connecting the "expected fixed use" term to what the control plane knows a sled should be running.
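One way to connect these figures (the extent count below is backed out from the numbers above, not measured): ~95 GiB of Crucible at 225 KiB/extent implies on the order of 440k extents, and the 91-to-225 KiB/extent spread across that many extents is roughly the "around 58 GiB of variance" mentioned.

```rust
fn main() {
    const KIB: f64 = 1024.0;
    const GIB: f64 = 1024.0 * 1024.0 * 1024.0;
    // Backed out, not measured: extent count implied by ~95 GiB at 225 KiB/extent.
    let extents = 95.0 * GIB / (225.0 * KIB); // ~442,700 extents
    let spread_gib = extents * (225.0 - 91.0) * KIB / GIB;
    println!("~{extents:.0} extents -> ~{spread_gib:.0} GiB of variance"); // ~57 GiB
}
```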
I feel like I'm probably missing something, and it may be that I'm about to agree vehemently with you. But I wonder if it makes sense to start out by changing the computation as you've done in this PR without changing any of its results for a 1 TiB sled:
- Set the earmark to 157,286 MiB: this is 15% of 1 TiB and 7.5% of 2 TiB
- note that I think all the rest of this math includes fixed-size OS costs in this amount, so this may turn out to be a slightly different number in practice
- Set the reservoir to 94% of the remainder
- for a 1 TiB sled, this is 838,860 MiB, or about 80% of total DRAM, which is the current reservoir size
- for a 2 TiB sled, this is 1,823,474 MiB, or just shy of 87% of total DRAM
- Leave the rest as slush
- for a 1 TiB sled, this is 52,430 MiB (5% of the sled total, 6.25% of reservoir)
- for a 2 TiB sled, this is 116,392 MiB (5.5% of the sled total, 6.38% of reservoir)
This would
- make more room for guest memory on the 2 TiB sleds (which is the point)
- leave proportionally the same amount of room for other Propolis memory on each sled type
- start to set up a clearer delineation between "memory reserved for fixed-count control plane services" and "memory used by variable-count services like Propolis VMMs"
- still ensure that the "fixed-count control plane services" bucket is large enough to avoid heavily constraining where these services can be placed (i.e., it's big enough to give us some chance of being able to punt for a while longer on Reconfigurator having to solve for memory constraints)
Most importantly, this preserves the existing reservoir size (and the existing non-reservoir size) for existing sleds, so we don't have to worry about accidentally bumping the reservoir size on those sleds in a way that destabilizes them.
WDYT? Again, it could be that we're vehemently agreeing and that I just needed to work out the math for myself in order to be convinced. Do we actually want to have a smaller earmark here to reflect a belief that most non-reservoir memory is actually used by Propolis and not other services?
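A small sketch of the proposed split, using the figures from the bullets above (not the actual implementation). The 1 TiB reservoir here lands a hair under the quoted 838,860 MiB, and the slush correspondingly a hair over, because that quoted figure was taken as exactly 80% of total DRAM rather than strictly 94% of the post-earmark remainder:

```rust
fn main() {
    let earmark_mib = 157_286.0; // ~15% of 1 TiB, ~7.5% of 2 TiB
    for (label, total_mib) in [("1 TiB", 1024.0 * 1024.0), ("2 TiB", 2048.0 * 1024.0)] {
        let reservoir = (total_mib - earmark_mib) * 0.94;
        let slush = total_mib - earmark_mib - reservoir;
        println!(
            "{label}: reservoir ~{reservoir:.0} MiB ({:.1}% of DRAM), slush ~{slush:.0} MiB",
            reservoir / total_mib * 100.0
        );
    }
}
```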
> it could be that we're vehemently agreeing and that I just needed to work out the math for myself in order to be convinced
i think we are, on the broad strokes!
> start to set up a clearer delineation between "memory reserved for fixed-count control plane services" and "memory used by variable-count services like Propolis VMMs"
including Crucible in the earmark makes it a bit weird: given that we've seen much higher (single-digit multiple) instantaneous peaks of memory use during, for example, volume repairs, i don't think estimating from kb/extent actually gives us the right high watermark.
i still agree this is about where we'd want to set the control plane earmark, just that unfortunately it's not as crisp an earmark as we'd want..
> Most importantly, this preserves the existing reservoir size (and the existing non-reservoir size) for existing sleds, so we don't have to worry about accidentally bumping the reservoir size on those sleds in a way that destabilizes them.
fully agreed. the numbers i'd picked here do shrink the VMM reservoir by about 106 MiB but staying inside the same number of GiB seems like the most important part. tiny nit: the 12 GiB unaddressable region results in illumos reporting 1012 GiB of physical installed memory, so in your numbers you'd want ~95.1% for the reservoir. relevant in a moment..
> Do we actually want to have a smaller earmark here to reflect a belief that most non-reservoir memory is actually used by Propolis and not other services?
my use of a smaller earmark here is more because i'm not confident the measured numbers give us a good sense of the actual worst cases - Eliza mentioned migrations, Artemis has mentioned Crucible repairs causing bursty heap behavior, and dogfood uptimes are generally lower so what i do know is that extrapolating from dogfood measurements will underestimate where we'll be after load and uptime.
so, instead of an earmark where we're not sure if it's too high or too low - good arguments for the RFD 413 figures being either, honestly - i set the earmark lower, to cover only items we can be certain are present on every sled and whose sizes we're relatively certain of. it's not high enough, but we definitely won't want to set the earmark any lower than this.
my other thought here, which i've realized is too pessimistic as i'm typing it out here, is that a really aggressive VMM reservoir percentage is definitely wrong in the worst case: if you take a 1 GiB instance as the minimum size, and Propolis reflecting about 120 MiB of additional allocations, then in the worst case where we're chock full of 1 GiB instances we should have a VMM reservoir no larger than 1024 / (1024 + 120) == 89.5% of otherwise-unearmarked memory.
that's too pessimistic though, because you'd be bound on vCPUs first, for 128 Propolis/sled. if the control plane earmark was high enough to warrant a >95% VMM reservoir percentage then i was concerned that the sled could be compelled to OOM. this misunderstanding is the main reason i didn't want to go for the more aggressive budget. incidentally, my thought for a more aggressive budget is what i'd described as my other thought in the RFD 413 refresh (and a lot closer to what you outline).
so, having written that all out, maybe it's better to go that way? i'm not super happy with the control plane earmark being a floating maybe-overestimate maybe-underestimate, but on its own that's not a strong reason to not :)
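For concreteness, the worst-case bound and the vCPU argument above work out like this (assuming the ~120 MiB-per-Propolis overhead figure stated above):

```rust
fn main() {
    // Worst case sketched above: a sled packed with 1 GiB instances, each
    // Propolis carrying ~120 MiB of allocations outside the reservoir.
    let per_vm_reservoir_mib = 1024.0;
    let per_vm_overhead_mib = 120.0;
    let max_pct = per_vm_reservoir_mib / (per_vm_reservoir_mib + per_vm_overhead_mib);
    println!("reservoir cap: {:.1}%", max_pct * 100.0); // ~89.5%

    // In practice vCPU limits bind first: at ~128 Propolis per sled, the
    // non-reservoir Propolis overhead is only about 15 GiB.
    println!("128 VMs * 120 MiB = {} GiB", 128.0 * 120.0 / 1024.0); // 15 GiB
}
```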
I think this makes sense, thanks! To repeat it back a bit, it sounds like we're sort of bucketing memory usage in this fashion:
1. Fixed OS costs
2. Memory for one-per-sled control plane services with generally stable usage requirements (e.g. NTP)
3. Memory for variable-count control plane services (e.g. CockroachDB)
4. Memory for possibly-bursty services like Crucible and Propolis
5. Reservoir
6. Everything else
For now, the control plane earmark includes (1) and (2) but not the others. The reservoir percentage is set so that the size on a 1 TiB Gimlet¹ is the same after you account for both the earmark and the page_t database:
- 1 TiB: ((1024 * 1024) - 45056 - (30 * 1024)) * .863 = 839,526 MiB of reservoir (~80% of DRAM)
- 2 TiB: ((2048 * 1024) - 45056 - (60 * 1024)) * .863 = 1,717,936 MiB of reservoir (~82% of DRAM)
If this math checks out, I would suggest summarizing it in a comment above the reservoir percentage here, since the 86.3% figure is (IMO) a bit of a magic number (it's not even an integer, let alone a multiple of ten!!).
Another way to analyze this is to see how much is left over after accounting for page_ts and the reservoir:
- 1 TiB: ((1024 - 30) * 1024) - 839526 = 178,330 MiB left over
- 2 TiB: ((2048 - 60) * 1024) - 1717936 = 317,776 MiB left over
It feels a little odd to leave that extra 136 GiB of RAM on the table on the 2 TiB sleds, but maybe we'll need it when we start live migrating VMs. It seems like the answer to that is not so much to play with the numbers further as to get a handle on buckets (3) and (4) from the list above so that we can include those estimates in our calculations here.
Footnotes
1. The 413 refresh correctly mentions that this might look different once you start getting sleds with different hardware, different numbers of logical processors, etc. I assume we'll find some way to revisit this if it turns out to be important for Cosmo; I note that this file is in the `gimlet` directory, so maybe that just means having a different config TOML.
> It seems like the answer to that is not so much to play with the numbers further as to get a handle on buckets (3) and (4)
this is exactly where i landed and probably explains the subsequent memory stats motivation lately!
> a comment above the reservoir percentage here,
... yeah, totally fair. i didn't even nod to 413 :(
the number of physical pages won't change at runtime really, nor will the size of pages, but it seems a bit nicer this way..
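For reference, the ~30 GiB-of-page_t-per-TiB figure used throughout this thread is consistent with a per-page cost of roughly 120 bytes; that constant is inferred from the numbers above, not taken from illumos source:

```rust
fn main() {
    const PAGE_SIZE: u64 = 4096; // base page size
    const PAGE_T_BYTES: u64 = 120; // inferred from ~30 GiB of page_t per TiB of DRAM
    let one_tib: u64 = 1024 * 1024 * 1024 * 1024;
    let pages = one_tib / PAGE_SIZE; // 268,435,456 pages
    let page_t_gib = pages * PAGE_T_BYTES / (1024 * 1024 * 1024);
    println!("~{page_t_gib} GiB of page_t per TiB"); // 30 GiB
}
```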
force-pushed from 077ea86 to 8039a08
gjcolombo left a comment
I still need to read the 413 refresh, but here are a few thoughts/suggestions.
sled-hardware/src/lib.rs (outdated)
```rust
hardware_physical_ram_bytes
    - max_page_t_space
    - self.control_plane_earmark_bytes
```
Might be worthwhile to get a log line in here with the terms that went into calculating this (for debuggability if the "reservoir size exceeds maximum" case in `set_reservoir_size` is reached).
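Something along these lines, perhaps; a sketch using slog-style structured fields, with the function and field names invented for illustration:

```rust
// Hypothetical sketch of the suggested log line; names are illustrative only.
use slog::info;

fn log_reservoir_calculation(
    log: &slog::Logger,
    hardware_physical_ram_bytes: u64,
    max_page_t_space: u64,
    control_plane_earmark_bytes: u64,
    reservoir_size_bytes: u64,
) {
    info!(
        log,
        "calculated VMM reservoir size";
        "hardware_physical_ram_bytes" => hardware_physical_ram_bytes,
        "max_page_t_space" => max_page_t_space,
        "control_plane_earmark_bytes" => control_plane_earmark_bytes,
        "reservoir_size_bytes" => reservoir_size_bytes
    );
}
```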
good call, included in f474252
I do still wonder somewhat about possibly bursty memory use during migrations, but that seems like something to investigate later.
EDIT: Ah, you sorta got out ahead of this one with oxidecomputer/propolis#890!
gjcolombo left a comment
This looks really good from an implementation point of view, but I want to make sure I understand all the new math.
Co-authored-by: Eliza Weisman <[email protected]>
Co-authored-by: Eliza Weisman <[email protected]>
gjcolombo left a comment
This looks good to me--thanks for working through all the math in the comments! I have a couple of additional suggestions for annotating the TOML files but I think this looks good overall.
gave this a looksee on real sleds: a 1T sled in dogfood as a baseline, plus a 1T and a 2T sled in dublin.
these numbers are different but pretty close to what i'd calculated at the start; the difference is math errors in the message. on the whole it looks right, so i'll fix up the message and merge this in the morning.
pairs well with this refresh of RFD 413 where i worked through the math for this approach.
The core observation of this change is that some uses of memory are relatively fixed regardless of a sled's hardware configuration. By subtracting these more constrained uses of memory before calculating a VMM reservoir size, the remaining memory will be used mostly for services that scale either with the amount of physical memory or the amount of storage installed.
The new `control_plane_memory_earmark_mb` setting for sled-agent describes the sum of this fixed allocation, and existing sled-agent config.toml files are updated so that actual VMM reservoir sizes for Gimlets with 1 TiB of installed memory are about the same:

Before: `1012 * 0.8 => 809.6 GiB` of VMM reservoir
After: `(1012 - (30.0 / 1024 * 1012) - 44) * 0.863 => 809.797 GiB` of VMM reservoir

A Gimlet with 2 TiB of DRAM sees a larger VMM reservoir:

Before: `2036 * 0.8 => 1628.8 GiB` of VMM reservoir
After: `(2036 - (60.0 / 2048 * 2036) - 44) * 0.863 => 1667.62 GiB` of VMM reservoir

These actual observed figures are close-but-not-exact because the amount of physical memory illumos reports looks to be about 25 MiB less.
A Gimlet with less than 1 TiB of DRAM would see a smaller VMM reservoir, but this is in some sense correct: we would otherwise "overprovision" the VMM reservoir and eat into what is currently effectively a slush fund of memory for Oxide services supporting the rack's operation, risking overall system stability if inferring from observation and testing on systems with 1 TiB gimlets.
A useful additional step in the direction of "config that is workable across SKUs" would be to measure Crucible overhead in the context of number of disks or total installed storage. Then we could calculate the VMM reservoir after subtracting the maximum memory expected to be used by Crucible if all storage was allocated, and have a presumably-higher VMM reservoir percentage for the yet-smaller slice of system memory that is not otherwise accounted.
Fixes #7448.