
Conversation

@iximeow (Member) commented Mar 19, 2025:

pairs well with this refresh of RFD 413 where i worked through the math for this approach.

The core observation of this change is that some uses of memory are relatively fixed regardless of a sled's hardware configuration. By subtracting these more constrained uses of memory before calculating a VMM reservoir size, the remaining memory will be used mostly for services that scale either with the amount of physical memory or the amount of storage installed.

The new control_plane_memory_earmark_mb setting for sled-agent describes the sum of this fixed allocation, and existing sled-agent config.toml files are updated so that actual VMM reservoir sizes for Gimlets with 1 TiB of installed memory are about the same:

Before: `1012 * 0.8 => 809.6 GiB` of VMM reservoir
After: `(1012 - (30.0 / 1024 * 1012) - 44) * 0.863 => 809.797 GiB` of VMM reservoir

A Gimlet with 2 TiB of DRAM sees a larger VMM reservoir:

Before: `2036 * 0.8 => 1628.8 GiB` of VMM reservoir
After: `(2036 - (60.0 / 2048 * 2036) - 44) * 0.863 => 1667.62 GiB` of VMM reservoir

These actual observed figures are close-but-not-exact because the amount of physical memory illumos reports looks to be about 25 MiB less than these nominal figures.
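In code form, the sizing above reduces to a small function. The following is a minimal sketch with hypothetical names and all quantities in MiB, not sled-agent's actual implementation:

// Sketch of the VMM reservoir sizing described above. Names and units are
// illustrative; the real logic lives in sled-agent.
fn vmm_reservoir_size_mb(
    physical_ram_mb: u64,    // as reported by illumos, e.g. 1036288 for a "1 TiB" Gimlet
    page_t_mb: u64,          // page_t overhead, roughly 30 GiB per TiB of addressable DRAM
    earmark_mb: u64,         // control_plane_memory_earmark_mb, e.g. 45056
    reservoir_percent: f64,  // vmm_reservoir_percentage, e.g. 86.3
) -> u64 {
    // Subtract the relatively fixed uses of memory first; the reservoir is a
    // percentage of whatever remains.
    let remaining = physical_ram_mb
        .saturating_sub(page_t_mb)
        .saturating_sub(earmark_mb);
    (remaining as f64 * reservoir_percent / 100.0) as u64
}

For the 1 TiB case above, vmm_reservoir_size_mb(1036288, 30360, 45056, 86.3) gives 829,232 MiB, about 809.8 GiB, matching the "After" figure.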

A Gimlet with less than 1 TiB of DRAM would see a smaller VMM reservoir, but this is in some sense correct: we would otherwise "overprovision" the VMM reservoir and eat into what is currently effectively a slush fund of memory for Oxide services supporting the rack's operation, risking overall system stability, judging from observation and testing on systems with 1 TiB Gimlets.

A useful additional step in the direction of "config that is workable across SKUs" would be to measure Crucible overhead in the context of the number of disks or total installed storage. Then we could calculate the VMM reservoir after subtracting the maximum memory expected to be used by Crucible if all storage were allocated, and have a presumably-higher VMM reservoir percentage for the yet-smaller slice of system memory that is not otherwise accounted for; a rough formula for this is sketched below.
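In rough formula form (a sketch of the idea only; this PR does not implement the Crucible term):

vmm_reservoir = (physical_ram - page_t_overhead - control_plane_earmark
                 - max_crucible_overhead(total_storage)) * reservoir_fraction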

Fixes #7448.

Comment on lines 23 to 35
vmm_reservoir_percentage = 86.3
# The amount of memory held back for services which exist between zero and one
# on this Gimlet. This currently includes some additional terms reflecting
# OS memory use under load.
#
# As of writing, this is the sum of the following items from RFD 413:
# * Network buffer slush: 18 GiB
# * Other kernel heap: 20 GiB
# * ZFS ARC minimum: 5 GiB
# * Sled agent: 0.5 GiB
# * Maghemite: 0.25 GiB
# * NTP: 0.25 GiB
control_plane_memory_earmark_mb = 45056
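(For reference, the listed items sum to 18 + 20 + 5 + 0.5 + 0.25 + 0.25 = 44 GiB, and 44 * 1024 = 45056, which is where the earmark value in MiB comes from.)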
@iximeow (Member, Author) commented Mar 19, 2025:

one obvious way this is not quite right: ClickHouse, Cockroach, DNS, and Oximeter are all missing here, as are dendrite and wicket, so this misses the premise of "budget enough memory that if we have to move a control plane service here, we don't have to evict a VM to do it". i think the "earmark" amount should be closer to 76 GiB given earlier measurements, and the VMM reservoir percentage updated to around 89%.

@iximeow (Member, Author) commented Mar 21, 2025:

from talking with @faithanalog earlier, it looks like Crucible's KiB-per-extent as i see in https://github.com/oxidecomputer/crucible/runs/39057809960 (~91 KiB/extent) is a lower bound, whereas she sees as much as 225 KiB/extent. that's around 58 GiB of variance all-told.

so, trying to avoid swapping with everything running on a sled here would have us wanting as much as 139 GiB set aside for the control plane (95 GiB of Crucibles, 20 GiB of other kernel heap, 18 GiB for expected NIC buffers, the ARC minimum size, and then one-per-sled services), with another up to 40 GiB of services that are only sometimes present like databases, DNS, etc. that in turn would have us sizing the VMM reservoir at around 95% of what's left to keep the actual reservoir size the same, which should be fine as long as no one is making hundreds of 512 MiB instances...
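(Summing the terms named above: 95 + 20 + 18 = 133 GiB, plus the 5 GiB ARC minimum and the ~1 GiB of one-per-sled services from the earmark breakdown, gives the 139 GiB figure.)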

my inclination at this point is that we could really dial things in as they are today, but we'd end up more brittle if anything changes in the future. we'd be better off connecting the "expected fixed use" term to what the control plane knows a sled should be running.

@gjcolombo (Contributor) commented:

I feel like I'm probably missing something, and it may be that I'm about to agree vehemently with you. But I wonder if it makes sense to start out by changing the computation as you've done in this PR without changing any of its results for a 1 TiB sled:

  • Set the earmark to 157,286 MiB: this is 15% of 1 TiB and 7.5% of 2 TiB
    • note that I think all the rest of this math includes fixed-size OS costs in this amount, so this may turn out to be a slightly different number in practice
  • Set the reservoir to 94% of the remainder
    • for a 1 TiB sled, this is 838,860 MiB, or about 80% of total DRAM, which is the current reservoir size
    • for a 2 TiB sled, this is 1,823,474 MiB, or just shy of 87% of total DRAM
  • Leave the rest as slush
    • for a 1 TiB sled, this is 52,430 MiB (5% of the sled total, 6.25% of reservoir)
    • for a 2 TiB sled, this is 116,392 MiB (5.5% of the sled total, 6.38% of reservoir)

This would

  • make more room for guest memory on the 2 TiB sleds (which is the point)
  • leave proportionally the same amount of room for other Propolis memory on each sled type
  • start to set up a clearer delineation between "memory reserved for fixed-count control plane services" and "memory used by variable-count services like Propolis VMMs"
  • still ensure that the "fixed-count control plane services" bucket is large enough to avoid heavily constraining where these services can be placed (i.e., it's big enough to give us some chance of being able to punt for a while longer on Reconfigurator having to solve for memory constraints)

Most importantly, this preserves the existing reservoir size (and the existing non-reservoir size) for existing sleds, so we don't have to worry about accidentally bumping the reservoir size on those sleds in a way that destabilizes them.

WDYT? Again, it could be that we're vehemently agreeing and that I just needed to work out the math for myself in order to be convinced. Do we actually want to have a smaller earmark here to reflect a belief that most non-reservoir memory is actually used by Propolis and not other services?

@iximeow (Member, Author) replied:

> it could be that we're vehemently agreeing and that I just needed to work out the math for myself in order to be convinced

i think we are, on the broad strokes!

> start to set up a clearer delineation between "memory reserved for fixed-count control plane services" and "memory used by variable-count services like Propolis VMMs"

including Crucible in the earmark makes it a bit weird: given that we've seen much higher (single-digit multiple) instantaneous peaks of memory use during, for example, volume repairs, i don't think estimating from kb/extent actually gives us the right high watermark.

i still agree this is about where we'd want to set the control plane earmark, just that unfortunately it's not as crisp an earmark as we'd want..

> Most importantly, this preserves the existing reservoir size (and the existing non-reservoir size) for existing sleds, so we don't have to worry about accidentally bumping the reservoir size on those sleds in a way that destabilizes them.

fully agreed. the numbers i'd picked here do shrink the VMM reservoir by about 106 MiB, but staying inside the same number of GiB seems like the most important part. tiny nit: the 12 GiB unaddressable region results in illumos reporting 1012 GiB of physical installed memory, so in your numbers you'd want ~95.1% for the reservoir. relevant in a moment..

> Do we actually want to have a smaller earmark here to reflect a belief that most non-reservoir memory is actually used by Propolis and not other services?

my use of a smaller earmark here is more because i'm not confident the measured numbers give us a good sense of the actual worst cases - Eliza mentioned migrations, Artemis has mentioned Crucible repairs causing bursty heap behavior, and dogfood uptimes are generally lower, so what i do know is that extrapolating from dogfood measurements will underestimate where we'll be after load and uptime.

so, instead of an earmark where we're not sure if it's too high or too low - there are good arguments for the RFD 413 figures being either, honestly - i set the earmark lower, to cover only the items we can be certain are present on every sled and whose sizes we're.. relatively certain of. it's not high enough, but we definitely won't want to set the earmark any lower.

my other thought here, which i've realized is too pessimistic as i'm typing it out here, is that a really aggressive VMM reservoir percentage is definitely wrong in the worst case: if you take a 1 GiB instance as the minimum size, and Propolis reflecting about 120 MiB of additional allocations, then in the worst case where we're chock full of 1 GiB instances we should have a VMM reservoir no larger than 1024 / (1024 + 120) == 89.5% of otherwise-unearmarked memory.

that's too pessimistic though, because you'd be bound on vCPUs first, at 128 Propolis/sled. if the control plane earmark were high enough to warrant a >95% VMM reservoir percentage, i was concerned that the sled could be compelled to OOM. this misunderstanding is the main reason i didn't want to go for the more aggressive budget. incidentally, my thought for a more aggressive budget is what i'd described as my other thought in the RFD 413 refresh (and a lot closer to what you outline).

so, having written that all out, maybe it's better to go that way? i'm not super happy with the control plane earmark being a floating maybe-overestimate maybe-underestimate, but on its own that's not a strong reason to not :)

@gjcolombo (Contributor) replied:

I think this makes sense, thanks! To repeat it back a bit, it sounds like we're sort of bucketing memory usage in this fashion:

  1. Fixed OS costs
  2. Memory for one-per-sled control plane services with generally stable usage requirements (e.g. NTP)
  3. Memory for variable-count control plane services (e.g. CockroachDB)
  4. Memory for possibly-bursty services like Crucible and Propolis
  5. Reservoir
  6. Everything else

For now, the control plane earmark includes (1) and (2) but not the others. The reservoir percentage is set so that the size on a 1 TiB Gimlet [1] is the same after you account for both the earmark and the page_t database:

  • 1 TiB: ((1024 * 1024) - 45056 - (30 * 1024)) * .863 = 839,526 MiB of reservoir (~80% of DRAM)
  • 2 TiB: ((2048 * 1024) - 45056 - (60 * 1024)) * .863 = 1,717,936 MiB of reservoir (~82% of DRAM)

If this math checks out, I would suggest summarizing it in a comment above the reservoir percentage here, since the 86.3% figure is (IMO) a bit of a magic number (it's not even an integer, let alone a multiple of ten!!).
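Something like this could work (a hypothetical sketch of the suggested annotation, not actual committed wording):

# The reservoir percentage is chosen so that a 1 TiB Gimlet keeps roughly its
# pre-existing ~810 GiB reservoir once the page_t overhead and the control
# plane earmark below are subtracted:
#   ((1024 * 1024) - 45056 - (30 * 1024)) * 0.863 ~= 839,526 MiB
vmm_reservoir_percentage = 86.3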


Another way to analyze this is to see how much is left over after accounting for page_ts and the reservoir:

  • 1 TiB: ((1024 - 30) * 1024) - 839526 = 178,330 MiB left over
  • 2 TiB: ((2048 - 60) * 1024) - 1717936 = 317,716 MiB left over

It feels a little odd to leave that extra 136 GiB of RAM on the table on the 2 TiB sleds, but maybe we'll need it when we start live migrating VMs. It seems like the answer to that is not so much to play with the numbers further as to get a handle on buckets (3) and (4) from the list above so that we can include those estimates in our calculations here.

Footnotes

  [1] The 413 refresh correctly mentions that this might look different once you start getting sleds with different hardware, different numbers of logical processors, etc. I assume we'll find some way to revisit this if it turns out to be important for Cosmo; I note that this file is in the gimlet directory, so maybe that just means having a different config TOML.

@iximeow (Member, Author) replied:

> It seems like the answer to that is not so much to play with the numbers further as to get a handle on buckets (3) and (4)

this is exactly where i landed and probably explains the subsequent memory stats motivation lately!

> a comment above the reservoir percentage here,

... yeah, totally fair. i didn't even nod to 413 :(

Commit: "the number of physical pages won't change at runtime really, nor will the size of pages, but it seems a bit nicer this way.."

@iximeow force-pushed the ixi/revised-reservoir-calculations branch from 077ea86 to 8039a08 on March 20, 2025 at 23:58
@iximeow marked this pull request as ready for review on March 21, 2025 at 01:47
@morlandi7 added this to the 14 milestone on Mar 25, 2025
@gjcolombo self-requested a review on March 27, 2025 at 20:10
@gjcolombo (Contributor) left a comment:

I still need to read the 413 refresh, but here are a few thoughts/suggestions.

Comment on lines 133 to 135
hardware_physical_ram_bytes
- max_page_t_space
- self.control_plane_earmark_bytes
@gjcolombo (Contributor) commented:

Might be worthwhile to get a log line in here with the terms that went into calculating this (for debuggability if the "reservoir size exceeds maximum" case in set_reservoir_size is reached).
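For example (a hypothetical sketch using slog-style structured fields; the actual change landed in f474252 below):

// Log the inputs alongside the computed size so a bad reservoir request
// can be traced back to its terms.
info!(
    log,
    "calculated VMM reservoir size";
    "vmm_reservoir_size_bytes" => reservoir_size,
    "hardware_physical_ram_bytes" => hardware_physical_ram_bytes,
    "max_page_t_space" => max_page_t_space,
    "control_plane_earmark_bytes" => control_plane_earmark_bytes,
);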

@iximeow (Member, Author) replied:

good call, included in f474252

@hawkw (Member) left a comment:

I do still wonder somewhat about possibly bursty memory use during migrations, but that seems like something to investigate later.

EDIT: Ah, you sorta got out ahead of this one with oxidecomputer/propolis#890!

@gjcolombo (Contributor) left a comment:

This looks really good from an implementation point of view, but I want to make sure I understand all the new math.


@iximeow and others added 2 commits on April 1, 2025 (Co-authored-by: Eliza Weisman <[email protected]>)
@gjcolombo (Contributor) left a comment:

This looks good to me--thanks for working through all the math in the comments! I have a couple of additional suggestions for annotating the TOML files, but I think this looks good overall.


@iximeow (Member, Author) commented Apr 3, 2025:

gave this a looksee on dublin to make sure numbers are as expected, and they're.. close but not exactly the same?

here's a 1T sled in dogfood:

# mdb -ke ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   13693825             53491    5%
Boot pages                     13                 0    0%
ZFS File Data            27028158            105578   10%
VMM Memory              212228096            829016   80%
Anon                      3270065             12773    1%
Exec and libs              137040               535    0%
Page cache                   6992                27    0%
Free (cachelist)             8183                31    0%
Free (freelist)           8910955             34808    3%

Total                   265283327           1036262
Physical                265283325           1036262

so that's a baseline.

1T in dublin:

# mdb -ke '::memstat'
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   11104283             43376    4%
Boot pages                     13                 0    0%
ZFS File Data             3273216             12786    1%
VMM Memory              212279808            829218   80%
Anon                      3147407             12294    1%
Exec and libs               96915               378    0%
Page cache                  16362                63    0%
Free (cachelist)            39468               154    0%
Free (freelist)          35325855            137991   13%

Total                   265283327           1036262
Physical                265283325           1036262

and 2T in dublin:

Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   21037060             82176    4%
Boot pages                     13                 0    0%
ZFS File Data             3461993             13523    1%
VMM Memory              437152768           1707628   82%
Anon                       385195              1504    0%
Exec and libs              115280               450    0%
Page cache                   6891                26    0%
Free (cachelist)            26142               102    0%
Free (freelist)          71533441            279427   13%

Total                   533718783           2084838
Physical                533718781           2084838

that is to say,

  • 1T: 809.78 GiB of reservoir
  • 2T: 1667.61 GiB of reservoir

these numbers are different but pretty close to what i'd calculated at the start. the difference comes from math errors in the original message: the 30 and 60 GiB page_t expectations are not what we'd actually see - those don't account for the 12 GiB of pages that we won't ever have page_t structures for. then, on the 2 TiB gimlet, it turns out the 12 GiB unaddressable region is still unaddressable, and illumos sees only 2035.97 GiB of physical memory.
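(Working back from the memstat output above: 1,036,262 MB / 1024 ≈ 1011.97 GiB on the 1 TiB sleds and 2,084,838 MB / 1024 ≈ 2035.97 GiB on the 2 TiB sled - about 12 GiB shy of nominal in both cases, consistent with the unaddressable region.)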

on the whole it looks right so i'll fix up the message and merge this in the morning.

@iximeow merged commit 77c4136 into main on Apr 3, 2025 (16 checks passed)
@iximeow deleted the ixi/revised-reservoir-calculations branch on April 3, 2025 at 17:03