Commit 77c4136

Rework VMM reservoir sizing to scale better with memory configurations (#7837)
The core observation of this change is that some uses of memory are relatively fixed regardless of a sled's hardware configuration. By subtracting these more constrained uses of memory before calculating a VMM reservoir size, the remaining memory will be used mostly for services that scale either with the amount of physical memory or the amount of storage installed. The new `control_plane_memory_earmark_mb` setting for sled-agent describes the sum of this fixed allocation, and existing sled-agent config.toml files are updated so that actual VMM reservoir sizes for Gimlets with 1 TiB of installed memory are about the same:

Before: `1012 * 0.8 => 809.6 GiB` of VMM reservoir
After: `(1012 - (30.0 / 1024 * 1012) - 44) * 0.863 => 809.797 GiB` of VMM reservoir

A Gimlet with 2 TiB of DRAM sees a larger VMM reservoir:

Before: `2036 * 0.8 => 1628.8 GiB` of VMM reservoir
After: `(2036 - (60.0 / 2048 * 2036) - 44) * 0.863 => 1667.62 GiB` of VMM reservoir

These actual observed figures are close-but-not-exact because the amount of physical memory illumos reports looks to be about 25 MiB less than the installed total.

A Gimlet with less than 1 TiB of DRAM would see a smaller VMM reservoir, but this is in some sense correct: we would otherwise "overprovision" the VMM reservoir and eat into what is currently effectively a slush fund of memory for Oxide services supporting the rack's operation, risking overall system stability. This is inferred from observation and testing on systems with 1 TiB Gimlets.

A useful additional step toward "config that is workable across SKUs" would be to measure Crucible overhead as a function of the number of disks or total installed storage. Then we could calculate the VMM reservoir after subtracting the maximum memory expected to be used by Crucible if all storage were allocated, and have a presumably higher VMM reservoir percentage for the yet-smaller slice of system memory that is not otherwise accounted for.

Fixes #7448.
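The before/after arithmetic above can be checked with a short standalone sketch. This is not sled-agent code; the 30 GiB-per-TiB page_t estimate and the 44 GiB earmark are simply the figures quoted in the examples above:

```rust
// Old scheme: a flat percentage of installed DRAM (in GiB).
fn old_reservoir_gib(installed_gib: f64, percent: f64) -> f64 {
    installed_gib * (percent / 100.0)
}

// New scheme: subtract the page_t overhead (which scales with installed
// memory) and the fixed control plane earmark before applying the percentage.
fn new_reservoir_gib(
    installed_gib: f64,
    page_t_gib_per_tib: f64, // ~30 GiB of page_t structures per 1024 GiB
    earmark_gib: f64,        // control_plane_memory_earmark_mb, as GiB
    percent: f64,
) -> f64 {
    let page_t_gib = page_t_gib_per_tib / 1024.0 * installed_gib;
    (installed_gib - page_t_gib - earmark_gib) * (percent / 100.0)
}

fn main() {
    // 1 TiB Gimlet: both schemes land within a fraction of a GiB.
    println!("{:.1}", old_reservoir_gib(1012.0, 80.0)); // 809.6
    println!("{:.3}", new_reservoir_gib(1012.0, 30.0, 44.0, 86.3)); // ~809.797
    // 2 TiB Gimlet: the new scheme yields a larger reservoir.
    println!("{:.1}", old_reservoir_gib(2036.0, 80.0)); // 1628.8
    println!("{:.2}", new_reservoir_gib(2036.0, 30.0, 44.0, 86.3)); // ~1667.62
}
```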
1 parent 1a1f2c6 commit 77c4136

File tree

10 files changed: +239 additions, -56 deletions


sled-agent/src/config.rs

Lines changed: 17 additions & 5 deletions
@@ -57,11 +57,23 @@ pub struct Config {
     pub sled_mode: SledMode,
     // TODO: Remove once this can be auto-detected.
     pub sidecar_revision: SidecarRevision,
-    /// Optional percentage of DRAM to reserve for guest memory
-    pub vmm_reservoir_percentage: Option<u8>,
+    /// Optional percentage of otherwise-unbudgeted DRAM to reserve for guest
+    /// memory, after accounting for expected host OS memory consumption and,
+    /// if set, `vmm_reservoir_size_mb`.
+    pub vmm_reservoir_percentage: Option<f32>,
     /// Optional DRAM to reserve for guest memory in MiB (mutually exclusive
-    /// option with vmm_reservoir_percentage).
+    /// option with vmm_reservoir_percentage). This can be at most the amount
+    /// of otherwise-unbudgeted memory on the sled - a setting high enough to
+    /// oversubscribe physical memory results in a `sled-agent` error at
+    /// startup.
     pub vmm_reservoir_size_mb: Option<u32>,
+    /// Amount of memory to set aside in anticipation of use for services that
+    /// will have roughly constant memory use. These are services that may have
+    /// zero to one instances on a given sled - internal DNS, MGS, Nexus,
+    /// ClickHouse, and so on. For a sled that happens to not run these kinds
+    /// of control plane services, this memory is "wasted", but ensures the
+    /// sled could run those services if reconfiguration desired it.
+    pub control_plane_memory_earmark_mb: Option<u32>,
     /// Optional swap device size in GiB
     pub swap_device_size_gb: Option<u32>,
     /// Optional VLAN ID to be used for tagging guest VNICs.
@@ -181,8 +193,8 @@ mod test {
             let entry = entry.unwrap();
             if entry.file_name() == "config.toml" {
                 let path = entry.path();
-                Config::from_file(&path).unwrap_or_else(|_| {
-                    panic!("Failed to parse config {path}")
+                Config::from_file(&path).unwrap_or_else(|e| {
+                    panic!("Failed to parse config {path}: {e}")
                 });
                 configs_seen += 1;
             }
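For illustration, a sled-agent `config.toml` using the new knob might look like the following. These values are inferred from the 86.3% and 44 GiB figures in the commit message, not copied from a shipped config file:

```toml
# Hypothetical sled-agent config fragment.
# Reserve 86.3% of otherwise-unbudgeted DRAM for guest memory.
vmm_reservoir_percentage = 86.3
# Earmark 44 GiB (45056 MiB) for control plane services with roughly
# constant memory use (internal DNS, MGS, Nexus, ClickHouse, and so on).
control_plane_memory_earmark_mb = 45056
```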

sled-agent/src/sled_agent.rs

Lines changed: 13 additions & 6 deletions
@@ -69,7 +69,7 @@ use sled_agent_types::zone_bundle::{
     PriorityOrder, StorageLimit, ZoneBundleMetadata,
 };
 use sled_diagnostics::{SledDiagnosticsCmdError, SledDiagnosticsCmdOutput};
-use sled_hardware::{HardwareManager, underlay};
+use sled_hardware::{HardwareManager, MemoryReservations, underlay};
 use sled_hardware_types::Baseboard;
 use sled_hardware_types::underlay::BootstrapInterface;
 use sled_storage::manager::StorageHandle;
@@ -495,18 +495,25 @@ impl SledAgent {
             *sled_address.ip(),
         );
 
+        // The VMM reservoir is configured with respect to what's left after
+        // accounting for relatively fixed and predictable uses.
+        // We expect certain amounts of memory to be set aside for kernel,
+        // buffer, or control plane uses.
+        let memory_sizes = MemoryReservations::new(
+            parent_log.new(o!("component" => "MemoryReservations")),
+            long_running_task_handles.hardware_manager.clone(),
+            config.control_plane_memory_earmark_mb,
+        );
+
         // Configure the VMM reservoir as either a percentage of DRAM or as an
         // exact size in MiB.
         let reservoir_mode = ReservoirMode::from_config(
            config.vmm_reservoir_percentage,
            config.vmm_reservoir_size_mb,
         );
 
-        let vmm_reservoir_manager = VmmReservoirManager::spawn(
-            &log,
-            long_running_task_handles.hardware_manager.clone(),
-            reservoir_mode,
-        );
+        let vmm_reservoir_manager =
+            VmmReservoirManager::spawn(&log, memory_sizes, reservoir_mode);
 
         let instances = InstanceManager::new(
             parent_log.clone(),

sled-agent/src/vmm_reservoir.rs

Lines changed: 24 additions & 32 deletions
@@ -12,7 +12,7 @@ use std::sync::atomic::{AtomicU64, Ordering};
 use std::thread;
 use tokio::sync::{broadcast, oneshot};
 
-use sled_hardware::HardwareManager;
+use sled_hardware::MemoryReservations;
 
 #[derive(thiserror::Error, Debug)]
 pub enum Error {
@@ -35,7 +35,7 @@ pub enum Error {
 #[derive(Debug, Clone, Copy)]
 pub enum ReservoirMode {
     Size(u32),
-    Percentage(u8),
+    Percentage(f32),
 }
 
 impl ReservoirMode {
@@ -44,7 +44,7 @@ impl ReservoirMode {
     ///
     /// Panic upon invalid configuration
     pub fn from_config(
-        percentage: Option<u8>,
+        percentage: Option<f32>,
         size_mb: Option<u32>,
     ) -> Option<ReservoirMode> {
         match (percentage, size_mb) {
@@ -135,6 +135,7 @@ impl VmmReservoirManagerHandle {
 
 /// Manage the VMM reservoir in a background thread
 pub struct VmmReservoirManager {
+    memory_reservations: MemoryReservations,
     reservoir_size: Arc<AtomicU64>,
     rx: flume::Receiver<ReservoirManagerMsg>,
     size_updated_tx: broadcast::Sender<()>,
@@ -146,7 +147,7 @@ pub struct VmmReservoirManager {
 impl VmmReservoirManager {
     pub fn spawn(
         log: &Logger,
-        hardware_manager: HardwareManager,
+        memory_reservations: sled_hardware::MemoryReservations,
         reservoir_mode: Option<ReservoirMode>,
     ) -> VmmReservoirManagerHandle {
         let log = log.new(o!("component" => "VmmReservoirManager"));
@@ -157,15 +158,15 @@ impl VmmReservoirManager {
         let (tx, rx) = flume::bounded(0);
         let reservoir_size = Arc::new(AtomicU64::new(0));
         let manager = VmmReservoirManager {
+            memory_reservations,
             reservoir_size: reservoir_size.clone(),
             size_updated_tx: size_updated_tx.clone(),
             _size_updated_rx,
             rx,
             log,
         };
-        let _manager_handle = Arc::new(thread::spawn(move || {
-            manager.run(hardware_manager, reservoir_mode)
-        }));
+        let _manager_handle =
+            Arc::new(thread::spawn(move || manager.run(reservoir_mode)));
         VmmReservoirManagerHandle {
             reservoir_size,
             tx,
@@ -174,31 +175,26 @@ impl VmmReservoirManager {
         }
     }
 
-    fn run(
-        self,
-        hardware_manager: HardwareManager,
-        reservoir_mode: Option<ReservoirMode>,
-    ) {
+    fn run(self, reservoir_mode: Option<ReservoirMode>) {
         match reservoir_mode {
             None => warn!(self.log, "Not using VMM reservoir"),
             Some(ReservoirMode::Size(0))
-            | Some(ReservoirMode::Percentage(0)) => {
+            | Some(ReservoirMode::Percentage(0.0)) => {
                 warn!(
                     self.log,
                     "Not using VMM reservoir (size 0 bytes requested)"
                 )
             }
             Some(mode) => {
-                if let Err(e) = self.set_reservoir_size(&hardware_manager, mode)
-                {
+                if let Err(e) = self.set_reservoir_size(mode) {
                     error!(self.log, "Failed to setup VMM reservoir: {e}");
                 }
             }
         }
 
         while let Ok(msg) = self.rx.recv() {
             let ReservoirManagerMsg::SetReservoirSize { mode, reply_tx } = msg;
-            match self.set_reservoir_size(&hardware_manager, mode) {
+            match self.set_reservoir_size(mode) {
                 Ok(()) => {
                     let _ = reply_tx.send(Ok(()));
                 }
@@ -213,33 +209,28 @@ impl VmmReservoirManager {
     /// Sets the VMM reservoir to the requested percentage of usable physical
     /// RAM or to a size in MiB. Either mode will round down to the nearest
     /// aligned size required by the control plane.
-    fn set_reservoir_size(
-        &self,
-        hardware: &sled_hardware::HardwareManager,
-        mode: ReservoirMode,
-    ) -> Result<(), Error> {
-        let hardware_physical_ram_bytes = hardware.usable_physical_ram_bytes();
+    fn set_reservoir_size(&self, mode: ReservoirMode) -> Result<(), Error> {
+        let vmm_eligible_memory = self.memory_reservations.vmm_eligible();
         let req_bytes = match mode {
             ReservoirMode::Size(mb) => {
                 let bytes = ByteCount::from_mebibytes_u32(mb).to_bytes();
-                if bytes > hardware_physical_ram_bytes {
+                if bytes > vmm_eligible_memory {
                     return Err(Error::ReservoirConfig(format!(
-                        "cannot specify a reservoir of {bytes} bytes when \
-                        physical memory is {hardware_physical_ram_bytes} bytes",
+                        "cannot specify a reservoir of {bytes} bytes when the \
+                        maximum reservoir size is {vmm_eligible_memory} bytes",
                     )));
                 }
                 bytes
             }
             ReservoirMode::Percentage(percent) => {
-                if !matches!(percent, 1..=99) {
+                if !matches!(percent, 0.1..100.0) {
                     return Err(Error::ReservoirConfig(format!(
                         "VMM reservoir percentage of {} must be between 0 and \
                         100",
                         percent
                     )));
                 };
-                (hardware_physical_ram_bytes as f64
-                    * (f64::from(percent) / 100.0))
+                (vmm_eligible_memory as f64 * (f64::from(percent) / 100.0))
                     .floor() as u64
             }
         };
@@ -258,15 +249,16 @@ impl VmmReservoirManager {
         }
 
         // The max ByteCount value is i64::MAX, which is ~8 million TiB.
-        // As this value is either a percentage of DRAM or a size in MiB
-        // represented as a u32, constructing this should always work.
+        // As this value is either a percentage of otherwise-unbudgeted DRAM or
+        // a size in MiB represented as a u32, constructing this should always
+        // work.
         let reservoir_size = ByteCount::try_from(req_bytes_aligned).unwrap();
         if let ReservoirMode::Percentage(percent) = mode {
             info!(
                 self.log,
-                "{}% of {} physical ram = {} bytes)",
+                "{}% of {} VMM eligible ram = {} bytes)",
                 percent,
-                hardware_physical_ram_bytes,
+                vmm_eligible_memory,
                 req_bytes,
             );
         }
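The match arms of `from_config` fall outside the hunk above, so the following is an assumed implementation, consistent with the doc comments ("mutually exclusive" options, "Panic upon invalid configuration") rather than copied from the repository:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReservoirMode {
    Size(u32),
    Percentage(f32),
}

// Resolve the two mutually exclusive config settings into a reservoir mode.
// Setting both is a configuration error, handled here by panicking at startup.
fn from_config(
    percentage: Option<f32>,
    size_mb: Option<u32>,
) -> Option<ReservoirMode> {
    match (percentage, size_mb) {
        (None, None) => None,
        (Some(p), None) => Some(ReservoirMode::Percentage(p)),
        (None, Some(mb)) => Some(ReservoirMode::Size(mb)),
        (Some(_), Some(_)) => panic!(
            "only one of vmm_reservoir_percentage and \
             vmm_reservoir_size_mb may be set"
        ),
    }
}

fn main() {
    assert_eq!(from_config(None, None), None);
    assert_eq!(
        from_config(Some(86.3), None),
        Some(ReservoirMode::Percentage(86.3))
    );
    assert_eq!(from_config(None, Some(2048)), Some(ReservoirMode::Size(2048)));
}
```

Note that moving `Percentage` from `u8` to `f32` is what allows fractional settings such as 86.3%, and why the range check in `set_reservoir_size` becomes a float range pattern (`0.1..100.0`) rather than `1..=99`.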

sled-hardware/src/illumos/mod.rs

Lines changed: 7 additions & 0 deletions
@@ -205,6 +205,7 @@ struct HardwareView {
     disks: HashMap<DiskIdentity, UnparsedDisk>,
     baseboard: Option<Baseboard>,
     online_processor_count: u32,
+    usable_physical_pages: u64,
     usable_physical_ram_bytes: u64,
 }
 
@@ -220,6 +221,7 @@ impl HardwareView {
             disks: HashMap::new(),
             baseboard: None,
             online_processor_count: sysconf::online_processor_count()?,
+            usable_physical_pages: sysconf::usable_physical_pages()?,
             usable_physical_ram_bytes: sysconf::usable_physical_ram_bytes()?,
         })
     }
@@ -230,6 +232,7 @@ impl HardwareView {
             disks: HashMap::new(),
             baseboard: None,
             online_processor_count: sysconf::online_processor_count()?,
+            usable_physical_pages: sysconf::usable_physical_pages()?,
             usable_physical_ram_bytes: sysconf::usable_physical_ram_bytes()?,
         })
     }
@@ -798,6 +801,10 @@ impl HardwareManager {
         self.inner.lock().unwrap().online_processor_count
     }
 
+    pub fn usable_physical_pages(&self) -> u64 {
+        self.inner.lock().unwrap().usable_physical_pages
+    }
+
     pub fn usable_physical_ram_bytes(&self) -> u64 {
         self.inner.lock().unwrap().usable_physical_ram_bytes
     }

sled-hardware/src/illumos/sysconf.rs

Lines changed: 14 additions & 4 deletions
@@ -25,14 +25,24 @@ pub fn online_processor_count() -> Result<u32, Error> {
     Ok(u32::try_from(res)?)
 }
 
-/// Returns the amount of RAM on this sled, in bytes.
-pub fn usable_physical_ram_bytes() -> Result<u64, Error> {
-    let phys_pages: u64 = illumos_utils::libc::sysconf(libc::_SC_PHYS_PAGES)
+/// Returns the number of physical RAM pages on this sled.
+pub fn usable_physical_pages() -> Result<u64, Error> {
+    let pages = illumos_utils::libc::sysconf(libc::_SC_PHYS_PAGES)
         .map_err(|e| Error::Sysconf { arg: "physical pages", e })?
         .try_into()?;
+    Ok(pages)
+}
+
+/// Returns the amount of RAM on this sled, in bytes.
+pub fn usable_physical_ram_bytes() -> Result<u64, Error> {
     let page_size: u64 = illumos_utils::libc::sysconf(libc::_SC_PAGESIZE)
         .map_err(|e| Error::Sysconf { arg: "physical page size", e })?
         .try_into()?;
 
-    Ok(phys_pages * page_size)
+    // Note that `_SC_PHYS_PAGES` counts, specifically, the number of
+    // `_SC_PAGESIZE` pages of physical memory. This means the multiplication
+    // below yields the total physical RAM bytes, even if in some sense there
+    // are fewer "actual" physical pages in page tables (such as if there were
+    // 2MiB pages mixed in on x86).
+    Ok(usable_physical_pages()? * page_size)
 }

sled-hardware/src/lib.rs

Lines changed: 74 additions & 0 deletions
@@ -4,6 +4,7 @@
 
 use schemars::JsonSchema;
 use serde::{Deserialize, Serialize};
+use slog::{Logger, info};
 
 cfg_if::cfg_if! {
     if #[cfg(target_os = "illumos")] {
@@ -75,3 +76,76 @@ pub enum SledMode {
     /// Force sled to run as a Scrimlet
     Scrimlet { asic: DendriteAsic },
 }
+
+/// Accounting for high watermark memory usage for various system purposes
+#[derive(Clone)]
+pub struct MemoryReservations {
+    log: Logger,
+    hardware_manager: HardwareManager,
+    /// The amount of memory expected to be used if "control plane" services
+    /// were all running on this sled. "control plane" here refers to services
+    /// that have roughly fixed memory use given differing sled hardware
+    /// configurations. DNS (internal, external), Nexus, Cockroach, or
+    /// ClickHouse are all examples of "control plane" here.
+    ///
+    /// This is a pessimistic overestimate; it is unlikely (and one might say
+    /// undesirable) that all such services are colocated on a sled, and (as
+    /// described in RFD 413) the budgeting for each service's RAM must
+    /// include headroom for those services potentially forking and bursting
+    /// required swap or resident pages.
+    //
+    // XXX: This is really something we should be told by Nexus, perhaps after
+    // starting with this conservative estimate to get the sled started.
+    control_plane_earmark_bytes: u64,
+    // XXX: Crucible involves some amount of memory in support of the volumes
+    // it manages. We should collect zpool size and estimate the memory that
+    // would be used if all available storage was dedicated to Crucible
+    // volumes. For now this is part of the control plane earmark.
+}
+
+impl MemoryReservations {
+    pub fn new(
+        log: Logger,
+        hardware_manager: HardwareManager,
+        control_plane_earmark_mib: Option<u32>,
+    ) -> MemoryReservations {
+        const MIB: u64 = 1024 * 1024;
+        let control_plane_earmark_bytes =
+            u64::from(control_plane_earmark_mib.unwrap_or(0)) * MIB;
+
+        Self { log, hardware_manager, control_plane_earmark_bytes }
+    }
+
+    /// Compute the amount of physical memory that could be set aside for the
+    /// VMM reservoir.
+    ///
+    /// The actual VMM reservoir will be smaller than this amount, and is
+    /// either a fixed amount of memory specified by `ReservoirMode::Size` or
+    /// a percentage of this amount specified by `ReservoirMode::Percentage`.
+    pub fn vmm_eligible(&self) -> u64 {
+        let hardware_physical_ram_bytes =
+            self.hardware_manager.usable_physical_ram_bytes();
+        // Don't like hardcoding a struct size from the host OS here like
+        // this, maybe we shuffle some bits around before merging.. On the
+        // other hand, the last time page_t changed was illumos-gate commit
+        // a5652762e5 from 2006.
+        const PAGE_T_SIZE: u64 = 120;
+        let max_page_t_bytes =
+            self.hardware_manager.usable_physical_pages() * PAGE_T_SIZE;
+
+        let vmm_eligible = hardware_physical_ram_bytes
+            - max_page_t_bytes
+            - self.control_plane_earmark_bytes;
+
+        info!(
+            self.log,
+            "Calculated eligible VMM reservoir size";
+            "vmm_eligible" => %vmm_eligible,
+            "physical_ram_bytes" => %hardware_physical_ram_bytes,
+            "max_page_t_bytes" => %max_page_t_bytes,
+            "control_plane_earmark_bytes" => %self.control_plane_earmark_bytes,
+        );
+
+        vmm_eligible
+    }
+}
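The `vmm_eligible()` arithmetic can be exercised standalone with the hardware values stubbed out rather than read from a `HardwareManager`. For a 1 TiB sled with 4 KiB pages, the `page_t` overhead of 120 bytes per page works out to just under 3% of RAM (about 29.6 GiB), which is where the `30.0 / 1024` term in the commit message's sizing examples comes from:

```rust
const PAGE_T_SIZE: u64 = 120; // bytes per page_t, per the diff above
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

// Same subtraction as MemoryReservations::vmm_eligible(), with the inputs
// passed in directly instead of queried from the hardware.
fn vmm_eligible(
    physical_ram_bytes: u64,
    physical_pages: u64,
    earmark_mib: u64,
) -> u64 {
    let max_page_t_bytes = physical_pages * PAGE_T_SIZE;
    physical_ram_bytes - max_page_t_bytes - earmark_mib * MIB
}

fn main() {
    // Roughly a 1 TiB Gimlet: 1012 GiB visible RAM, 4 KiB pages, and a
    // 44 GiB (45056 MiB) control plane earmark.
    let ram = 1012 * GIB;
    let pages = ram / 4096;
    let eligible = vmm_eligible(ram, pages, 44 * 1024);
    // 1012 - 29.6484375 (page_t) - 44 (earmark) = 938.3515625 GiB, matching
    // the pre-percentage figure in the commit message's "After" formula.
    println!("{:.7} GiB", eligible as f64 / GIB as f64);
}
```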

sled-hardware/src/non_illumos/mod.rs

Lines changed: 4 additions & 0 deletions
@@ -45,6 +45,10 @@ impl HardwareManager {
         unimplemented!("Accessing hardware unsupported on non-illumos");
     }
 
+    pub fn usable_physical_pages(&self) -> u64 {
+        unimplemented!("Accessing hardware unsupported on non-illumos");
+    }
+
     pub fn usable_physical_ram_bytes(&self) -> u64 {
         unimplemented!("Accessing hardware unsupported on non-illumos");
     }

0 commit comments