Labels: customer (for any bug reports or feature requests tied to customer requests)
(This high-level issue might well get turned into smaller issues and then closed.)
Recent customer issues have highlighted several problems around storage accounting and space management.
Problems
- The process of allocating Crucible regions assumes that Crucible can use the whole disk. This does not account for other uses of U.2 devices: namely, the root filesystems and persistent datasets of control plane zones and Propolis zones. (Anything else?)
- The process of allocating Crucible regions does not account for Crucible overhead (a fudge factor of about 1.2-1.25x).
- We don't have quotas or reservations in most places, so if something does go haywire, it can take a lot out with it.
- There appears to be no process that clears old data out of debug datasets. We have deployed systems where some disks have upwards of 1 TiB of space used by these files.
- While investigating this, we discovered that the debug dataset is sometimes not mounted (crypt/debug dataset not mounted #7874).
All of these can cause systems to run out of disk space, triggering various ugly failures.
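To make the overhead problem concrete, here is a minimal sketch of the arithmetic, assuming the 1.25x end of the fudge factor quoted above. The function name `reserved_bytes` and the choice to express the factor as an integer ratio are illustrative assumptions, not the actual Omicron or Crucible code.

```rust
/// Hypothetical helper (not an Omicron API): the space a region really
/// consumes once overhead is included. The factor is passed as a ratio so
/// that 1.25x can be written exactly as 5/4.
pub fn reserved_bytes(requested: u64, factor_num: u64, factor_den: u64) -> Option<u64> {
    if factor_den == 0 {
        return None; // a zero denominator is not a meaningful factor
    }
    // Widen to u128 so the multiplication cannot overflow.
    let scaled = u128::from(requested) * u128::from(factor_num);
    let den = u128::from(factor_den);
    // Round up: under-reserving is the failure mode we want to avoid.
    u64::try_from((scaled + den - 1) / den).ok()
}
```

Under this assumption, a region requested at 100 GiB occupies 125 GiB, so an allocator that only tracks the requested size undercounts pool usage by a quarter of every region.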
Proposed steps
For R14
- make a quick, conservative guess for how much storage space we can safely give to Crucible while leaving enough for other consumers - outcome summarized here.
- update Crucible region allocation to use this as the total size available to Crucible rather than the pool size (Prevent region allocation from filling pools #7912)
- update accounting data for Crucible regions to reflect the 1.2-1.25x fudge factor (specific proposal was: create a new column with this information so that the table reflects both actual space requested and what we've allocated for it) (Account for Crucible Agent reservation overhead #7885)
- update the Crucible region allocation to use the value with-fudge-factor instead of the requested size of each region (Account for Crucible Agent reservation overhead #7885)
- fix debug dataset mount issue ([sled-agent] Ensure that datasets get mounted #7887, [sled-agent] Make new mountpoints immutable #7888)
- add omdb tool to identify pools that are overprovisioned, when considering the Crucible fudge factor and the max space allowed for Crucible (Prevent region allocation from filling pools #7912)
- create support runbook that uses omdb to identify regions that should be replaced and trigger replacement for them (and plan to do this during R14 upgrade). PENDING: may punt this if the number of overprovisioned pools is small; the tools will be necessary when we set quotas on crucible datasets.
- create support runbook for clearing out large debug directories and reserve clickhouse dataset + 2 spares (and plan to do this during R14 upgrade) https://github.com/oxidecomputer/customer-support/issues/333
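The overprovisioning check described in the R14 steps could be sketched roughly as follows. This is not the omdb implementation: the `Pool` struct, its field names, and the fixed 1.25x factor are all assumptions made for illustration.

```rust
/// Hypothetical view of a pool (field names are illustrative, not the
/// real database schema).
pub struct Pool {
    pub name: String,
    /// Maximum space Crucible is allowed to use on this pool.
    pub crucible_limit_bytes: u64,
    /// Requested size of each region placed on this pool.
    pub region_requested_bytes: Vec<u64>,
}

/// Requested size scaled by an assumed 1.25x (5/4) overhead, rounded up.
pub fn with_overhead(requested: u64) -> u128 {
    (u128::from(requested) * 5 + 3) / 4
}

/// Names of pools whose regions, including overhead, exceed the
/// Crucible limit for that pool.
pub fn overprovisioned(pools: &[Pool]) -> Vec<&str> {
    pools
        .iter()
        .filter(|p| {
            let reserved: u128 = p
                .region_requested_bytes
                .iter()
                .map(|&r| with_overhead(r))
                .sum();
            reserved > u128::from(p.crucible_limit_bytes)
        })
        .map(|p| p.name.as_str())
        .collect()
}
```

The comparison is against the Crucible limit rather than the raw pool size, which matches the intent of leaving room for other consumers. A pool that sits exactly at its limit is not reported.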
Longer term
- for each consumer of disk space, on all disks (this would include: crash dumps, core dumps, GZ data, zone root filesystems, zone persistent data, etc.):
  - make an estimate of how much disk space it needs
  - figure out what steps might be necessary to keep that under its limit (e.g., log rotation, archiving, deletion)
  - update the estimate we made above (for how much Crucible is allowed to use) to reflect this
- for crucible regions, implement measures to prevent unbounded snapshot usage (e.g., pantry scrubber) so we can shrink the region dataset's 3x quota
- update Reconfigurator to apply reservations and/or quotas to implement those limits
- update Reconfigurator to look at the storage allocated and required when placing components (so that we don't overprovision storage on disks)
- figure out why debug datasets are sometimes not mounted and how to fix it (see: crypt/debug dataset not mounted #7874)
- implement some kind of monitoring for disk space usage (raise an active problem if datasets get full)
- review/improve policy for deleting files from the debug dataset (current logic)
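The per-consumer budgeting in the longer-term list amounts to subtracting each consumer's estimate from the pool size and giving Crucible what is left. A minimal sketch, assuming hypothetical names (`Consumer`, `crucible_allowance`) and caller-supplied estimates:

```rust
/// Hypothetical record of one non-Crucible consumer of disk space
/// (e.g. crash dumps or zone root filesystems) and its estimated need.
pub struct Consumer {
    pub name: &'static str,
    pub estimate_bytes: u64,
}

/// Space left for Crucible after every other consumer's estimate is
/// set aside. Returns None if the estimates alone exceed the pool.
pub fn crucible_allowance(pool_bytes: u64, consumers: &[Consumer]) -> Option<u64> {
    let mut remaining = pool_bytes;
    for c in consumers {
        // checked_sub yields None on underflow, which `?` propagates.
        remaining = remaining.checked_sub(c.estimate_bytes)?;
    }
    Some(remaining)
}
```

Returning `None` rather than zero when the estimates do not fit makes an infeasible budget visible instead of silently granting Crucible nothing. The same per-consumer figures could later feed the reservations and quotas mentioned above.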