keep the system from running out of disk space #7875

@davepacheco

(This high-level issue might well get turned into smaller issues and then closed.)

Recent customer issues have highlighted several problems around storage accounting and space management.

Problems

  • The process of allocating Crucible regions assumes that Crucible can use the whole disk. This does not account for other uses of U.2 devices: namely, the root filesystems and persistent datasets of control plane zones and Propolis zones. (Anything else?)
  • The process of allocating Crucible regions does not account for Crucible overhead (a fudge factor of about 1.2-1.25x).
  • We don't have quotas or reservations in most places, so if something does go haywire, it can take a lot out with it.
  • There appears to be no process that clears old data out of debug datasets. We have deployed systems where some disks have upwards of 1 TiB of space used by these files.
  • While investigating this, we discovered a separate problem: crypt/debug dataset not mounted #7874.

All of these can cause systems to run out of disk space, triggering various ugly failures.
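
The first two problems can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 1.25x overhead value (the upper end of the fudge factor mentioned above), and the example sizes are assumptions, not the actual allocation code.

```python
# Hypothetical sketch: these names do not come from the real allocator.
GIB = 1024**3


def crucible_allocatable_bytes(disk_bytes, other_reserved_bytes, overhead_percent=125):
    """Return how many bytes of regions (as requested) fit on one disk.

    other_reserved_bytes: space set aside for non-Crucible consumers
        (zone root filesystems, persistent datasets, and so on).
    overhead_percent: assumed Crucible overhead, 125 meaning 1.25x.
    """
    if overhead_percent < 100:
        raise ValueError("overhead_percent must be at least 100")
    remaining = disk_bytes - other_reserved_bytes
    if remaining <= 0:
        return 0
    # Integer math: a region of N bytes consumes N * overhead_percent / 100.
    return remaining * 100 // overhead_percent


def region_fits(requested_bytes, allocated_bytes, disk_bytes, other_reserved_bytes):
    """True if a new region fits alongside the regions already allocated."""
    limit = crucible_allocatable_bytes(disk_bytes, other_reserved_bytes)
    return allocated_bytes + requested_bytes <= limit


if __name__ == "__main__":
    # Made-up sizes: a 1000 GiB disk with 250 GiB reserved for other uses.
    limit = crucible_allocatable_bytes(1000 * GIB, 250 * GIB)
    print(limit // GIB, "GiB of regions can be allocated")
```

With these made-up numbers, an allocator that assumed the whole disk was available would hand out up to 1000 GiB of regions, while only 600 GiB actually fit once other consumers and overhead are counted.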

Proposed steps

For R14

Longer term

  • for each consumer of disk space, on all disks (this would include: crash dumps, core dumps, GZ data, zone root filesystems, zone persistent data, etc.)
    • make an estimate of how much disk space it needs
    • figure out what steps might be necessary to keep that under its limit (e.g., log rotation, archiving, deletion)
    • update the estimate we made above (for how much Crucible is allowed to use) to reflect this
  • for Crucible regions, implement measures to prevent unbounded snapshot usage (e.g., pantry scrubber) so we can shrink the region dataset's 3x quota
  • update Reconfigurator to apply reservations and/or quotas to implement those limits
  • update Reconfigurator to look at the storage allocated and required when placing components (so that we don't overprovision storage on disks)
  • figure out why debug datasets are sometimes not mounted and how to fix it (see: crypt/debug dataset not mounted #7874)
  • implement some kind of monitoring for disk space usage (raise an active problem if datasets get full)
  • review/improve policy for deleting files from the debug dataset (current logic)
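
The budgeting, placement, and monitoring steps above could fit together roughly as follows. This is a minimal sketch under stated assumptions: `Consumer`, `crucible_budget`, `pick_disk`, `usage_alerts`, the 90% threshold, and all sizes are hypothetical and do not reflect Reconfigurator's actual data model or API.

```python
# Hypothetical sketch: names, threshold, and sizes are illustrative only.
from dataclasses import dataclass

GIB = 1024**3


@dataclass(frozen=True)
class Consumer:
    name: str
    limit_bytes: int  # estimated upper bound, to be enforced by a quota


def crucible_budget(disk_bytes, consumers):
    """Space left for Crucible after every other consumer's limit is set aside."""
    reserved = sum(c.limit_bytes for c in consumers)
    if reserved > disk_bytes:
        raise ValueError("consumer limits exceed disk capacity")
    return disk_bytes - reserved


def pick_disk(disks, needed_bytes):
    """Choose the disk with the most uncommitted space that still fits the request.

    disks maps a disk id to (capacity_bytes, committed_bytes).
    Returns None when no disk fits, rather than overprovisioning.
    """
    best, best_free = None, -1
    for disk_id, (capacity, committed) in sorted(disks.items()):
        free = capacity - committed
        if free >= needed_bytes and free > best_free:
            best, best_free = disk_id, free
    return best


def usage_alerts(usage, threshold_percent=90):
    """Return the datasets whose usage is at or above the threshold of their quota.

    usage maps a dataset name to (used_bytes, quota_bytes).
    """
    return sorted(
        name
        for name, (used, quota) in usage.items()
        if quota > 0 and used * 100 >= quota * threshold_percent
    )


if __name__ == "__main__":
    consumers = [Consumer("zone root filesystems", 100 * GIB),
                 Consumer("debug dataset", 150 * GIB)]
    print(crucible_budget(1000 * GIB, consumers) // GIB, "GiB left for Crucible")
```

Refusing a placement when nothing fits is a deliberate choice in this sketch. It surfaces a capacity shortfall at planning time instead of as a full disk later.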

Labels: customer (For any bug reports or feature requests tied to customer requests)

    Type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions