12 changes: 0 additions & 12 deletions .github/buildomat/jobs/deploy.sh
@@ -429,18 +429,6 @@ do
ACTUAL_ZPOOL_COUNT=$(pfexec zlogin oxz_switch /opt/oxide/omdb/bin/omdb db zpool list -i | wc -l)
done

# The bootstrap command creates a disk, so before that: adjust the control plane
# storage buffer to 0 as the virtual hardware only creates 20G pools

pfexec zlogin oxz_switch /opt/oxide/omdb/bin/omdb db zpool list

for ZPOOL in $(pfexec zlogin oxz_switch /opt/oxide/omdb/bin/omdb db zpool list -i);
do
pfexec zlogin oxz_switch /opt/oxide/omdb/bin/omdb -w db zpool set-storage-buffer "${ZPOOL}" 0
done

pfexec zlogin oxz_switch /opt/oxide/omdb/bin/omdb db zpool list

Collaborator: 🚀

export RUST_BACKTRACE=1
export E2E_TLS_CERT IPPOOL_START IPPOOL_END
eval "$(./target/debug/bootstrap)"
14 changes: 13 additions & 1 deletion nexus-config/src/nexus_config.rs
@@ -275,6 +275,7 @@ pub struct MgdConfig {
struct UnvalidatedTunables {
max_vpc_ipv4_subnet_prefix: u8,
load_timeout: Option<std::time::Duration>,
control_plane_storage_buffer_gb: u32,
}

/// Configuration for HTTP clients to external services.
@@ -303,6 +304,11 @@ pub struct Tunables {
///
/// If "None", nexus loops forever during initialization.
pub load_timeout: Option<std::time::Duration>,

/// The amount of disk space to reserve for non-Crucible / control plane
/// storage in gibibytes. This amount represents a buffer that the region
/// allocation query will not use for each U2.
pub control_plane_storage_buffer_gb: u32,
}

Collaborator (suggested change):
- /// The amount of disk space to reserve for non-Crucible / control plane
- /// storage in gibibytes. This amount represents a buffer that the region
+ /// The amount of space to reserve per-disk for non-Crucible storage (i.e., control plane
+ /// storage) in gibibytes. This amount represents a buffer that the region

Trying to be really clear for people not already familiar with this.

Collaborator:
If I'm reading this right: this is part of PackageConfig, which is documented as "configuration parameters known at compile-time," as opposed to DeploymentConfig, which holds runtime, deployment-specific tunables. But isn't the point of doing it this way to make it runtime?

Edit: I think this is basically fine -- see my long comment below for more.

Collaborator (suggested change):
- pub control_plane_storage_buffer_gb: u32,
+ pub control_plane_storage_buffer_gib: u32,

// Convert from the unvalidated tunables, verifying each parameter as needed.
@@ -314,6 +320,8 @@ impl TryFrom<UnvalidatedTunables> for Tunables {
Ok(Tunables {
max_vpc_ipv4_subnet_prefix: unvalidated.max_vpc_ipv4_subnet_prefix,
load_timeout: unvalidated.load_timeout,
control_plane_storage_buffer_gb: unvalidated
.control_plane_storage_buffer_gb,
})
}
}
@@ -365,6 +373,7 @@ impl Default for Tunables {
Tunables {
max_vpc_ipv4_subnet_prefix: MAX_VPC_IPV4_SUBNET_PREFIX,
load_timeout: None,
control_plane_storage_buffer_gb: 0,
}
}
}
@@ -1014,6 +1023,7 @@ mod test {
trusted_root = "/path/to/root.json"
[tunables]
max_vpc_ipv4_subnet_prefix = 27
control_plane_storage_buffer_gb = 0
[deployment]
id = "28b90dc4-c22a-65ba-f49a-f051fe01208f"
rack_id = "38b90dc4-c22a-65ba-f49a-f051fe01208f"
@@ -1156,7 +1166,8 @@ mod test {
schema: None,
tunables: Tunables {
max_vpc_ipv4_subnet_prefix: 27,
load_timeout: None
load_timeout: None,
control_plane_storage_buffer_gb: 0,
},
dendrite: HashMap::from([(
SwitchLocation::Switch0,
@@ -1477,6 +1488,7 @@ mod test {
default_base_url = "http://example.invalid/"
[tunables]
max_vpc_ipv4_subnet_prefix = 100
control_plane_storage_buffer_gb = 0
[deployment]
id = "28b90dc4-c22a-65ba-f49a-f051fe01208f"
rack_id = "38b90dc4-c22a-65ba-f49a-f051fe01208f"
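As a reduced sketch of what this change means for config parsing (illustrative only -- the real `UnvalidatedTunables` also carries `load_timeout` and goes through the `TryFrom` validation above; assumes the `serde` and `toml` crates as dependencies):

```rust
use serde::Deserialize;

// Reduced stand-in for the UnvalidatedTunables in this diff.
#[derive(Debug, Deserialize)]
struct UnvalidatedTunables {
    max_vpc_ipv4_subnet_prefix: u8,
    control_plane_storage_buffer_gb: u32,
}

fn main() {
    let tunables: UnvalidatedTunables = toml::from_str(
        "max_vpc_ipv4_subnet_prefix = 26\ncontrol_plane_storage_buffer_gb = 250",
    )
    .unwrap();

    // Nexus widens the gibibyte count to bytes with
    // ByteCount::from_gibibytes_u32 (see nexus/src/app/mod.rs below):
    // 250 GiB = 250 * 2^30 bytes.
    let bytes = u64::from(tunables.control_plane_storage_buffer_gb) << 30;
    assert_eq!(bytes, 268_435_456_000);
}
```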
3 changes: 3 additions & 0 deletions nexus/examples/config-second.toml
@@ -86,6 +86,9 @@ default_request_body_max_bytes = 1048576
# IPv4 subnetwork. This size allows for ~60 hosts.
max_vpc_ipv4_subnet_prefix = 26

# For development environments, choose a zero sized storage buffer
control_plane_storage_buffer_gb = 0

Collaborator (suggested change):
- # For development environments, choose a zero sized storage buffer
+ # In development environments, reserve no space for control plane storage.

# Configuration for interacting with the dataplane daemon
[dendrite.switch0]
address = "[::1]:12224"
3 changes: 3 additions & 0 deletions nexus/examples/config.toml
@@ -72,6 +72,9 @@ url = "postgresql://root@[::1]:32221/omicron?sslmode=disable"
# IPv4 subnetwork. This size allows for ~60 hosts.
max_vpc_ipv4_subnet_prefix = 26

# For development environments, choose a zero sized storage buffer
control_plane_storage_buffer_gb = 0

Collaborator (suggested change):
- # For development environments, choose a zero sized storage buffer
+ # In development environments, reserve no space for control plane storage.

# Configuration for interacting with the dataplane daemon
[dendrite.switch0]
address = "[::1]:12224"
6 changes: 6 additions & 0 deletions nexus/src/app/background/init.rs
@@ -137,6 +137,7 @@ use nexus_db_model::DnsGroup;
use nexus_db_queries::context::OpContext;
use nexus_db_queries::db::DataStore;
use nexus_types::deployment::PendingMgsUpdates;
use omicron_common::api::external::ByteCount;
use omicron_uuid_kinds::OmicronZoneUuid;
use oximeter::types::ProducerRegistry;
use std::collections::BTreeMap;
@@ -525,6 +526,7 @@ impl BackgroundTasksInitializer {
inventory_watcher.clone(),
config.physical_disk_adoption.disable,
rack_id,
args.control_plane_storage_buffer,
),
),
opctx: opctx.child(BTreeMap::new()),
@@ -959,6 +961,10 @@ pub struct BackgroundTasksData {
pub webhook_delivery_client: reqwest::Client,
/// Channel for configuring pending MGS updates
pub mgs_updates_tx: watch::Sender<PendingMgsUpdates>,
/// The amount of disk space to reserve for non-Crucible / control plane
/// storage. This amount represents a buffer that the region allocation query
/// will not use for each U2.
pub control_plane_storage_buffer: ByteCount,
}

Collaborator:
Same edit as above. (Mainly, I think it's easy to misparse this as "non-(Crucible/control-plane) storage", which is very wrong.)

/// Starts the three DNS-propagation-related background tasks for either
7 changes: 5 additions & 2 deletions nexus/src/app/background/tasks/physical_disk_adoption.rs
@@ -11,7 +11,6 @@
//!
//! In the future, this may become more explicitly operator-controlled.

use crate::app::CONTROL_PLANE_STORAGE_BUFFER;
use crate::app::background::BackgroundTask;
use futures::FutureExt;
use futures::future::BoxFuture;
@@ -20,6 +19,7 @@ use nexus_db_model::Zpool;
use nexus_db_queries::context::OpContext;
use nexus_db_queries::db::DataStore;
use nexus_types::identity::Asset;
use omicron_common::api::external::ByteCount;
use omicron_common::api::external::DataPageParams;
use omicron_uuid_kinds::CollectionUuid;
use omicron_uuid_kinds::GenericUuid;
@@ -34,6 +34,7 @@ pub struct PhysicalDiskAdoption {
disable: bool,
rack_id: Uuid,
rx_inventory_collection: watch::Receiver<Option<CollectionUuid>>,
control_plane_storage_buffer: ByteCount,
}

impl PhysicalDiskAdoption {
@@ -42,12 +43,14 @@ impl PhysicalDiskAdoption {
rx_inventory_collection: watch::Receiver<Option<CollectionUuid>>,
disable: bool,
rack_id: Uuid,
control_plane_storage_buffer: ByteCount,
) -> Self {
PhysicalDiskAdoption {
datastore,
disable,
rack_id,
rx_inventory_collection,
control_plane_storage_buffer,
}
}
}
@@ -140,7 +143,7 @@ impl BackgroundTask for PhysicalDiskAdoption {
Uuid::new_v4(),
inv_disk.sled_id.into_untyped_uuid(),
disk.id(),
CONTROL_PLANE_STORAGE_BUFFER.into(),
self.control_plane_storage_buffer.into(),
);

let result = self.datastore.physical_disk_and_zpool_insert(
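The effect of threading the buffer through to `Zpool::new` is that each adopted pool records how much space the region allocation query must leave alone. A toy sketch of that accounting (hypothetical names -- the real logic lives in the database allocation query, not in code like this):

```rust
/// Toy model of a zpool carrying a control-plane storage buffer.
struct ZpoolModel {
    total_bytes: u64,
    crucible_allocated_bytes: u64,
    control_plane_buffer_bytes: u64,
}

impl ZpoolModel {
    /// Space region allocation may still hand out: whatever is not already
    /// used by Crucible regions, minus the buffer reserved for control
    /// plane (non-Crucible) storage.
    fn allocatable_bytes(&self) -> u64 {
        self.total_bytes
            .saturating_sub(self.crucible_allocated_bytes)
            .saturating_sub(self.control_plane_buffer_bytes)
    }
}

fn main() {
    // A 20 GiB virtual pool (as in the deploy.sh rig above) with the 250 GiB
    // production buffer could never allocate a region -- which is why the
    // development configs set the buffer to 0.
    let dev_pool = ZpoolModel {
        total_bytes: 20u64 << 30,
        crucible_allocated_bytes: 0,
        control_plane_buffer_bytes: 250u64 << 30,
    };
    assert_eq!(dev_pool.allocatable_bytes(), 0);
}
```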
19 changes: 11 additions & 8 deletions nexus/src/app/mod.rs
@@ -135,14 +135,6 @@ pub const MAX_DISK_SIZE_BYTES: u64 = 1023 * (1 << 30); // 1023 GiB
/// This value is arbitrary
pub const MAX_SSH_KEYS_PER_INSTANCE: u32 = 100;

/// The amount of disk space to reserve for non-Crucible / control plane
/// storage. This amount represents a buffer that the region allocation query
/// will not use for each U2.
///
/// See oxidecomputer/omicron#7875 for the 250G determination.
pub const CONTROL_PLANE_STORAGE_BUFFER: ByteCount =
ByteCount::from_gibibytes_u32(250);

/// Manages an Oxide fleet -- the heart of the control plane
pub struct Nexus {
/// uuid for this nexus instance.
@@ -253,6 +245,11 @@ pub struct Nexus {

/// reports status of pending MGS-managed updates
mgs_update_status_rx: watch::Receiver<MgsUpdateDriverStatus>,

/// The amount of disk space to reserve for non-Crucible / control plane
/// storage. This amount represents a buffer that the region allocation query
/// will not use for each U2.
control_plane_storage_buffer: ByteCount,
}

impl Nexus {
@@ -429,6 +426,10 @@ impl Nexus {
let mgs_update_status_rx = mgs_update_driver.status_rx();
let _mgs_driver_task = tokio::spawn(mgs_update_driver.run());

let control_plane_storage_buffer = ByteCount::from_gibibytes_u32(
config.pkg.tunables.control_plane_storage_buffer_gb,
);

let nexus = Nexus {
id: config.deployment.id,
rack_id,
@@ -480,6 +481,7 @@ impl Nexus {
)),
tuf_artifact_replication_tx,
mgs_update_status_rx,
control_plane_storage_buffer,
};

// TODO-cleanup all the extra Arcs here seems wrong
@@ -539,6 +541,7 @@ impl Nexus {
},
tuf_artifact_replication_rx,
mgs_updates_tx,
control_plane_storage_buffer,
},
);

3 changes: 1 addition & 2 deletions nexus/src/app/rack.rs
@@ -4,7 +4,6 @@

//! Rack management

use crate::app::CONTROL_PLANE_STORAGE_BUFFER;
use crate::external_api::params;
use crate::external_api::params::CertificateCreate;
use crate::external_api::shared::ServiceUsingCertificate;
@@ -137,7 +136,7 @@ impl super::Nexus {
pool.id,
pool.sled_id,
pool.physical_disk_id,
CONTROL_PLANE_STORAGE_BUFFER.into(),
self.control_plane_storage_buffer.into(),
)
})
.collect();
3 changes: 3 additions & 0 deletions nexus/tests/config.test.toml
@@ -34,6 +34,9 @@ address = "[::1]:0"
# Allow small subnets, so we can test IP address exhaustion easily / quickly
max_vpc_ipv4_subnet_prefix = 29

# For development environments, choose a zero sized storage buffer
control_plane_storage_buffer_gb = 0

[deployment]
# Identifier for this instance of Nexus.
# NOTE: The test suite always overrides this.
7 changes: 7 additions & 0 deletions smf/nexus/multi-sled/config-partial.toml
@@ -24,6 +24,13 @@ if_exists = "append"
# [schema]
# schema_dir = "/var/nexus/schema/crdb"

[tunables]
max_vpc_ipv4_subnet_prefix = 26

# Reserve space for non-Crucible storage
# See oxidecomputer/omicron#7875 for the 250G determination.
control_plane_storage_buffer_gb = 250

Collaborator:
tl;dr: I think what you've got here is probably right but it took me a while to work through some other options and I want to record that here.


This is a different approach than I intended, but it might be just as good for what we want!

What I was trying to propose was something like:

  1. RSS accepts this value as configuration here, where it takes other deployment-time parameters:

    /// Configuration for the "rack setup service".
    ///
    /// The Rack Setup Service should be responsible for one-time setup actions,
    /// such as CockroachDB placement and initialization. Without operator
    /// intervention, however, these actions need a way to be automated in our
    /// deployment.
    #[derive(Clone, Deserialize, Serialize, PartialEq, JsonSchema)]
    #[serde(try_from = "UnvalidatedRackInitializeRequest")]
    pub struct RackInitializeRequest {
        /// The set of peer_ids required to initialize trust quorum
        ///
        /// The value is `None` if we are not using trust quorum
        pub trust_quorum_peers: Option<Vec<Baseboard>>,
        /// Describes how bootstrap addresses should be collected during RSS.
        pub bootstrap_discovery: BootstrapAddressDiscovery,
        /// The external NTP server addresses.
        pub ntp_servers: Vec<String>,
        /// The external DNS server addresses.
        pub dns_servers: Vec<IpAddr>,
        /// Ranges of the service IP pool which may be used for internal services.
        // TODO(https://github.com/oxidecomputer/omicron/issues/1530): Eventually,
        // we want to configure multiple pools.
        pub internal_services_ip_pool_ranges: Vec<IpRange>,
        /// Service IP addresses on which we run external DNS servers.
        ///
        /// Each address must be present in `internal_services_ip_pool_ranges`.
        pub external_dns_ips: Vec<IpAddr>,
        /// DNS name for the DNS zone delegated to the rack for external DNS
        pub external_dns_zone_name: String,
        /// initial TLS certificates for the external API
        pub external_certificates: Vec<Certificate>,
        /// Configuration of the Recovery Silo (the initial Silo)
        pub recovery_silo: RecoverySiloConfig,
        /// Initial rack network configuration
        pub rack_network_config: RackNetworkConfig,
        /// IPs or subnets allowed to make requests to user-facing services
        #[serde(default = "default_allowed_source_ips")]
        pub allowed_source_ips: AllowedSourceIps,
    }

  2. Wicket provides it with a hardcoded value of 250 here, where it constructs that struct:

    let request = RackInitializeRequest {
        trust_quorum_peers,
        bootstrap_discovery: BootstrapAddressDiscovery::OnlyThese(
            bootstrap_ips,
        ),
        ntp_servers: self.ntp_servers.clone(),
        dns_servers: self.dns_servers.clone(),
        internal_services_ip_pool_ranges,
        external_dns_ips: self.external_dns_ips.clone(),
        external_dns_zone_name: self.external_dns_zone_name.clone(),
        external_certificates: self.external_certificates.clone(),
        recovery_silo: RecoverySiloConfig {
            silo_name: Name::try_from(RECOVERY_SILO_NAME).unwrap(),
            user_name: UserId::try_from(RECOVERY_SILO_USERNAME).unwrap(),
            user_password_hash,
        },
        rack_network_config,
        allowed_source_ips: self
            .allowed_source_ips
            .clone()
            .unwrap_or(AllowedSourceIps::Any),
    };

  3. Sled agent would write it into the "deployment" section of the Nexus config file, here:

    let deployment_config = DeploymentConfig {
        id: *id,
        rack_id: info.rack_id,
        techport_external_server_port: NEXUS_TECHPORT_EXTERNAL_PORT,
        dropshot_external: ConfigDropshotWithTls {
            tls: *external_tls,
            dropshot: dropshot::ConfigDropshot {
                bind_address: SocketAddr::new(*opte_ip, nexus_port),
                default_request_body_max_bytes: 1048576,
                default_handler_task_mode: HandlerTaskMode::Detached,
                log_headers: vec![],
            },
        },
        dropshot_internal: dropshot::ConfigDropshot {
            bind_address: (*internal_address).into(),
            default_request_body_max_bytes: 1048576,
            default_handler_task_mode: HandlerTaskMode::Detached,
            log_headers: vec![],
        },
        internal_dns: nexus_config::InternalDns::FromSubnet {
            subnet: Ipv6Subnet::<RACK_PREFIX>::new(
                info.underlay_address,
            ),
        },
        database: nexus_config::Database::FromDns,
        external_dns_servers: external_dns_servers.clone(),
        // TCP connections bound by HTTP clients of external services
        // should always be bound on our OPTE interface.
        external_http_clients: nexus_config::ExternalHttpClientConfig {
            interface: Some(opte_iface_name.to_string()),
        },
    };

  4. For non-Wicket deployments (which include helios-deploy, the "how-to-run" flow, and a4x2), these use the RSS config files in smf/sled-agent, and those would all specify 0 (or some other suitable number).

I thought this was similar to how some other config works (e.g., whether Nexus is configured for TLS), but it's slightly different:

  • re: Nexus TLS: Sled Agent does write it into the config here, but it's based on the Nexus zone config, which itself is presumably determined by RSS based on the presence of certificates
  • the name of the recovery silo and its user follow a similar flow except they get provided to Nexus with the RSS-to-Nexus handoff instead of being written into the Nexus config file
  • the rack id is written into the Nexus config by Sled Agent but it's based on how the Sled was initialized, not the rack initialization request.

If we want this to come in as an RSS tunable, I think there are basically three flows:

  1. Put it into the sled config so every sled agent has it. Use that when Sled Agent writes Nexus's config file. Easy, works, but kind of gross because this isn't really sled config.
  2. Put it into the Nexus zone config and have RSS fill it in there, then have sled agent use that similar to how it fills in external_dns_servers, external_tls, etc. This is fine but a fair bit of work and also makes #3407 (Sled Agent uses implicit interfaces with components it provisions) slightly worse.
  3. Put it into the RSS-to-Nexus handoff. This is fine but means that it's actually dynamic in Nexus -- it'd be stored in the database and presumably re-read periodically or something. This too is a lot more work (though less than (2), and it avoids making #3407 any worse).

The main advantage to one of these approaches is that it's truly a deployment-time config. What's in this PR is still compile-time. It's just being determined late enough that we know which deployment we're targeting. This works okay right now because we already need different sets of packages for these deployments. I think we want to get away from needing different packages. But that will already be a bunch of work and this PR doesn't make it much harder.

Contributor (author):
I just finished updating and testing this branch with a4x2, and you were spot on with your suspicion below: the 250 GiB storage buffer is applied in the a4x2 case.

My reasoning here was that the value(s) were known at "compile-time": it's either 0 or 250 GiB. But this doesn't work for a4x2, which uses the multi-sled nexus config-partial. The distinction I was missing was that multi-sled does not equal prod, and single-sled does not equal dev.

I'm going to change this PR to use the method recommended in this comment, which will require a follow-up testbed PR to change the config-rss.toml over there.

Contributor (author):
@davepacheco I started implementing (2), but ended up implementing (3) -- I think it works better to store the value in the database.
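
For readers following the thread, here is a loose sketch of what flow (3) implies. Every name below is hypothetical except `ByteCount` (a real omicron_common type this PR already uses); the "settings" stand-in is a placeholder for whatever table and query the final implementation chooses:

```rust
// Hypothetical shapes only -- not the actual omicron interface. Flow (3):
// the buffer rides the RSS-to-Nexus handoff, is persisted at rack setup,
// and is read back at runtime instead of living in the package config.
use omicron_common::api::external::ByteCount;

/// Stand-in for the RSS-to-Nexus handoff payload.
struct RackInitHandoff {
    control_plane_storage_buffer: ByteCount,
    // ... the rest of the rack initialization request ...
}

/// Stand-in for a persisted, rack-scoped setting: rack setup would INSERT
/// this value; disk adoption and region allocation would SELECT it.
struct PersistedSettings {
    control_plane_storage_buffer: ByteCount,
}

impl PersistedSettings {
    fn from_handoff(handoff: RackInitHandoff) -> Self {
        PersistedSettings {
            control_plane_storage_buffer: handoff.control_plane_storage_buffer,
        }
    }
}

fn main() {
    let handoff = RackInitHandoff {
        control_plane_storage_buffer: ByteCount::from_gibibytes_u32(250),
    };
    let settings = PersistedSettings::from_handoff(handoff);
    println!("buffer: {} bytes", settings.control_plane_storage_buffer.to_bytes());
}
```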


Collaborator:
The bigger question I have is: does a4x2 not use this file? Or: have you tested that this does the right thing on a4x2?

[background_tasks]
dns_internal.period_secs_config = 60
dns_internal.period_secs_servers = 60
6 changes: 6 additions & 0 deletions smf/nexus/single-sled/config-partial.toml
@@ -24,6 +24,12 @@ if_exists = "append"
# [schema]
# schema_dir = "/var/nexus/schema/crdb"

[tunables]
max_vpc_ipv4_subnet_prefix = 26

# For development environments, choose a zero sized storage buffer
control_plane_storage_buffer_gb = 0

[background_tasks]
dns_internal.period_secs_config = 60
dns_internal.period_secs_servers = 60