-
Notifications
You must be signed in to change notification settings - Fork 738
feat(pageserver): enable reldirv2 by default in regress tests #12758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a5debfe
to
dcf38bc
Compare
Signed-off-by: Alex Chi Z <[email protected]>
dcf38bc
to
f6b1891
Compare
9152 tests run: 8490 passed, 0 failed, 662 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
5713ff3 at 2025-07-30T16:22:16.761Z :recycle: |
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
8d8d989
to
d37cd4b
Compare
Signed-off-by: Alex Chi Z <[email protected]>
302884e
to
00091a4
Compare
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't paged this back in enough to give it a thorough review, but it seems okay at a high level.
@@ -651,6 +651,9 @@ pub struct TenantConfigToml { | |||
// FIXME: Remove skip_serializing_if when the feature is stable. | |||
#[serde(skip_serializing_if = "std::ops::Not::not")] | |||
pub basebackup_cache_enabled: bool, | |||
|
|||
#[serde(skip_serializing_if = "std::ops::Not::not")] | |||
pub rel_size_v1_access_disabled: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can this get a doc comment please?
@@ -1526,12 +1539,15 @@ pub struct OffloadedTimelineInfo { | |||
pub archived_at: chrono::DateTime<chrono::Utc>, | |||
} | |||
|
|||
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)] | |||
/// The state of the rel size migration. This is persisted in the index part. | |||
/// compatibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: stray line?
@@ -1724,6 +1731,8 @@ pub struct RelDirMode { | |||
current_status: RelSizeMigration, | |||
// Whether we should initialize the v2 keyspace or not. | |||
initialize: bool, | |||
// Whether we should disable v1 access starting this LSNor not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: not an LSN
Problem
Part of LKB-197. This patch enables reldirv2 by default.
Using index_part to persist reldirv2 status is not reliable. In the current ingestion path when we initialize the relv2 keyspace, we will immediately set index_part to persist that we have enabled v2. However, if the pageserver is shut down or the tenant gets dropped for whatever reason, the disk persistent LSN might not catch up with the LSN where we enabled the relv2. Therefore, we will end up with a state that we have "v2 enabled" in index_part, but the v2 migration keys are not in the layer files.
This patch mainly addresses this issue by modifying the data path to look at the
migrated_at
LSN to decide whether to scan the v2 keyspace or not. If the request is below themigrated_at
LSN, only v1 is used.How this mechanism interacts with branches are not tested yet and will be in the upcoming patches. For example, if we create a branch below the migrated_at LSN of the current branch, we should reset it to legacy mode. Currently this is solely guarded by the
maybe_enable_rel_size_v2
(see "Revert the status to legacy...") and we should also do this during timeline creation.This patch also adds partial support for v2-only mode. This is governed by
rel_size_v1_access_disabled
flag. The code is here solely to make the tests pass, not 100% correct (for example, when the above situation happens during the switch from v1+v2 validation mode to v2-only mode, it might have some races). We do not plan to switch to v2-only mode in the near term. This needs more tests before we enable.There was a previous attempt #12708 to deal with this index_part <-> reality inconsistency by persisting the migration status in the keyspace. The pros of that approach is that we get very precise LSN where the migration is done and that LSN is consistent with when we do the migration. However, the downside is that it needs to query that key on every request, and it greatly slows down the tests (all pg_regress tests are timing out on ingestion). So in this patch, we take the approach to allow the inconsistency between index_part <-> real data and detect such inconsistency on the write path + in the future branch creation time.
Basically, the way to look into the patch is like:
<= migrated_lsn
instead of<
)timeline_offloading
test case but we will have a separate one in the future.Summary of changes
rel_size_v1_access_disabled
to fully disable v1 and transition the status toMigrated
.collect_keyspace
still needs that.migrated_at
LSN.test_image_consistent_lsn
: this test was added as part of the hadron new features. It has some assumption on the layer size and the relv2 migration process breaks that assumption. So I forced v1 in that test case, and I'll do a full fix later.