Compute release 2025-07-18 07:06 UTC #12653

vipvap · 2025-07-18T07:06:06Z

No description provided.

…nds (#12548) ## Problem We have a `safekeeper_migrate` handler, but no subcommand in `storcon_cli`. Same for `/:timeline_id/locate` for identifying current set of safekeepers. - Closes: #12395 ## Summary of changes - Add `timeline-safekeeper-migrate` and `timeline-locate` subcommands to `storcon_cli`

…12547) This is a no-op for the neon deployment.

* Introduce the concept of an image consistent LSN: the largest LSN below which all pages have been redone successfully
* Use the image consistent LSN for forced image layer creations
* Optionally expose the image consistent LSN via the timeline describe HTTP endpoint
* Add a sharded timeline describe endpoint to storcon

---------

Co-authored-by: Chen Luo <[email protected]>

This log is too noisy. Instead of warning on every retry, let's log only on the final failure.

…12550) Split the functions into two: one internal function to calculate the estimate, and another (two functions) to expose it as SQL functions. This is in preparation of adding new communicator implementation. With that, the SQL functions will dispatch the call to the old or new implementation depending on which is being used.

## Problem

Cancellation requires redis, and redis requires the control plane.

## Summary of changes

Make redis for cancellation not require the control plane. Add instructions for setting up redis locally.

This is a nifty trick from the hadron repo that seems to help with SSH key dance. Signed-off-by: Tristan Partin <[email protected]>

Update the WSS estimate before acquiring the lock, so that we don't need to hold the lock for so long. That seems safe to me, see added comment. I was planning to do this with the new Rust-based communicator implementation anyway, but it might help a little with the current C implementation too. And more importantly, having this as a separate PR gives us a chance to review this aspect independently.

… progress (#12523) Fixes [LKB-61](https://databricks.atlassian.net/browse/LKB-61): `test_timeline_archival_chaos` being flaky with storcon error `Requested tenant is missing`.

When a tenant migration is ongoing, and the attach request has been sent to the new location but the attach hasn't finished yet, it is possible for the pageserver to return a 412 precondition failed HTTP error on timeline deletion, because it is being sent to the new location already. That one we would previously log via something like:

```
ERROR request{method=DELETE path=/v1/tenant/1f544a11c90d1afd7af9b26e48985a4e/timeline/32818fb3ebf07cb7f06805429d7dee38 request_id=c493c04b-7f33-46d2-8a65-aac8a5516055}: Error processing HTTP request: InternalServerError(Error deleting timeline 32818fb3ebf07cb7f06805429d7dee38 on 1f544a11c90d1afd7af9b26e48985a4e on node 2 (localhost): pageserver API: Precondition failed: Requested tenant is missing
```

This patch changes that and makes us return a more reasonable resource unavailable error. Not sure how scalable this is with tenants with a large number of shards, but that's a different discussion (we'd probably need a limited amount of per-storcon retries).

Example [link](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12398/15981821532/index.html#/testresult/e7785dfb1238d92f).

## Problem close LKB-162 close neondatabase/cloud#30665, related to neondatabase/cloud#29434 We see a lot of errors like: ``` 2025-05-22T23:06:14.928959Z ERROR compaction_loop{tenant_id=? shard_id=0304}:run:gc_compact_timeline{timeline_id=?}: error applying 4 WAL records 35/DC0DF0B8..3B/E43188C0 (8119 bytes) to key 000000067F0000400500006027000000B9D0, from base image with LSN 0/0 to reconstruct page image at LSN 61/150B9B20 n_attempts=0: apply_wal_records Caused by: 0: read walredo stdout 1: early eof ``` which is an acceptable form of error and we should downgrade it to warning. ## Summary of changes walredo error during gc-compaction is expected when the data below the gc horizon does not contain a full key history. This is possible in some rare cases of gc that is only able to remove data in the middle of the history but not all earlier history when a full keyspace gets deleted. Signed-off-by: Alex Chi Z <[email protected]>

Even though we're now part of Databricks, let's at least make this part consistent. ## Summary of changes - PG14: neondatabase/postgres#669 - PG15: neondatabase/postgres#670 - PG16: neondatabase/postgres#671 - PG17: neondatabase/postgres#672 --------- Co-authored-by: Arpad Müller <[email protected]>

# TLDR

Problem-I is a bug fix. The rest are no-ops.

## Problem I

Page server checks image layer creation based on the elapsed time, but this check depends on the current logical size, which is only computed on shard 0. Thus, for non-0 shards, the check will be ineffective and image creation will never be done for idle tenants.

## Summary of changes I

This PR fixes the problem by simply removing the dependency on current logical size.

## Summary of changes II

This PR adds a timeout when calling page server to split a shard, to make sure SC does not wait for the API call forever. Currently the PR doesn't add any retry logic because it's not clear whether a page server shard split can be safely retried if the existing operation is still ongoing or left the storage in a bad state. Thus it's better to abort the whole operation and restart.

## Problem III

`test_remote_failures` requires PS to be compiled in the testing mode. For PS in dev/staging, they are compiled without this mode.

## Summary of changes III

Remove the restriction and also increase the number of total failures allowed.

## Summary of changes IV

Remove the test on the PS getpage HTTP route.

---------

Co-authored-by: Chen Luo <[email protected]> Co-authored-by: Yecheng Yang <[email protected]> Co-authored-by: Vlad Lazar <[email protected]>

#12553) ## Problem To store cancellation data we send two commands to redis because the redis server version doesn't support HSET with EX. Also, HSET is not really needed. ## Summary of changes * Replace the HSET + EXPIRE command pair with one SET .. EX command. * Replace HGET with GET. * Leave a workaround for old keys set with HSET. * Replace some anyhow errors with specific errors to surface the WRONGTYPE error from redis.
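As a rough illustration of the change described above, the sketch below builds the argument lists for the old two-command write and the new single-command write. The function names, the key, and the hash field name are hypothetical and not taken from the proxy code; only the Redis command syntax (`HSET`, `EXPIRE`, `SET ... EX`) is standard.

```rust
/// Old scheme (sketch): two commands per write.
/// `field` is a hypothetical hash field name used only for illustration.
pub fn store_cmds_old(key: &str, field: &str, value: &str, ttl_secs: u64) -> Vec<Vec<String>> {
    vec![
        vec![
            "HSET".to_string(),
            key.to_string(),
            field.to_string(),
            value.to_string(),
        ],
        vec!["EXPIRE".to_string(), key.to_string(), ttl_secs.to_string()],
    ]
}

/// New scheme (sketch): a single `SET key value EX seconds` command
/// stores the value and sets the TTL in one round trip.
pub fn store_cmd_new(key: &str, value: &str, ttl_secs: u64) -> Vec<String> {
    vec![
        "SET".to_string(),
        key.to_string(),
        value.to_string(),
        "EX".to_string(),
        ttl_secs.to_string(),
    ]
}
```

Keys written by the old scheme are hashes, so a plain `GET` on them fails with `WRONGTYPE`, which is presumably why the change keeps a workaround for old keys and surfaces that specific error.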

Serialize query row responses directly into JSON. Some of this code should be using the `json::value_as_object/list` macros, but I've avoided it for now to minimize the size of the diff.

…2546) ## Problem

We don't validate the `new_sk_set` before starting the migration. It is validated later, so a migration to an invalid safekeeper set will fail anyway. But at that point we might have already committed an invalid `new_sk_set` to the database, and there is no `abort` command yet (I ran into this issue in neon_local and ruined the timeline :)

- Part of #11669

## Summary of changes

- Add safekeeper count and safekeeper duplication checks before starting the migration
- Test that we validate the `new_sk_set` before starting the migration
- Add `force` option to the `TimelineSafekeeperMigrateRequest` to disable non-mandatory checks
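A minimal sketch of the kind of pre-migration validation described above. The error type, the `min_count` parameter, and the choice of which check `force` bypasses are assumptions made for illustration; the actual storcon code may differ.

```rust
use std::collections::HashSet;

/// Hypothetical error type for this sketch; not taken from the Neon codebase.
#[derive(Debug, PartialEq)]
pub enum SkSetError {
    TooFew { got: usize, min: usize },
    Duplicate(u64),
}

/// Validate a proposed safekeeper set before any migration state is persisted.
/// Assumption: the duplicate check is mandatory, while the count check is the
/// optional one that `force` may skip.
pub fn validate_new_sk_set(
    new_sk_set: &[u64],
    min_count: usize,
    force: bool,
) -> Result<(), SkSetError> {
    // Reject any safekeeper ID that appears more than once.
    let mut seen = HashSet::new();
    for &id in new_sk_set {
        if !seen.insert(id) {
            return Err(SkSetError::Duplicate(id));
        }
    }
    // Reject sets that are too small, unless the caller forces the migration.
    if !force && new_sk_set.len() < min_count {
        return Err(SkSetError::TooFew {
            got: new_sk_set.len(),
            min: min_count,
        });
    }
    Ok(())
}
```

Running the checks before anything is written to the database is the point of the change: an invalid set is rejected while the timeline is still in its original state.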

## Problem When a connection terminates its maintain_cancel_key task keeps running until the CANCEL_KEY_REFRESH sleep finishes and then it triggers another cancel key TTL refresh before exiting. ## Summary of changes * Check for cancellation while sleeping and interrupt sleep. * If cancelled, break the loop, don't send a refresh cmd.
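The interruptible sleep described above can be sketched with the standard library alone. This is a stand-in, not the proxy's implementation (which is presumably async): the channel plays the role of the cancellation signal, and `maintain_key` and `refresh` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Runs a refresh loop until cancelled and returns how many refreshes were sent.
/// `refresh` stands in for the cancel key TTL refresh command.
pub fn maintain_key<F: FnMut()>(cancel: &Receiver<()>, interval: Duration, mut refresh: F) -> u32 {
    let mut sent = 0;
    loop {
        // Sleep for `interval`, but wake up immediately on cancellation.
        match cancel.recv_timeout(interval) {
            // The full interval elapsed without cancellation: refresh the TTL.
            Err(RecvTimeoutError::Timeout) => {
                refresh();
                sent += 1;
            }
            // Cancelled (or the sender was dropped): exit without another refresh.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    sent
}
```

If the cancellation arrives while the loop is waiting, `recv_timeout` returns at once and the loop exits without issuing a final refresh, which is the behavior the summary asks for.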

## Problem When refreshing cancellation data we resend the entire value again just to reset the TTL, which causes unnecessary load in proxy, on network and possibly on redis side. ## Summary of changes * Switch from using SET with full value to using EXPIRE to reset TTL. * Add a tiny delay between retries to prevent busy loop. * Shorten CancelKeyOp variants: drop redundant suffix. * Retry SET when EXPIRE failed.
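A small sketch of the refresh strategy described above, using an in-memory stand-in instead of a real Redis client. `FakeRedis`, `refresh_ttl`, and the counters are hypothetical; the only Redis behavior relied on is that `EXPIRE` reports whether the key existed.

```rust
use std::collections::HashMap;

/// Minimal in-memory stand-in for the two Redis commands used here.
#[derive(Default)]
pub struct FakeRedis {
    // key -> (value, TTL in seconds)
    entries: HashMap<String, (String, u64)>,
    /// Number of SET commands issued (full value sent).
    pub sets: u32,
    /// Number of EXPIRE commands issued (no value sent).
    pub expires: u32,
}

impl FakeRedis {
    /// Models `SET key value EX ttl`: stores the value and sets the TTL.
    pub fn set_ex(&mut self, key: &str, value: &str, ttl_secs: u64) {
        self.sets += 1;
        self.entries
            .insert(key.to_string(), (value.to_string(), ttl_secs));
    }

    /// Models `EXPIRE key ttl`: returns false if the key does not exist.
    pub fn expire(&mut self, key: &str, ttl_secs: u64) -> bool {
        self.expires += 1;
        match self.entries.get_mut(key) {
            Some(entry) => {
                entry.1 = ttl_secs;
                true
            }
            None => false,
        }
    }

    /// Current TTL of a key, if it exists.
    pub fn ttl(&self, key: &str) -> Option<u64> {
        self.entries.get(key).map(|entry| entry.1)
    }
}

/// Refresh the TTL cheaply; fall back to a full SET only if EXPIRE fails.
pub fn refresh_ttl(redis: &mut FakeRedis, key: &str, value: &str, ttl_secs: u64) {
    if !redis.expire(key, ttl_secs) {
        redis.set_ex(key, value, ttl_secs);
    }
}
```

In the common case only the key name and TTL cross the network; the full value is resent only when the key has gone missing. A real client would also have to treat connection errors as an `EXPIRE` failure, which this sketch does not model.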

## Problem For the communicator scheduling policy, we need to understand the server-side cost of idle gRPC streams. Touches #11735. ## Summary of changes Add an `idle-streams` benchmark to `pagebench` which opens a large number of idle gRPC GetPage streams.

## Problem gRPC client retries currently include pool acquisition under the per-attempt timeout. If pool acquisition is slow (e.g. full pool), this will cause spurious timeout warnings, and the caller will lose its place in the pool queue. Touches #11735. ## Summary of changes Makes several improvements to retries and related logic: * Don't include pool acquisition time under request timeouts. * Move attempt timeouts out of `Retry` and into the closure. * Make `Retry` configurable, move constants into main module. * Don't backoff on the first retry, and reduce initial/max backoffs to 5ms and 5s respectively. * Add `with_retries` and `with_timeout` helpers. * Add slow logging for pool acquisition, and a `warn_slow` counterpart to `log_slow`. * Add debug logging for requests and responses at the client boundary.
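The backoff schedule described above (no delay on the first retry, 5ms initial, 5s maximum) could look roughly like the following. The doubling factor and the names `backoff`, `INITIAL_BACKOFF`, and `MAX_BACKOFF` are assumptions of this sketch; the summary only states the initial and maximum values.

```rust
use std::time::Duration;

pub const INITIAL_BACKOFF: Duration = Duration::from_millis(5);
pub const MAX_BACKOFF: Duration = Duration::from_secs(5);

/// Delay to wait before retry number `retry` (1 = first retry).
/// Assumption: the delay doubles on each subsequent retry.
pub fn backoff(retry: u32) -> Duration {
    // The first retry is immediate (as is retry 0, the initial attempt).
    if retry <= 1 {
        return Duration::ZERO;
    }
    // The second retry waits INITIAL_BACKOFF, then the delay doubles per
    // retry. The exponent is clamped so the shift cannot overflow.
    let exp = (retry - 2).min(20);
    let delay = INITIAL_BACKOFF.saturating_mul(1u32 << exp);
    delay.min(MAX_BACKOFF)
}
```

Under these assumptions the delays run 0, 5ms, 10ms, 20ms and so on, reaching the 5s cap at the twelfth retry. Pool acquisition would happen outside whatever timeout wraps each attempt, so a slow pool does not consume the request's time budget.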

## Problem It can take 3x the idle timeout to reap a channel. We have to wait for the idle timeout to trigger first for the stream, then the client, then the channel. Touches #11735. ## Summary of changes Reap empty channels immediately, and rely indirectly on the channel/stream timeouts. This can still lead to 2x the idle timeout for streams (first stream then client), but that's okay -- if the stream closes abruptly (e.g. due to timeout or error) we want to keep the client around in the pool for a while.

## Problem The new communicator gRPC client has significantly worse Pagebench performance than a basic gRPC client. We need to find out why. ## Summary of changes Add a `pagebench --profile` flag which takes a client CPU profile of the benchmark and writes a flamegraph to `profile.svg`.

## Problem The communicator gRPC client currently attempts to pipeline GetPage requests from multiple callers onto the same gRPC stream. This has a number of issues: * Head-of-line blocking: the request may block on e.g. layer download or LSN wait, delaying the next request. * Cancellation: we can't easily cancel in-progress requests (e.g. due to timeout or backend termination), so it may keep blocking the next request (even its own retry). * Complex stream scheduling: picking a stream becomes harder/slower, and additional Tokio tasks and synchronization is needed for stream management. Touches #11735. Requires #12579. ## Summary of changes This patch removes pipelining of gRPC stream requests, and instead prefers to scale out the number of streams to achieve the same throughput. Stream scheduling has been rewritten, and mostly follows the same pattern as the client pool with exclusive acquisition by a single caller. [Benchmarks](#12583) show that the cost of an idle server-side GetPage worker task is about 26 KB (2.5 GB for 100,000), so we can afford to scale out. This has a number of advantages: * It (mostly) eliminates head-of-line blocking (except at the TCP level). * Cancellation becomes trivial, by closing the stream. * Stream scheduling becomes significantly simpler and cheaper. * Individual callers can still use client-side batching for pipelining.
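The memory figure quoted above can be checked with a few lines of arithmetic. This sketch treats the benchmark's "26 KB" as 26 KiB, which is an assumption; the function names are hypothetical.

```rust
/// Approximate cost of one idle server-side GetPage worker task, taken from
/// the benchmark figure quoted above and interpreted as 26 KiB.
pub const IDLE_STREAM_BYTES: u64 = 26 * 1024;

/// Total memory used by `streams` idle streams, in bytes.
pub fn idle_streams_bytes(streams: u64) -> u64 {
    streams * IDLE_STREAM_BYTES
}

/// Convert bytes to GiB for easier reading.
pub fn bytes_to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}
```

For 100,000 idle streams this gives 2,662,400,000 bytes, or about 2.48 GiB, which matches the roughly 2.5 GB stated in the text. For 4,000 streams the same estimate comes to roughly 100 MiB.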

## Problem We don't log the timeline ID when rolling ephemeral layers during housekeeping. Resolves [LKB-179](https://databricks.atlassian.net/browse/LKB-179) ## Summary of changes Add a span with timeline ID when calling `maybe_freeze_ephemeral_layer` from the housekeeping loop. We don't instrument the function itself, since future callers may not have a span including the tenant_id already, but we don't want to duplicate the tenant_id for these spans.

## Problem The communicator gRPC client currently uses bounded client/stream pools. This can artificially constrain clients, especially after we remove pipelining in #12584. [Benchmarks](#12583) show that the cost of an idle server-side GetPage worker task is about 26 KB (2.5 GB for 100,000), so we can afford to scale out. In the worst case, we'll degenerate to the current libpq state with one stream per backend, but without the TCP connection overhead. In the common case we expect significantly lower stream counts due to stream sharing, driven e.g. by idle backends, LFC hits, read coalescing, sharding (backends typically only talk to one shard at a time), etc. Currently, Pageservers rarely serve more than 4000 backend connections, so we have at least 2 orders of magnitude of headroom. Touches #11735. Requires #12584. ## Summary of changes Remove the pool limits, and restructure the pools. We still keep a separate bulk pool for Getpage batches of >4 pages (>32 KB), with fewer streams per connection. This reduces TCP-level congestion and head-of-line blocking for non-bulk requests, and concentrates larger window sizes on a smaller set of streams/connections, presumably reducing memory usage. Apart from this, bulk requests don't have any latency penalty compared to other requests.

…enting new code. (#12567) ## Problem In #12513, the new code was implemented to retry 404 errors caused by the replication lag. However, this implemented the new logic, making the script more complicated, while we have an existing one in `neon_api.py`. ## Summary of changes The existing mechanism is used to retry 404 errors. --------- Co-authored-by: Alexey Masterov <[email protected]>

neondatabase/cloud#19011 Measure relative performance for prewarmed and non-prewarmed endpoints. Add test that runs on every commit, and one performance test with a remote cluster.

# TLDR

This PR is a no-op.

## Problem

When a SK loses a disk, it must recover all WALs from the very beginning. This may take days/weeks to catch up to the latest WALs for all timelines it owns.

## Summary of changes

When SK starts up, if it finds that it has 0 timelines,
- it will ask SC for the timelines it owns.
- Then, it pulls the timelines from its peer safekeepers to restore the WAL redundancy right away.

After pulling the timelines is complete, it will become active and accept new WALs.

The current impl is a prototype. We can optimize the impl further, e.g., pull timelines in parallel.

---------

Co-authored-by: Haoyu Huang <[email protected]>

Include records and image in the debug get page handler. This endpoint does not update the metrics and does not support tracing. Note that this now returns individual bytes which need to be encoded properly for debugging. Co-authored-by: Haoyu Huang <[email protected]>

We didn't consistently apply these, and it wasn't consistently solved. With this patch we should have a more consistent approach to this, and have fewer issues porting changes to newer versions. This also removes some potentially buggy casts to `long` from `uint64`: they could have truncated the value on systems where `long` only has 32 bits.

Tests on https://github.com/neondatabase/neon/actions/runs/16268609007/job/45930162686 time out due to pgbench init job taking more than 30 minutes to run. Increase test timeout duration to 2 hours.

All Errors that can occur during get_installed_extensions() come from tokio-postgres functions, e.g. if the database is being shut down ("FATAL: terminating connection due to administrator command"). I'm seeing a lot of such errors in the logs with the regression tests, with very verbose stack traces. The compute_ctl stack trace is pretty useless for errors originating from the Postgres connection, the error message has all the information, so stop printing the stack trace. I changed the result type of the functions to return the originating tokio_postgres Error rather than anyhow::Error, so that if we introduce other error sources to the functions where the stack trace might be useful, we'll be forced to revisit this, probably by introducing a new Error type that separates postgres errors from other errors. But this will do for now.

## Problem ## Summary of changes

## Problem Close LKB-270. This is part of our series of efforts to make sure lsn_lease API prompts clients to retry. Follow up of #12631. Slack thread w/ Vlad: https://databricks.slack.com/archives/C09254R641L/p1752677940697529 ## Summary of changes - Use `tenant_remote_mutation` API for LSN leases. Makes it consistent with new APIs added to storcon. - For 404, we now always retry because we know the tenant is to-be-attached and will eventually reach a point that we can find that tenant on the intent pageserver. - Using the `tenant_remote_mutation` API also prevents us from the case where the intent pageserver changes within the lease request. The wrapper function will error with 503 if such things happen. --------- Signed-off-by: Alex Chi Z <[email protected]>

github-actions · 2025-07-18T07:06:21Z

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.

github-actions · 2025-07-18T08:07:41Z

8050 tests run: 7402 passed, 0 failed, 648 skipped (full report)

Flaky tests (3)

Postgres 16

test_ps_unavailable_after_delete[DeletionAPIKind.FORCE]: release-x86-64-with-lfc, release-arm64-with-lfc

Postgres 14

test_ps_unavailable_after_delete[DeletionAPIKind.FORCE]: release-arm64-with-lfc

Code coverage* (full report)

functions: 35.0% (8793 of 25102 functions)
lines: 46.3% (70960 of 153161 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
9ea6db2 at 2025-07-18T08:07:40.306Z

DimasKovas and others added 30 commits July 11, 2025 10:49

pageserver: log only on final shard resolution failure (#12565)

154f6dc

This log is too noisy. Instead of warning on every retry, let's log only on the final failure.

[proxy] allow testing query cancellation locally (#12568)

f424540

## Problem

Cancellation requires redis, and redis requires the control plane.

## Summary of changes

Make redis for cancellation not require the control plane. Add instructions for setting up redis locally.

Use relative paths in submodule URL references (#12559)

a0a7733

This is a nifty trick from the hadron repo that seems to help with SSH key dance. Signed-off-by: Tristan Partin <[email protected]>

proxy: encode json as we parse rows (#11992)

9bba31b

Serialize query row responses directly into JSON. Some of this code should be using the `json::value_as_object/list` macros, but I've avoided it for now to minimize the size of the diff.

Performance test for LFC prewarm (#12524)

2288efa

neondatabase/cloud#19011 Measure relative performance for prewarmed and non-prewarmed endpoints. Add test that runs on every commit, and one performance test with a remote cluster.

LFC prewarm perftest: increase timeout for initialization job (#12594)

a456e81

Tests on https://github.com/neondatabase/neon/actions/runs/16268609007/job/45930162686 time out due to pgbench init job taking more than 30 minutes to run. Increase test timeout duration to 2 hours.

HaoyuHuang and others added 3 commits July 17, 2025 23:17

A few more SC changes (#12649)

8f627ea

## Problem ## Summary of changes

Compute release 2025-07-18 07:06 UTC

9ea6db2

vipvap requested review from a team as code owners July 18, 2025 07:06

vipvap requested review from hlinnaka, MMeent, skyzh, NanoBjorn, mtyazici, conradludgate and bayandin and removed request for a team July 18, 2025 07:06

MMeent approved these changes Jul 18, 2025

View reviewed changes

knizhnik added this pull request to the merge queue Jul 18, 2025

vipvap added fast-forward Trigger a release PR `--ff-only` merge and removed fast-forward Trigger a release PR `--ff-only` merge labels Jul 18, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 18, 2025

vipvap merged commit 9ea6db2 into release-compute Jul 18, 2025
116 checks passed

vipvap deleted the rc/release-compute/2025-07-18T07-06Z branch July 18, 2025 09:08
