# Compute release 2025-07-18 07:06 UTC #12653
Merged: vipvap merged 69 commits into `release-compute` from `rc/release-compute/2025-07-18T07-06Z` on Jul 18, 2025.
## Conversation
…nds (#12548)

## Problem

We have a `safekeeper_migrate` handler, but no corresponding subcommand in `storcon_cli`. The same goes for `/:timeline_id/locate`, which identifies the current set of safekeepers.

- Closes: #12395

## Summary of changes

- Add `timeline-safekeeper-migrate` and `timeline-locate` subcommands to `storcon_cli`
…12547)

This is a no-op for the neon deployment.

* Introduce the concept of the image consistent LSN: the largest LSN below which all pages have been redone successfully
* Use the image consistent LSN for forced image layer creations
* Optionally expose the image consistent LSN via the timeline describe HTTP endpoint
* Add a sharded timeline describe endpoint to storcon

Co-authored-by: Chen Luo <[email protected]>
This log is too noisy. Instead of warning on every retry, let's log only on the final failure.
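A minimal sketch of the logging change, assuming a simple retry loop; `do_request`, the attempt count, and the backoff are hypothetical stand-ins for the real code:

```rust
// Sketch only: warn once after the final failed attempt instead of on every retry.
async fn call_with_retries() -> anyhow::Result<()> {
    const MAX_ATTEMPTS: usize = 5;
    let mut last_err = None;
    for attempt in 1..=MAX_ATTEMPTS {
        match do_request().await {
            Ok(()) => return Ok(()),
            Err(e) => {
                // No per-attempt warning here; that was the noisy part.
                last_err = Some(e);
                tokio::time::sleep(std::time::Duration::from_millis(100 * attempt as u64)).await;
            }
        }
    }
    let err = last_err.unwrap();
    tracing::warn!("request failed after {MAX_ATTEMPTS} attempts: {err:#}");
    Err(err)
}

// Hypothetical placeholder for the real fallible operation.
async fn do_request() -> anyhow::Result<()> {
    Ok(())
}
```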
…12550)

Split the functions in two: one internal function that calculates the estimate, and two wrappers that expose it as SQL functions. This is in preparation for adding the new communicator implementation; with that, the SQL functions will dispatch the call to the old or new implementation depending on which is in use.
## Problem

Cancellation requires Redis, and Redis required the control plane.

## Summary of changes

Make Redis for cancellation not require the control plane. Add instructions for setting up Redis locally.
This is a nifty trick from the hadron repo that seems to help with the SSH key dance.

Signed-off-by: Tristan Partin <[email protected]>
Update the WSS estimate before acquiring the lock, so that we don't need to hold the lock for so long. That seems safe to me; see the added comment. I was planning to do this with the new Rust-based communicator implementation anyway, but it might help a little with the current C implementation too. More importantly, having this as a separate PR gives us a chance to review this aspect independently.
… progress (#12523)

Fixes [LKB-61](https://databricks.atlassian.net/browse/LKB-61): `test_timeline_archival_chaos` being flaky with storcon error `Requested tenant is missing`.

When a tenant migration is ongoing and the attach request has been sent to the new location, but the attach hasn't finished yet, the pageserver may return a 412 Precondition Failed HTTP error on timeline deletion, because the deletion is already being routed to the new location. Previously we would log that via something like:

```
ERROR request{method=DELETE path=/v1/tenant/1f544a11c90d1afd7af9b26e48985a4e/timeline/32818fb3ebf07cb7f06805429d7dee38 request_id=c493c04b-7f33-46d2-8a65-aac8a5516055}: Error processing HTTP request: InternalServerError(Error deleting timeline 32818fb3ebf07cb7f06805429d7dee38 on 1f544a11c90d1afd7af9b26e48985a4e on node 2 (localhost): pageserver API: Precondition failed: Requested tenant is missing
```

This patch changes that and makes us return a more reasonable resource unavailable error. It is not clear how scalable this is for tenants with a large number of shards, but that's a different discussion (we'd probably need a limited number of per-storcon retries).

Example [link](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12398/15981821532/index.html#/testresult/e7785dfb1238d92f).
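A hypothetical sketch of the error mapping described above; the `PageserverError` and `ApiError` types here are illustrative placeholders, not the real storcon types:

```rust
// Illustrative only: translate a 412 from the pageserver (tenant not attached
// there yet because a migration is in flight) into a retryable "resource
// unavailable" error instead of a 500.
enum PageserverError {
    PreconditionFailed(String),
    Other(String),
}

enum ApiError {
    ResourceUnavailable(String),
    InternalServerError(String),
}

fn map_timeline_delete_error(err: PageserverError) -> ApiError {
    match err {
        PageserverError::PreconditionFailed(msg) => {
            // Migration in progress: the client should retry shortly.
            ApiError::ResourceUnavailable(format!("tenant is being migrated: {msg}"))
        }
        PageserverError::Other(msg) => ApiError::InternalServerError(msg),
    }
}
```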
## Problem

close LKB-162
close neondatabase/cloud#30665, related to neondatabase/cloud#29434

We see a lot of errors like:

```
2025-05-22T23:06:14.928959Z ERROR compaction_loop{tenant_id=? shard_id=0304}:run:gc_compact_timeline{timeline_id=?}: error applying 4 WAL records 35/DC0DF0B8..3B/E43188C0 (8119 bytes) to key 000000067F0000400500006027000000B9D0, from base image with LSN 0/0 to reconstruct page image at LSN 61/150B9B20 n_attempts=0: apply_wal_records

Caused by:
    0: read walredo stdout
    1: early eof
```

This is an acceptable form of error and we should downgrade it to a warning.

## Summary of changes

A walredo error during gc-compaction is expected when the data below the GC horizon does not contain a full key history. This can happen in some rare cases where GC is only able to remove data in the middle of the history, but not all earlier history, when a full keyspace gets deleted.

Signed-off-by: Alex Chi Z <[email protected]>
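A minimal sketch of the log-level downgrade, assuming a helper where the gc-compaction context is already known; names are illustrative:

```rust
// Sketch only: treat walredo failures during gc-compaction as expected
// (partial key history below the GC horizon) and log them at warn level.
fn log_walredo_failure(err: &anyhow::Error, during_gc_compaction: bool) {
    if during_gc_compaction {
        tracing::warn!("walredo failed during gc-compaction (expected with partial key history): {err:#}");
    } else {
        tracing::error!("walredo failed: {err:#}");
    }
}
```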
Even though we're now part of Databricks, let's at least make this part consistent.

## Summary of changes

- PG14: neondatabase/postgres#669
- PG15: neondatabase/postgres#670
- PG16: neondatabase/postgres#671
- PG17: neondatabase/postgres#672

Co-authored-by: Arpad Müller <[email protected]>
# TLDR

Problem I is a bug fix. The rest are no-ops.

## Problem I

The pageserver checks image layer creation based on elapsed time, but this check depends on the current logical size, which is only computed on shard 0. Thus, for non-zero shards, the check is ineffective and image creation is never done for idle tenants.

## Summary of changes I

This PR fixes the problem by simply removing the dependency on the current logical size.

## Summary of changes II

This PR adds a timeout when calling the pageserver to split a shard, to make sure the storage controller does not wait for the API call forever. The PR doesn't add any retry logic yet, because it's not clear whether a pageserver shard split can be safely retried if the existing operation is still ongoing or has left the storage in a bad state. It's better to abort the whole operation and restart.

## Problem III

`test_remote_failures` requires the pageserver to be compiled in testing mode. Pageservers in dev/staging are compiled without it.

## Summary of changes III

Remove the restriction and increase the number of total failures allowed.

## Summary of changes IV

Remove the test on the pageserver getpage HTTP route.

Co-authored-by: Chen Luo <[email protected]>
Co-authored-by: Yecheng Yang <[email protected]>
Co-authored-by: Vlad Lazar <[email protected]>
#12553)

## Problem

To store cancellation data we send two commands to Redis, because the Redis server version doesn't support HSET with EX. HSET is also not really needed.

## Summary of changes

* Replace the HSET + EXPIRE command pair with a single SET .. EX command.
* Replace HGET with GET.
* Leave a workaround for old keys set with HSET.
* Replace some anyhow errors with specific errors to surface the WRONGTYPE error from Redis.
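A sketch of the single-command write, assuming the async `redis` crate; the key, payload, and TTL names are illustrative:

```rust
// Sketch only: store the cancellation payload and its TTL in one round trip
// with `SET key value EX ttl`, instead of HSET followed by EXPIRE.
async fn store_cancel_key(
    conn: &mut redis::aio::MultiplexedConnection,
    key: &str,
    payload: &str,
    ttl_secs: u64,
) -> redis::RedisResult<()> {
    let _: () = redis::cmd("SET")
        .arg(key)
        .arg(payload)
        .arg("EX")
        .arg(ttl_secs)
        .query_async(conn)
        .await?;
    Ok(())
}
```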
Serialize query row responses directly into JSON. Some of this code should be using the `json::value_as_object/list` macros, but I've avoided it for now to minimize the size of the diff.
…2546)

## Problem

We don't validate the `new_sk_set` before starting the migration. It is validated later, so a migration to an invalid safekeeper set will fail anyway, but at that point we might have already committed an invalid `new_sk_set` to the database, and there is no `abort` command yet (I ran into this issue in neon_local and ruined the timeline :)

- Part of #11669

## Summary of changes

- Add safekeeper count and safekeeper duplication checks before starting the migration
- Test that we validate the `new_sk_set` before starting the migration
- Add a `force` option to the `TimelineSafekeeperMigrateRequest` to disable non-mandatory checks
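An illustrative pre-flight check along the lines described above; the real validation lives in the storage controller, and `NodeId`, the error type, and the expected count are placeholders:

```rust
use std::collections::HashSet;

type NodeId = u64; // placeholder

// Sketch only: reject a new safekeeper set with the wrong member count or
// duplicate members before committing anything to the database.
fn validate_new_sk_set(new_sk_set: &[NodeId], expected_count: usize) -> Result<(), String> {
    if new_sk_set.len() != expected_count {
        return Err(format!(
            "new safekeeper set has {} members, expected {}",
            new_sk_set.len(),
            expected_count
        ));
    }
    let unique: HashSet<_> = new_sk_set.iter().collect();
    if unique.len() != new_sk_set.len() {
        return Err("new safekeeper set contains duplicate node ids".to_string());
    }
    Ok(())
}
```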
## Problem

When a connection terminates, its maintain_cancel_key task keeps running until the CANCEL_KEY_REFRESH sleep finishes, and then it triggers another cancel key TTL refresh before exiting.

## Summary of changes

* Check for cancellation while sleeping and interrupt the sleep.
* If cancelled, break the loop and don't send a refresh command.
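A minimal sketch of the interruptible sleep, assuming tokio plus a `CancellationToken` from `tokio-util`; the loop body and names are illustrative:

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

// Sketch only: wake either when the refresh interval elapses or when the
// connection's cancellation token fires; in the latter case exit without
// sending one last refresh.
async fn maintain_cancel_key(cancel: CancellationToken, refresh_interval: Duration) {
    loop {
        tokio::select! {
            _ = tokio::time::sleep(refresh_interval) => {
                // Refresh the cancel key TTL here.
            }
            _ = cancel.cancelled() => {
                // Connection is gone: stop without a trailing refresh.
                break;
            }
        }
    }
}
```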
## Problem

When refreshing cancellation data, we resend the entire value just to reset the TTL, which causes unnecessary load on the proxy, on the network, and possibly on the Redis side.

## Summary of changes

* Switch from SET with the full value to EXPIRE to reset the TTL.
* Add a tiny delay between retries to prevent a busy loop.
* Shorten CancelKeyOp variants: drop the redundant suffix.
* Retry SET when EXPIRE fails.
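A sketch of the refresh path under the same assumptions as the SET example above (async `redis` crate, illustrative names); the fallback to a full SET when EXPIRE reports a missing key is omitted here:

```rust
use std::time::Duration;

// Sketch only: reset the TTL with EXPIRE instead of re-sending the full value,
// and pause briefly between attempts to avoid a busy loop.
async fn refresh_cancel_key_ttl(
    conn: &mut redis::aio::MultiplexedConnection,
    key: &str,
    ttl_secs: u64,
) -> redis::RedisResult<()> {
    const MAX_ATTEMPTS: usize = 3;
    let mut last_err = None;
    for _ in 0..MAX_ATTEMPTS {
        let res: redis::RedisResult<i64> = redis::cmd("EXPIRE")
            .arg(key)
            .arg(ttl_secs)
            .query_async(conn)
            .await;
        match res {
            // EXPIRE returns 1 if the TTL was reset, 0 if the key is gone
            // (the PR then retries with a full SET).
            Ok(_) => return Ok(()),
            Err(e) => {
                last_err = Some(e);
                // Tiny delay between retries to prevent a busy loop.
                tokio::time::sleep(Duration::from_millis(50)).await;
            }
        }
    }
    Err(last_err.unwrap())
}
```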
## Problem

For the communicator scheduling policy, we need to understand the server-side cost of idle gRPC streams. Touches #11735.

## Summary of changes

Add an `idle-streams` benchmark to `pagebench` which opens a large number of idle gRPC GetPage streams.
## Problem

gRPC client retries currently include pool acquisition under the per-attempt timeout. If pool acquisition is slow (e.g. with a full pool), this causes spurious timeout warnings, and the caller loses its place in the pool queue. Touches #11735.

## Summary of changes

Makes several improvements to retries and related logic:

* Don't include pool acquisition time under request timeouts.
* Move attempt timeouts out of `Retry` and into the closure.
* Make `Retry` configurable, and move constants into the main module.
* Don't back off on the first retry, and reduce the initial/max backoffs to 5ms and 5s respectively.
* Add `with_retries` and `with_timeout` helpers.
* Add slow logging for pool acquisition, and a `warn_slow` counterpart to `log_slow`.
* Add debug logging for requests and responses at the client boundary.
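A sketch of the restructured attempt loop; `ClientPool`, `Client`, the timeout, and the attempt count below are hypothetical stand-ins for the real communicator types:

```rust
use std::time::Duration;

// Hypothetical stand-ins for the real pool and client types.
struct ClientPool;
struct Client;

impl ClientPool {
    async fn acquire(&self) -> Client {
        Client
    }
}

impl Client {
    async fn get_page(&mut self) -> anyhow::Result<Vec<u8>> {
        Ok(Vec::new())
    }
}

async fn get_page_with_retries(pool: &ClientPool) -> anyhow::Result<Vec<u8>> {
    const REQUEST_TIMEOUT: Duration = Duration::from_secs(10);
    const MAX_ATTEMPTS: usize = 3;

    let mut last_err = None;
    for _ in 0..MAX_ATTEMPTS {
        // Pool acquisition is *not* under the per-attempt timeout, so a full
        // pool can't trigger spurious timeout warnings or make the caller
        // lose its place in the queue.
        let mut client = pool.acquire().await;

        // Only the request itself is wrapped in the timeout.
        match tokio::time::timeout(REQUEST_TIMEOUT, client.get_page()).await {
            Ok(Ok(page)) => return Ok(page),
            Ok(Err(e)) => last_err = Some(e),
            Err(_) => last_err = Some(anyhow::anyhow!("request timed out")),
        }
    }
    Err(last_err.unwrap())
}
```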
## Problem

It can take 3x the idle timeout to reap a channel: we have to wait for the idle timeout to trigger first for the stream, then the client, then the channel. Touches #11735.

## Summary of changes

Reap empty channels immediately, and rely indirectly on the channel/stream timeouts. This can still lead to 2x the idle timeout for streams (first the stream, then the client), but that's okay -- if the stream closes abruptly (e.g. due to timeout or error), we want to keep the client around in the pool for a while.
## Problem

The new communicator gRPC client has significantly worse Pagebench performance than a basic gRPC client. We need to find out why.

## Summary of changes

Add a `pagebench --profile` flag which takes a client CPU profile of the benchmark and writes a flamegraph to `profile.svg`.
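One plausible way to implement such a flag with the `pprof` crate (built with its `flamegraph` feature); this is a guess at the approach, not necessarily what pagebench does:

```rust
// Sketch only: sample the process while the benchmark runs, then write a
// flamegraph of the collected stacks to profile.svg.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sample at 100 Hz for the duration of the benchmark.
    let guard = pprof::ProfilerGuard::new(100)?;

    run_benchmark();

    let report = guard.report().build()?;
    let file = std::fs::File::create("profile.svg")?;
    report.flamegraph(file)?;
    Ok(())
}

// Placeholder for the actual benchmark workload.
fn run_benchmark() {
    for i in 0..1_000_000u64 {
        std::hint::black_box(i);
    }
}
```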
## Problem

The communicator gRPC client currently attempts to pipeline GetPage requests from multiple callers onto the same gRPC stream. This has a number of issues:

* Head-of-line blocking: a request may block on e.g. a layer download or an LSN wait, delaying the next request.
* Cancellation: we can't easily cancel in-progress requests (e.g. due to timeout or backend termination), so a request may keep blocking the next one (even its own retry).
* Complex stream scheduling: picking a stream becomes harder/slower, and additional Tokio tasks and synchronization are needed for stream management.

Touches #11735. Requires #12579.

## Summary of changes

This patch removes pipelining of gRPC stream requests, and instead prefers to scale out the number of streams to achieve the same throughput. Stream scheduling has been rewritten, and mostly follows the same pattern as the client pool, with exclusive acquisition by a single caller. [Benchmarks](#12583) show that the cost of an idle server-side GetPage worker task is about 26 KB (2.5 GB for 100,000), so we can afford to scale out.

This has a number of advantages:

* It (mostly) eliminates head-of-line blocking (except at the TCP level).
* Cancellation becomes trivial, by closing the stream.
* Stream scheduling becomes significantly simpler and cheaper.
* Individual callers can still use client-side batching for pipelining.
## Problem

We don't log the timeline ID when rolling ephemeral layers during housekeeping. Resolves [LKB-179](https://databricks.atlassian.net/browse/LKB-179).

## Summary of changes

Add a span with the timeline ID when calling `maybe_freeze_ephemeral_layer` from the housekeeping loop. We don't instrument the function itself, since future callers may not already have a span that includes the tenant_id, but we don't want to duplicate the tenant_id in these spans.
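A minimal sketch of attaching the span at the call site rather than instrumenting the function; the surrounding names are illustrative:

```rust
use tracing::Instrument;

// Sketch only: the housekeeping loop wraps the call in a span carrying the
// timeline ID, so log lines emitted inside it inherit that field.
async fn housekeeping_iteration(timeline_id: &str) {
    maybe_freeze_ephemeral_layer()
        .instrument(tracing::info_span!("maybe_freeze_ephemeral_layer", %timeline_id))
        .await;
}

// Placeholder for the real layer-rolling logic.
async fn maybe_freeze_ephemeral_layer() {}
```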
## Problem

The communicator gRPC client currently uses bounded client/stream pools. This can artificially constrain clients, especially after we remove pipelining in #12584.

[Benchmarks](#12583) show that the cost of an idle server-side GetPage worker task is about 26 KB (2.5 GB for 100,000), so we can afford to scale out. In the worst case, we degenerate to the current libpq state with one stream per backend, but without the TCP connection overhead. In the common case we expect significantly lower stream counts due to stream sharing, driven e.g. by idle backends, LFC hits, read coalescing, and sharding (backends typically only talk to one shard at a time). Currently, pageservers rarely serve more than 4000 backend connections, so we have at least two orders of magnitude of headroom.

Touches #11735. Requires #12584.

## Summary of changes

Remove the pool limits, and restructure the pools. We still keep a separate bulk pool for GetPage batches of more than 4 pages (>32 KB), with fewer streams per connection. This reduces TCP-level congestion and head-of-line blocking for non-bulk requests, and concentrates larger window sizes on a smaller set of streams/connections, presumably reducing memory usage. Apart from this, bulk requests don't have any latency penalty compared to other requests.
…enting new code. (#12567)

## Problem

In #12513, new code was implemented to retry 404 errors caused by replication lag. However, this added new logic and made the script more complicated, while we already have an existing mechanism in `neon_api.py`.

## Summary of changes

Use the existing mechanism to retry 404 errors.

Co-authored-by: Alexey Masterov <[email protected]>
neondatabase/cloud#19011

Measure relative performance for prewarmed and non-prewarmed endpoints. Add a test that runs on every commit, and one performance test against a remote cluster.
# TLDR

This PR is a no-op.

## Problem

When a safekeeper loses a disk, it must recover all WALs from the very beginning. This may take days or weeks to catch up to the latest WALs for all timelines it owns.

## Summary of changes

When a safekeeper starts up and finds that it has 0 timelines:

- it asks the storage controller for the timelines it owns;
- it then pulls those timelines from its peer safekeepers to restore WAL redundancy right away.

After pulling timelines is complete, the safekeeper becomes active and accepts new WALs. The current implementation is a prototype; we can optimize it further, e.g. with parallel pull-timelines.

Co-authored-by: Haoyu Huang <[email protected]>
Include records and image in the debug get page handler. This endpoint does not update the metrics and does not support tracing. Note that this now returns individual bytes which need to be encoded properly for debugging. Co-authored-by: Haoyu Huang <[email protected]>
We didn't consistently apply these, and it wasn't consistently solved. With this patch we should have a more consistent approach, and fewer issues porting changes to newer versions. This also removes some potentially buggy casts from `uint64` to `long`: they could have truncated the value on systems where `long` is only 32 bits.
Tests on https://github.com/neondatabase/neon/actions/runs/16268609007/job/45930162686 time out because the pgbench init job takes more than 30 minutes to run. Increase the test timeout to 2 hours.
All errors that can occur during get_installed_extensions() come from tokio-postgres functions, e.g. if the database is being shut down ("FATAL: terminating connection due to administrator command"). I'm seeing a lot of such errors in the logs with the regression tests, with very verbose stack traces. The compute_ctl stack trace is pretty useless for errors originating from the Postgres connection; the error message has all the information, so stop printing the stack trace.

I changed the result type of the functions to return the originating tokio_postgres Error rather than anyhow::Error, so that if we introduce other error sources where the stack trace might be useful, we'll be forced to revisit this, probably by introducing a new Error type that separates Postgres errors from other errors. But this will do for now.
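A sketch of what the changed signature might look like; the query and function shape are illustrative, not the actual compute_ctl code:

```rust
// Sketch only: surface the originating tokio_postgres::Error so callers can
// log just its message, without an anyhow stack trace.
async fn get_installed_extensions(
    client: &tokio_postgres::Client,
) -> Result<Vec<String>, tokio_postgres::Error> {
    let rows = client
        .query("SELECT extname FROM pg_extension", &[])
        .await?;
    Ok(rows.iter().map(|row| row.get::<_, String>(0)).collect())
}
```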
## Problem

Close LKB-270. This is part of our series of efforts to make sure the lsn_lease API prompts clients to retry. Follow-up of #12631. Slack thread w/ Vlad: https://databricks.slack.com/archives/C09254R641L/p1752677940697529

## Summary of changes

- Use the `tenant_remote_mutation` API for LSN leases. This makes it consistent with the new APIs added to storcon.
- For 404, we now always retry, because we know the tenant is to-be-attached and will eventually reach a point where we can find it on the intent pageserver.
- Using the `tenant_remote_mutation` API also protects us from the case where the intent pageserver changes within the lease request. The wrapper function will error with 503 if that happens.

Signed-off-by: Alex Chi Z <[email protected]>
8050 tests run: 7402 passed, 0 failed, 648 skipped (full report)

Flaky tests (3): Postgres 16, Postgres 14

Code coverage* (full report)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 9ea6db2 at 2025-07-18T08:07:40.306Z :recycle:
MMeent approved these changes on Jul 18, 2025.