-
Notifications
You must be signed in to change notification settings - Fork 738
Draft: Communicator rewrite #12692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
hlinnaka
wants to merge
298
commits into
main
Choose a base branch
from
communicator-rewrite
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Draft: Communicator rewrite #12692
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…with-integrated-cache There were conflicts because of the differences in the page_api protocol that was merged to main vs what was on the branch. I adapted the code for the protocol in main.
Use that instead of the half-baked Adaptive Radix Tree implementation. ART would probably be better in the long run, but more complicated to implement.
When the LFC is shrunk, we punch holes in the underlying file to release the disk space to the OS. We tracked it in the same hash table as the in-use entries, because that was convenient. However, I'm working on being able to shrink the hash table too, and once we do that, we'll need some other place to track the holes. Implement a simple scheme of an in-memory array and a chain of on-disk blocks for that.
A runtime setting is nicer, but the next commit will replace the hash table with a different implementation that requires the value size to be a compile-time constant.
The new implementation lives in a separately allocated shared memory area, which could be resized. Resizing it isn't actually implemented yet, though. It would require some co-operation from the LFC code.
The new communicator has its own tracking
This makes the test_replica_query_race test pass, and probably some other read replica tests too.
More logs is useful during debugging, but it's time to crank it down a notch...
I made this change to one the is_write==true case earlier already, but the is_write==false codepath needs the same treatment.
Fixes remaining test_hot_standby.py failures
This adds a new request type between backend and communicator, to make a getpage request at a given LSN, bypassing the LFC. Only used by the get_raw_page_at_lsn() debugging/testing function.
Switch to the 'measured' crate everywhere in the communicator. Connect the allocator metrics to the metrics endpoint.
compute_ctl does it based on prefer_protocol now
Pass back a suitable 'errno' from the communicator process to the originating backend in all cases. Usually it's just EIO because we don't have a good way to map from tonic StatusCodes to libc error numbers. That's probably good enough; from the original backend's perspective all errors are IO errors. In the C code, set libc errno variable before calling ereport(), so that errcode_for_file_access() works. And once we do that, we can replace pg_strerror() calls with %m.
The get_num_shards() function, called from the WAL proposer, requires it. Fixes test_timeline_size_quota_on_startup
The error message is just a little different with gRPC.
I added a fixture to run these tests with and without grpc, but missed passing the option to one endpoint creation.
Namely, this makes it pass with the new communicator, which doesn't do chunking at all.
The spawned thread didn't have the tokio runtime active, which lead to this error: ERROR lsn_lease_bg_task{tenant_id=1bb647cb7d3974b52e74f7442fa7d059 timeline_id=cf41456d3202e1c3940cb8f372d160ab lsn=0/1576000}:panic{thread=<unnamed> location=compute_tools/src/lsn_lease.rs:201:5}: there is no reactor running, must be called from the context of a Tokio 1.x runtime Fixes `test_readonly_node_gc`
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There's still a lot of work to be done here, but open a PR now already so that we get CI coverage on this, and compute image builds that we can try out in staging already.