Draft: Communicator rewrite #12692

Draft
wants to merge 298 commits into main

Conversation

hlinnaka
Contributor

There's still a lot of work to be done here, but let's open a PR now so that we get CI coverage on this, and compute image builds that we can try out in staging.

bizwark and others added 30 commits May 19, 2025 06:33
…with-integrated-cache

There were conflicts because of differences between the page_api
protocol that was merged to main and what was on the branch. I adapted
the code to the protocol in main.
Use that instead of the half-baked Adaptive Radix Tree
implementation. ART would probably be better in the long run, but more
complicated to implement.
When the LFC is shrunk, we punch holes in the underlying file to
release the disk space to the OS. We used to track the holes in the
same hash table as the in-use entries, because that was convenient.
However, I'm working on being able to shrink the hash table too, and
once we do that, we'll need some other place to track the holes.
Implement a simple scheme of an in-memory array and a chain of
on-disk blocks for that.
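The LFC itself is C code; the following is only a minimal Rust sketch of the hole-punching idea, with an assumed block size and hypothetical names, not the code in this PR. The on-disk overflow chain mentioned above is left out.

    use std::fs::File;
    use std::os::unix::io::AsRawFd;

    const BLOCK_SIZE: i64 = 8192; // assumed LFC block size, for illustration only

    /// Punch out one cache block and remember it in a simple in-memory free list.
    fn punch_hole(file: &File, block_no: u64, free_blocks: &mut Vec<u64>) -> std::io::Result<()> {
        let offset = block_no as i64 * BLOCK_SIZE;
        // FALLOC_FL_PUNCH_HOLE must be paired with FALLOC_FL_KEEP_SIZE: the file
        // length stays the same, but the range becomes a hole and the underlying
        // disk space is released back to the filesystem.
        let ret = unsafe {
            libc::fallocate(
                file.as_raw_fd(),
                libc::FALLOC_FL_PUNCH_HOLE | libc::FALLOC_FL_KEEP_SIZE,
                offset,
                BLOCK_SIZE,
            )
        };
        if ret != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // Track the hole so the block can be handed out again when the cache grows.
        free_blocks.push(block_no);
        Ok(())
    }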
A runtime setting is nicer, but the next commit will replace the hash
table with a different implementation that requires the value size to
be a compile-time constant.
The new implementation lives in a separately allocated shared memory
area, which could be resized. Resizing it isn't actually implemented
yet, though. It would require some co-operation from the LFC code.
hlinnaka added 30 commits July 30, 2025 17:31
This makes the test_replica_query_race test pass, and probably some
other read replica tests too.
More logging is useful during debugging, but it's time to crank it down a
notch...
I made this change to the is_write==true case earlier already, but the
is_write==false codepath needs the same treatment.
Fixes remaining test_hot_standby.py failures
This adds a new request type between backend and communicator, to make
a getpage request at a given LSN, bypassing the LFC. Only used by the
get_raw_page_at_lsn() debugging/testing function.
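As a purely illustrative sketch (the names and fields below are assumptions, not the actual protocol types in this PR), the new request can be thought of as an extra variant alongside the normal getpage request:

    // Illustrative only: hypothetical names and layout, not the PR's real types.
    type Lsn = u64;

    struct RelTag {
        spc_oid: u32,
        db_oid: u32,
        rel_number: u32,
        fork_number: u8,
    }

    enum BackendRequest {
        // Normal getpage path, which may be served from the LFC.
        GetPage { rel: RelTag, block_no: u32 },
        // Debug/testing path for get_raw_page_at_lsn(): always asks the
        // pageserver for the page as of the given LSN and bypasses the LFC.
        GetRawPageAtLsn { rel: RelTag, block_no: u32, lsn: Lsn },
    }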
Switch to the 'measured' crate everywhere in the communicator. Connect
the allocator metrics to the metrics endpoint.
compute_ctl does it based on prefer_protocol now
Pass back a suitable 'errno' from the communicator process to the
originating backend in all cases. Usually it's just EIO because we
don't have a good way to map from tonic StatusCodes to libc error
numbers. That's probably good enough; from the original backend's
perspective all errors are IO errors.

In the C code, set the libc errno variable before calling ereport(), so
that errcode_for_file_access() works. And once we do that, we can
replace pg_strerror() calls with %m.
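On the Rust side, the mapping amounts to something like the sketch below. The specific non-EIO cases are assumptions for illustration; as noted above, the PR mostly just returns EIO.

    // Hedged sketch: map a tonic error to a libc errno for the originating
    // backend. The non-EIO cases are illustrative assumptions.
    fn status_to_errno(status: &tonic::Status) -> i32 {
        use tonic::Code;
        match status.code() {
            Code::NotFound => libc::ENOENT,
            Code::DeadlineExceeded => libc::ETIMEDOUT,
            Code::PermissionDenied | Code::Unauthenticated => libc::EACCES,
            // From the backend's perspective, everything else is an I/O error.
            _ => libc::EIO,
        }
    }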
The get_num_shards() function, called from the WAL proposer, requires
it.

Fixes test_timeline_size_quota_on_startup
The error message is just a little different with gRPC.
I added a fixture to run these tests with and without grpc, but missed
passing the option to one endpoint creation.
Namely, this makes it pass with the new communicator, which doesn't do
chunking at all.
The spawned thread didn't have the tokio runtime active, which led to
this error:

    ERROR lsn_lease_bg_task{tenant_id=1bb647cb7d3974b52e74f7442fa7d059 timeline_id=cf41456d3202e1c3940cb8f372d160ab lsn=0/1576000}:panic{thread=<unnamed> location=compute_tools/src/lsn_lease.rs:201:5}: there is no reactor running, must be called from the context of a Tokio 1.x runtime

Fixes `test_readonly_node_gc`
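The usual fix for this class of panic is to capture a Handle to the existing runtime and enter it (or block_on a future) from the spawned thread; the sketch below shows the pattern, not necessarily the exact change made here.

    // Sketch of the standard fix: make the Tokio runtime available inside a
    // plain OS thread by capturing a Handle before spawning it.
    use tokio::runtime::Handle;

    fn spawn_lease_thread() {
        // Must be called from within the runtime (e.g. from async code).
        let handle = Handle::current();
        std::thread::spawn(move || {
            // Entering the handle lets timers, tokio::spawn, etc. work here...
            let _guard = handle.enter();
            // ...or drive the background loop directly:
            // handle.block_on(async { /* renew LSN leases periodically */ });
        });
    }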