-
Notifications
You must be signed in to change notification settings - Fork 411
MSC4140: finalised delayed events, and more #19038
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
MSC4140: finalised delayed events, and more #19038
Conversation
- Store sent/cancelled/failed delayed events, i.e. finalised delayed events, and support looking them up to inspect whether a delayed event was sent or not. Set limits on how many finalised events to store. - Support looking up delayed events by ID - Return 200 when retrying the same action (send/cancel) on an already-finalised delayed event, or 409 for a conflicting action - Limit how many delayed events a user may have scheduled at a time
| @@ -0,0 +1 @@ | |||
| Add more support for MSC4140, namely the ability to inspect sent, cancelled, or failed delayed events, aka "finalised" delayed events. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own reference, what is delayed events being used for?
I thought this was related to VoIP stuff (calls) and the new meta is with sticky events.
Are delayed events going to be deprecated in favor of sticky events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @@ -0,0 +1 @@ | |||
| Add more support for MSC4140, namely the ability to inspect sent, cancelled, or failed delayed events, aka "finalised" delayed events. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there Complement changes to go along with this?
I see some TestDelayedEvents Complement tests that are failing: https://github.com/element-hq/synapse/actions/runs/18411628889/job/52465384686?pr=19038
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there Complement changes to go along with this?
No, not yet. I'll add some soon.
I see some
TestDelayedEventsComplement tests that are failing
That's an unrelated failure, which has been flaky for a frustratingly long time. Maybe now is a good time to try to tackle it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem flaky. It's failed all 3 times for both SQLite and Postgres. And it's only TestDelayedEvents
| -- See the GNU Affero General Public License for more details: | ||
| -- <https://www.gnu.org/licenses/agpl-3.0.html>. | ||
|
|
||
| CREATE TABLE finalised_delayed_events ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What the difference between delayed_events.is_processed and the state in finalised_delayed_events?
It looks like finalised_delayed_events holds more info but I'm not immediately seeing the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_processed is for delayed events that have just timed out and are in the process of being sent / persisted to their target room's DAG. Only once a delayed event is successfully sent does it get finalised. (Prior to this PR, it would instead get deleted from the DB.)
The purpose of tracking is_processed is to handle the edge case of the server going down after a delayed event times out, but before it gets sent. Upon server restart, the sending of any is_processed delayed events will be retried.
I'm admittedly not a big fan of this, but I couldn't find a way to make an atomic action out of a delayed event timing out & being sent. I also didn't want to fiddle with that in this PR, even if finalised events now have some redundancy with is_processed events.
| error bytea, | ||
| event_id TEXT, | ||
| finalised_ts BIGINT NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just put this in delayed_events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent was to reduce the number of columns in the delayed_events table, as finalised_ts is relevant only to finalised events. In general, I wanted to keep any finalised-only columns out of the non-finalised delayed_events table, especially if more finalised-only properties get added later (which would allow a schema update to leave the non-finalised delayed_events table alone).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind: 2ec54e4 (meant to post this on Friday)
synapse/storage/schema/main/delta/93/01_add_finalised_delayed_events.sql
Outdated
Show resolved
Hide resolved
| for user_localpart in self.db_pool.simple_select_onecol_txn( | ||
| txn, | ||
| "finalised_delayed_events", | ||
| keyvalues={}, | ||
| retcol="DISTINCT(user_localpart)", | ||
| ): | ||
| self._prune_excess_finalised_delayed_events_for_user( | ||
| txn, user_localpart, retention_limit | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a N+1 query problem.
Can we do this in batches in one big query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, because each prune is a 2-step process of:
- find how many finalised delayed events the target user has
- delete enough of that user's FDEs to stay within the per-user retention limit
I'm not sure of a better way to do it that would involve only a single query.
What should help is the fact this pruning is always done in a transaction, together with whatever other queries need to happen before/after it.
| -- See the GNU Affero General Public License for more details: | ||
| -- <https://www.gnu.org/licenses/agpl-3.0.html>. | ||
|
|
||
| CREATE TABLE finalised_delayed_events ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this separate table is creating the need for a lot of sub-queries (SELECT in SELECT statements) which seems like a smell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative would be to add finalised-related columns to the existing delayed_events table.
I went with the current approach under the belief that it's better to avoid adding many columns to a single table. If that is an ill-founded belief, I will move these columns into the other table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't been stuck in the details as you have but I do know that this current solution doesn't spark joy. Your current approach could be the way to solve it although my gut feeling makes me seek some better alternative.
From my eye, it seems like a delayed event should have a status enum all in one schema.
The alternative would be to add finalised-related columns to the existing
delayed_eventstable.
Optimizing for number of columns doesn't seem like something to worry about. We should more worry about tables growing beyond their concern (trying to handle too many things).
Given that moving the columns to delayed_events also has a benefit in how we can interact with the data, it seems like this is another good reason to go that route.
Giving better advice, probably means I'd have to try solving it myself which I don't want to do. I'll leave the problem-solving on this one in your capable hands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the advice on this. I gave it a try with 2ec54e4, and it does end up being much cleaner.
What would have made separate tables more appropriate is if there were attributes relevant only to scheduled delayed events but not finalised ones, and vice-versa.
Instead of storing details on finalised delayed events in a new, standalone table that must be joined on the base table of delayed events, add columns to the base table to track those details.
ea6b3ca to
048508c
Compare
This satisfies the check-schema-delta script. The index is present for speeding up ORDER BY on finalised_ts.
This should also fix the portdb script
| CREATE INDEX delayed_events_finalised_ts ON delayed_events (finalised_ts); | ||
| INSERT INTO background_updates (update_name, progress_json) VALUES | ||
| ('delayed_events_finalised_ts', '{}'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This had to be added to pass the check-schema-delta job.
Given that this is an index on a new, null-by-default column, is there really a need to add it in the background?
MSC4140: finalised delayed events, and more
Dev notes
Delayed events initially introduced in #17326
Pull Request Checklist
EventStoretoEventWorkerStore.".code blocks.