Retry certain HTTP replication failures (DNS, could not connect) #12178

**Description:**

We run Synapse with workers in Kubernetes, where the event persister workers are deployed as a `StatefulSet`. Events must be persisted by a specific worker, i.e. having 2 event persisters does not provide high availability.

This means that during an update there is a small (~60s) window in which replication HTTP requests to a worker that is restarting will fail, either with a DNS resolution failure (K8s has not yet created the DNS record for the new pod) or with some kind of could-not-connect failure because the process is not yet listening for traffic (the exact exception still needs to be confirmed).

There is [already logic to retry on timeouts](https://github.com/matrix-org/synapse/blob/develop/synapse/replication/http/_base.py#L255), which is the default for all HTTP replication requests. This implies that retrying requests is a safe operation.

Therefore I think it'd be great to retry in the cases above as well, probably with some exponential backoff and a time limit so that requests don't hang indefinitely or time out on the client. I'll have a go at implementing a POC roughly along these lines.
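
To make the proposal concrete, here is a minimal sketch of retrying with exponential backoff under an overall time limit. It is not Synapse code: `retry_with_backoff`, `send_request`, `RETRYABLE_ERRORS` and all the default values are hypothetical, it uses plain `asyncio` rather than Synapse's actual HTTP client, and the stdlib exceptions `socket.gaierror` and `ConnectionRefusedError` only stand in for whatever the real client raises, which still needs confirming.

```python
import asyncio
import socket
import time
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")

# Assumption: stdlib stand-ins for "DNS resolution failed" and "could not
# connect". The exceptions Synapse's client really raises are not confirmed.
RETRYABLE_ERRORS: Tuple[Type[BaseException], ...] = (
    socket.gaierror,
    ConnectionRefusedError,
)


async def retry_with_backoff(
    send_request: Callable[[], Awaitable[T]],
    *,
    initial_delay: float = 0.5,  # hypothetical default, in seconds
    max_delay: float = 8.0,      # cap on the delay between attempts
    time_limit: float = 60.0,    # roughly the restart window described above
) -> T:
    """Call send_request until it succeeds or time_limit has elapsed."""
    deadline = time.monotonic() + time_limit
    delay = initial_delay
    while True:
        try:
            return await send_request()
        except RETRYABLE_ERRORS:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                # Out of time: surface the last failure to the caller.
                raise
            # Never sleep past the deadline, then double the delay up to the cap.
            await asyncio.sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
```

Any exception not listed in `RETRYABLE_ERRORS` propagates immediately, so only the failures expected during a restart are retried. The time limit bounds the total wait, which keeps a request from hanging if the worker never comes back.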
