This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Homeservers don't catch up with missed traffic until someone sends another event #2528

Closed
richvdh opened this issue Oct 11, 2017 · 15 comments
Labels
z-bug (Deprecated Label)

Comments

@richvdh
Member

richvdh commented Oct 11, 2017

If a homeserver goes offline, it may miss events. There is then no reliable means for it to catch up. Currently we rely on someone else sending an event in the room, which will then make it do a backfill.

Potential solutions include:

  • Making /pull work properly, and use it (#2527)
  • Proactively sending the data once we hear the server is back online (#2526)

@ara4n
Member

ara4n commented Jun 12, 2019

some keywords: federation outage, resync room, server offline, recover

@OlegGirko
Contributor

> some keywords: federation outage, resync room, server offline, recover

This is not just about a temporary outage followed by recovery. A partial netsplit can exist by design during a transition to another underlying networking protocol. Imagine nodes that exist only in the I2P network (or any other privacy-oriented network). These nodes cannot communicate directly with normal HTTPS-based nodes that are not I2P-aware, but messages can be relayed through nodes that are dual-stack.

I think a partial netsplit should be considered not a temporary outage, but part of normal operation.

@ara4n
Member

ara4n commented Jan 3, 2020

i'm guessing this is the cause of https://twitter.com/jomatv6/status/1213161573759541256.

Fixing the fact that we don't retry transactions when a server reappears seems very worthwhile (especially in a p2p world).

@erikjohnston
Member

To limit the number of events we send, we can restrict ourselves to only the most recent N events for each room that has missed updates.

Side note: sending old events down /send will cause the remote server and clients to think the event is "new" and notify users as normal. This may or may not be desired

@ara4n
Member

ara4n commented Feb 19, 2020

just missed a relatively important message from @benbz thanks to this >:(

@ara4n
Member

ara4n commented Jun 1, 2020

re-triaging this, because (as per the examples in the comments here) it's pretty disastrous that messages can just get stuck forever because of a brief federation outage. I get bitten by this every few weeks.

@MayeulC

MayeulC commented Jun 3, 2020

I will put my thoughts down regarding this. I don't really know Synapse's internal architecture, but here is what I imagine:

Homeservers could keep a list of the other homeservers they are in contact with, along with the timestamp of the last successful connection (or the timestamp of the first error, if you want to update the table less often).


For every event:

  1. Homeserver attempts to send an event
  2. Event sending fails
  3. The homeserver adds the event to the list of failed events and updates the last_failure timestamp for that homeserver if it is NULL.

In parallel, exponential backoff is implemented at the homeserver level, not at the room level.

  1. Check every homeserver in the list whose last_failure is not NULL (or whose last_failure is NULL but whose missed-events table is not empty).
  2. Apply exponential backoff rules to determine when the next attempt will be.
  3. Try contacting again every server that meets the condition.
  4. Send every missed event to the homeservers that reply

And incoming connections can be a source of truth: when receiving a connection from a server that was previously down, clear its last_failure field.
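
For concreteness, here is a minimal sketch of the bookkeeping this scheme implies. The schema and helper names are hypothetical (nothing here is Synapse's actual code):

```python
# Sketch of the per-destination bookkeeping described above; schema and
# function names are hypothetical, not Synapse's actual code.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE destinations (
        destination  TEXT PRIMARY KEY,
        last_failure INTEGER,               -- NULL while the destination is healthy
        retries      INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE missed_events (destination TEXT, room_id TEXT, event_id TEXT);
""")

def record_send_failure(destination, room_id, event_id):
    """Steps 2-3 above: remember the event; set last_failure only if it is NULL."""
    db.execute("INSERT OR IGNORE INTO destinations (destination) VALUES (?)",
               (destination,))
    db.execute("INSERT INTO missed_events VALUES (?, ?, ?)",
               (destination, room_id, event_id))
    db.execute("UPDATE destinations SET last_failure = COALESCE(last_failure, ?) "
               "WHERE destination = ?", (int(time.time()), destination))

def destinations_due_for_retry(base=60, cap=6 * 3600):
    """Exponential backoff per destination, not per room. (`retries` would be
    incremented each time a retry attempt fails; omitted for brevity.)"""
    now = int(time.time())
    rows = db.execute(
        "SELECT destination, last_failure, retries FROM destinations "
        "WHERE last_failure IS NOT NULL").fetchall()
    return [dest for dest, failed_at, retries in rows
            if now >= failed_at + min(base * 2 ** retries, cap)]

def mark_destination_healthy(destination):
    """Incoming traffic as a source of truth: clear last_failure on contact."""
    db.execute("UPDATE destinations SET last_failure = NULL, retries = 0 "
               "WHERE destination = ?", (destination,))
```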


A couple of issues/thoughts with this implementation:

  • Requires storing a large number of events. This can be bypassed simply by storing a room id instead of an event where applicable, and sending a dummy event to the server once it comes back online, which will make it backfill
  • I am afraid this scheme might not work well for misconfigured servers (NAT, firewall, incorrect DNS records) that can only send, and not receive, data. The last_failure would be cleared when they send an event, and set again right away when the homeserver tries to contact them.
  • Servers that have been offline for a while will get hammered with requests once they send something again. Maybe a random waiting time could be added, scaled by the number of events to send? The servers will schedule backfilling as they see fit, anyway.

This is even more important if we think about a p2p world, and is one of the glaring issues with Matrix at present.

In parallel, it would be nice for servers to schedule a sync at startup.

@richvdh richvdh added z-p2 (Deprecated Label) p1 and removed p1 z-p2 (Deprecated Label) labels Jun 4, 2020
@neilisfragile
Contributor

In case it is helpful, here is an implementation in Dendrite.
matrix-org/dendrite#1077

@neilisfragile
Contributor

Next steps:

  • Assess the above
  • Propose a solution and review from the core team
  • Fix it

@reivilibre
Contributor

> Making /pull work properly, and use it (#2527)

In #2527, you have said /pull is no longer a thing.

> Proactively sending the data once we hear the server is back online (#2526)

In #2526, you have said 'And the fact we hear from a remote HS doesn't necessarily mean it's ready to receive unsent transactions. #2527 may be a better solution.'... oops.

In any case, barring any new ideas, I take it we are left with #2526.

I'm tempted to take this on, but don't know if I know enough about federation (but it might be a worthwhile chance to learn!).

I'm also tempted to believe that only the recovering homeserver is in the best position to know whether it is 'ready' to catch up.

The idea of 'hearing that a homeserver is back online' seems a tad frail for some low-traffic cases; would it be the case that a recovering HS should go and prod others so it gets noticed?

I will study the Dendrite implementation.

@richvdh
Member Author

richvdh commented Jul 12, 2020

I've been doing a bit of thinking about this over the weekend.

First of all I think there are a couple of essential requirements for any solution here:

  • It must not interfere with the current mechanism by which matrix rooms can have "gaps" in their history. That is to say, if my homeserver goes offline for a month, I do not expect to be bombarded with an entire month's history of events when it comes back. What I do want is the last few events in each room I'm in, together with any invites I've missed.

    In other words: retrying the same messages over and over again until they are successful as described in @MayeulC's comment above is a non-solution here.

  • It must be resilient to synapse restarts. Suppose server B goes down; server A has some events which it is trying to send to server B. It is no good if server A forgets about those events when it is restarted. In other words: we need to keep enough information in the database that server A knows which rooms it needs to catch server B up on when it comes back.


Now, I see two sides to this problem. The first is how we decide to resend data over to the other homeserver; the second is figuring out what that data should be.

The first part is relatively easy: as per the opening comment, options include having the other server pull or resending when we get an incoming request from the remote server. Another option might just be an extended retry schedule.

We essentially already have this mechanism today: when we successfully receive an incoming transaction, we will try sending any device list updates or to-device messages (see https://github.com/matrix-org/synapse/blob/release-v1.16.1/synapse/federation/sender/__init__.py#L494). To my mind, this is by far the most promising option for a hook for resending room events too, although as noted in #2526 (comment) and in @MayeulC's comment above, this solution is not without its downsides. Still, it'll be fine for a first pass.
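
For illustration, the shape of that hook might look something like the following sketch (class and method names are hypothetical, not Synapse's actual federation code):

```python
# Illustrative only: hypothetical names, not Synapse's real classes.
class FederationSenderStub:
    def wake_destination(self, destination: str) -> None:
        # Existing behaviour: retry pending device-list updates / to-device
        # messages for `destination`.
        # Proposed extension: also work out which rooms `destination` has
        # missed events in, and resend the latest event for each.
        print(f"waking {destination}")

class FederationReceiverStub:
    def __init__(self, sender: FederationSenderStub) -> None:
        self.sender = sender

    def on_incoming_transaction(self, origin: str, transaction: dict) -> None:
        # ... process the PDUs/EDUs in `transaction` ...
        # Hearing from `origin` at all implies it is back online, so poke the
        # sender to flush anything queued for it.
        self.sender.wake_destination(origin)
```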

(Incidentally, it looks to me as though matrix-org/dendrite#1077 is all about this first part of deciding when to send new data, rather than what to send. It also appears to be focussed somewhat on the P2P case, though maybe I'm missing something. We should check with the Dendrite folks - and indeed other HS developers - if they have a more complete solution to this problem though.)


So, the second part of the problem: once server A decides (by whatever mechanism) that server B is back in the game, how does it know what to send?

An idea:

Currently, synapse's outgoing federation logic iterates through the event stream in stream_ordering order, deciding where each event in that stream ought to be sent to. (see https://github.com/matrix-org/synapse/blob/release-v1.16.1/synapse/federation/sender/__init__.py#L153 and following). I propose that we maintain a table destination_rooms mapping (destination, room_id) to latest stream_ordering, which we update each time an event passes through this loop.

We also add to the current destinations table a last_successful_stream_ordering column which records the stream_ordering of the last event successfully sent to that destination - updated after each successful transaction transmission.

So then, in wake_destination, we compare the last_successful_stream_ordering of the awoken destination with that of each room in destination_rooms, and thereby get a list of rooms which need an update. For each such room, we send over the event with the relevant stream_ordering. This should trigger a backfill request from the other end.
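
A cut-down sketch of that bookkeeping, assuming an illustrative schema (not an actual Synapse migration):

```python
# Sketch of the destination_rooms / last_successful_stream_ordering proposal
# above; schema and function names are illustrative, not Synapse's.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE destinations (
        destination                     TEXT PRIMARY KEY,
        last_successful_stream_ordering INTEGER
    );
    CREATE TABLE destination_rooms (
        destination     TEXT,
        room_id         TEXT,
        stream_ordering INTEGER NOT NULL,   -- latest event we tried to send
        PRIMARY KEY (destination, room_id)
    );
""")

def note_outgoing_event(destination, room_id, stream_ordering):
    """Called from the outgoing-federation loop for each (event, destination)."""
    db.execute(
        "INSERT INTO destination_rooms (destination, room_id, stream_ordering) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (destination, room_id) DO UPDATE SET stream_ordering = ?",
        (destination, room_id, stream_ordering, stream_ordering))

def rooms_needing_catchup(destination):
    """On wake_destination: rooms whose latest attempted event is newer than the
    last event known to have reached this destination. Sending that one event
    per room should trigger a backfill from the other side."""
    return db.execute(
        "SELECT room_id, stream_ordering FROM destination_rooms "
        "WHERE destination = ? AND stream_ordering > COALESCE("
        "  (SELECT last_successful_stream_ordering FROM destinations "
        "   WHERE destination = ?), 0)",
        (destination, destination)).fetchall()
```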

Outstanding problems:

  • once we get an incoming request, we will clear the backoff, which might mean a regular outgoing transaction overwrites last_successful_stream_ordering before we get a chance to do the catchup. How can we avoid this?

@erikjohnston
Member

That all broadly makes sense. I think the interesting thing to note is that when catching up we're going to have to support doing so in batches (to stay under the event limit). That sort of implies that we need to keep ranges in the DB recording the gaps we still need to send to a remote server.

I think one way of doing that is to also have a table destination_event_gaps of destination, start_stream_ordering and end_stream_ordering, where each row is a known gap. Then:

  1. After successfully sending a transaction, we:
    a. update destination.last_successful_stream_ordering with the latest stream ordering in the transaction,
    b. update destination_event_gaps.end_stream_ordering with the first stream ordering in the transaction for any rows for that destination where end_stream_ordering is NULL.
  2. After a transaction fails to send and we drop it, we insert a row into destination_event_gaps for the destination (if one with a NULL end_stream_ordering does not already exist), setting start_stream_ordering to destination.last_successful_stream_ordering.

This way destination_event_gaps is a table that records all gaps, where a NULL end_stream_ordering means that the gap extends to "now". If a remote server is accepting transactions and has an entry in destination_event_gaps, then we can include events from the gap in the transaction, updating (or removing) the row after successfully sending the transaction.

On startup we could also go through all destinations and check destination_rooms to see whether any events failed to send before we last stopped, inserting corresponding rows into destination_event_gaps; these will then get picked up as part of the process for sending missed events.
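
A rough sketch of those transitions, assuming the destinations table from the earlier proposal plus the hypothetical destination_event_gaps table (illustrative rather than actual Synapse code):

```python
# Sketch of the gap bookkeeping described above. Assumes tables
# destinations(destination, last_successful_stream_ordering) and
# destination_event_gaps(destination, start_stream_ordering, end_stream_ordering);
# a NULL end_stream_ordering means "the gap extends to now".

def on_transaction_sent(db, destination, first_stream_ordering, last_stream_ordering):
    """Step 1: a transaction was successfully sent."""
    db.execute(
        "UPDATE destinations SET last_successful_stream_ordering = ? "
        "WHERE destination = ?",
        (last_stream_ordering, destination))
    # Any open-ended gap now ends just before this transaction.
    db.execute(
        "UPDATE destination_event_gaps SET end_stream_ordering = ? "
        "WHERE destination = ? AND end_stream_ordering IS NULL",
        (first_stream_ordering, destination))

def on_transaction_dropped(db, destination):
    """Step 2: a transaction failed and was dropped; open a gap if none is open."""
    (open_gaps,) = db.execute(
        "SELECT COUNT(*) FROM destination_event_gaps "
        "WHERE destination = ? AND end_stream_ordering IS NULL",
        (destination,)).fetchone()
    if open_gaps == 0:
        db.execute(
            "INSERT INTO destination_event_gaps "
            "(destination, start_stream_ordering, end_stream_ordering) "
            "SELECT destination, last_successful_stream_ordering, NULL "
            "FROM destinations WHERE destination = ?",
            (destination,))
```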

@richvdh
Member Author

richvdh commented Jul 13, 2020

> I think the interesting thing to note is that when catching up we're going to have to support doing so in batches (to stay under the event limit). That sort of implies that we need to keep ranges in the DB recording the gaps we still need to send to a remote server.

As discussed in #synapse-dev: there's an alternative solution to this, which seems simpler and I'd like to propose as an initial implementation.

Essentially, we make sure to catch up with the least-recently-updated rooms first. In other words, on each transaction attempt, we do SELECT room_id, stream_ordering FROM destination_rooms WHERE stream_ordering > <last_successful_stream_ordering> ORDER BY stream_ordering LIMIT 50, and only start popping events off the in-memory queue once that query returns no results.

A concern with this mechanism is that we might never be able to catch up with real time in this way, since it's much slower for a server to have to request missed data with get_missing_events than it is for it to be pushed that data in order. Nevertheless I think this is worth a go; if it proves unworkable we may have to do as Erik suggests to allow us to continue pushing new data in real time at the same time as we catch up with older rooms.
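
As a sketch, each pass that builds a transaction for a destination might look like this (names and signatures are illustrative, not Synapse's actual transaction queue code):

```python
# Illustrative sketch of "catch up oldest rooms first" as described above.
CATCHUP_QUERY = """
    SELECT room_id, stream_ordering FROM destination_rooms
    WHERE destination = ? AND stream_ordering > ?
    ORDER BY stream_ordering LIMIT 50
"""

def pick_events_for_transaction(db, destination, last_successful_stream_ordering,
                                pending_queue, load_event):
    """Return the events to put in the next transaction to `destination`."""
    rows = db.execute(
        CATCHUP_QUERY, (destination, last_successful_stream_ordering)).fetchall()
    if rows:
        # Still catching up: send only the newest known event per stale room;
        # the remote side is expected to fetch the rest via get_missing_events.
        return [load_event(room_id, so) for room_id, so in rows]
    # Fully caught up: resume draining the live in-memory queue as normal.
    return list(pending_queue)
```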

reivilibre added a commit to matrix-org/sygnal that referenced this issue Jul 14, 2020
@reivilibre
Contributor

One note that may be worth considering (but probably not for now)

> It must not interfere with the current mechanism by which matrix rooms can have "gaps" in their history. That is to say, if my homeserver goes offline for a month, I do not expect to be bombarded with an entire month's history of events when it comes back. What I do want is the last few events in each room I'm in, together with any invites I've missed.

There is a small issue here, in that a homeserver may miss events for which notifications should have been generated, e.g. mentions.

Perhaps this sounds silly for a month-long outage, but if we take the example of an outage of a few hours in a moderately busy room, it doesn't seem silly to me to call it a reliability problem if I am not notified when someone mentions me.

@richvdh
Member Author

richvdh commented Sep 23, 2020

I think this is fixed by #8272 and previous PRs.
