This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Homeservers don't catch up with missed traffic until someone sends another event #2528

Closed
richvdh opened this issue Oct 11, 2017 · 15 comments
Labels
z-bug (Deprecated Label)

Comments

@richvdh
Member

richvdh commented Oct 11, 2017

If a homeserver goes offline, it may miss events. There is then no reliable means for it to catch up. Currently we rely on someone else sending an event in the room, which will then make it do a backfill.

Potential solutions include:

  • Making /pull work properly, and use it (#2527)
  • Proactively sending the data once we hear the server is back online (#2526)

@ara4n
Member

ara4n commented Jun 12, 2019

some keywords: federation outage, resync room, server offline, recover

@OlegGirko
Contributor

> some keywords: federation outage, resync room, server offline, recover

This is not just about a temporary outage followed by recovery. A partial netsplit can exist by design during a transition to another underlying networking protocol. Imagine nodes that exist only in the I2P network (or any other privacy-oriented network). These nodes cannot communicate directly with normal HTTPS-based nodes that are not I2P-aware, but messages can be relayed through nodes that are dual-stack.

I think a partial netsplit should be considered not a temporary outage, but part of normal operation.

@ara4n
Member

ara4n commented Jan 3, 2020

i'm guessing this is the cause of https://twitter.com/jomatv6/status/1213161573759541256.

Fixing the fact that we don't retry transactions when a server reappears seems very worthwhile (especially in a p2p world).

@erikjohnston
Member

To limit the number of events we send, we can restrict ourselves to only the most recent N events for each room that has missed updates.

Side note: sending old events down /send will cause the remote server and clients to think the event is "new" and notify users as normal. This may or may not be desired

@ara4n
Member

ara4n commented Feb 19, 2020

just missed a relatively important message from @benbz thanks to this >:(

@ara4n
Member

ara4n commented Jun 1, 2020

re-triaging this, because (as per the examples in the comments here) it's pretty disastrous that messages can just get stuck forever because of a brief federation outage. I get bitten by this every few weeks.

@MayeulC

MayeulC commented Jun 3, 2020

I will put my thoughts down regarding this. I don't really know Synapse's internal architecture, but here is what I imagine:

Homeservers could keep a list of the other homeservers they are in contact with, along with the timestamp of the last successful connection (or the timestamp of the first error, if you want to update the table less often).


For every event:

  1. Homeserver attempts to send an event
  2. Event sending fails
  3. The homeserver adds the event to the list of failed events and updates the last_failure timestamp for that homeserver if it is NULL.

In parallel, exponential backoff is implemented at the homeserver level, not at the room level.

  1. Check every homeserver in the list whose last_failure is not NULL (or whose last_failure is NULL but whose missed-events table is not empty).
  2. Apply exponential backoff rules to determine when the next attempt will be.
  3. Try contacting again every server that meets the condition.
  4. Send every missed event to the homeservers that reply

And incoming connections can be a source of truth: when receiving a connection from a server that was previously down, clear its last_failure field.
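
For concreteness, here is a minimal sketch of the bookkeeping this scheme implies. The schema and helper names are hypothetical (nothing here is Synapse's actual code):

```python
# Sketch of the per-destination bookkeeping described above; schema and
# function names are hypothetical, not Synapse's actual code.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE destinations (
        destination  TEXT PRIMARY KEY,
        last_failure INTEGER,               -- NULL while the destination is healthy
        retries      INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE missed_events (destination TEXT, room_id TEXT, event_id TEXT);
""")

def record_send_failure(destination, room_id, event_id):
    """Steps 2-3 above: remember the event; set last_failure only if it is NULL."""
    db.execute("INSERT OR IGNORE INTO destinations (destination) VALUES (?)",
               (destination,))
    db.execute("INSERT INTO missed_events VALUES (?, ?, ?)",
               (destination, room_id, event_id))
    db.execute("UPDATE destinations SET last_failure = COALESCE(last_failure, ?) "
               "WHERE destination = ?", (int(time.time()), destination))

def destinations_due_for_retry(base=60, cap=6 * 3600):
    """Exponential backoff per destination, not per room. (`retries` would be
    incremented each time a retry attempt fails; omitted for brevity.)"""
    now = int(time.time())
    rows = db.execute(
        "SELECT destination, last_failure, retries FROM destinations "
        "WHERE last_failure IS NOT NULL").fetchall()
    return [dest for dest, failed_at, retries in rows
            if now >= failed_at + min(base * 2 ** retries, cap)]

def mark_destination_healthy(destination):
    """Incoming traffic as a source of truth: clear last_failure on contact."""
    db.execute("UPDATE destinations SET last_failure = NULL, retries = 0 "
               "WHERE destination = ?", (destination,))
```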


A couple of issues/thoughts with this implementation:

  • Requires storing a large number of events. This can be bypassed simply by storing a room id instead of an event where applicable, and sending a dummy event to the server once it comes back online, which will make it backfill
  • I am afraid this scheme might not work well for misconfigured servers (NAT, firewall, incorrect DNS records) that can only send, and not receive, data. The last_failure would be cleared when they send an event, and set again right away when the homeserver tries to contact them.
  • Servers that have been offline for a while will get hammered with requests once they send something again. Maybe a random waiting time could be added, scaled by the number of events to send? The servers will schedule backfilling as they see fit, anyway.

This is even more important if we think about a p2p world, and is one of the glaring issues with Matrix at present.

In parallel, it would be nice for servers to schedule a sync at startup.

@richvdh richvdh added z-p2 (Deprecated Label) p1 and removed p1 z-p2 (Deprecated Label) labels Jun 4, 2020
@neilisfragile
Contributor

In case it is helpful, here is an implementation in Dendrite.
matrix-org/dendrite#1077

@neilisfragile
Contributor

Next steps:

  • Assess the above
  • Propose a solution and review from the core team
  • Fix it

@reivilibre
Contributor

> Making /pull work properly, and use it (#2527)

In #2527, you have said /pull is no longer a thing.

> Proactively sending the data once we hear the server is back online (#2526)

In #2526, you have said 'And the fact we hear from a remote HS doesn't necessarily mean it's ready to receive unsent transactions. #2527 may be a better solution.'... oops.

In any case, barring any new ideas, I take it we are left with #2526.

I'm tempted to take this on, but don't know if I know enough about federation (but it might be a worthwhile chance to learn!).

I'm also tempted to believe that only the recovering homeserver is in the best position to know whether it is 'ready' to catch up.

The idea of 'hearing that a homeserver is back online' seems a tad frail for some low-traffic cases; would it be the case that a recovering HS should go and prod others so it gets noticed?

I will study the Dendrite implementation.

@richvdh
Member Author

richvdh commented Jul 12, 2020

I've been doing a bit of thinking about this over the weekend.

First of all I think there are a couple of essential requirements for any solution here:

  • It must not interfere with the current mechanism by which matrix rooms can have "gaps" in their history. That is to say, if my homeserver goes offline for a month, I do not expect to be bombarded with an entire month's history of events when it comes back. What I do want is the last few events in each room I'm in, together with any invites I've missed.

    In other words: retrying the same messages over and over again until they are successful as described in @MayeulC's comment above is a non-solution here.

  • It must be resilient to synapse restarts. Suppose server B goes down; server A has some events which it is trying to send to server B. It is no good if server A forgets about those events when it is restarted. In other words: we need to keep enough information in the database that server A knows which rooms it needs to catch server B up on when it comes back.


Now, I see two sides to this problem. The first is how we decide to resend data over to the other homeserver; the second is figuring out what that data should be.

The first part is relatively easy: as per the opening comment, options include having the other server pull or resending when we get an incoming request from the remote server. Another option might just be an extended retry schedule.

We essentially already have this mechanism today: when we successfully receive an incoming transaction, we will try sending any device list updates or to-device messages (see https://github.com/matrix-org/synapse/blob/release-v1.16.1/synapse/federation/sender/__init__.py#L494). To my mind, this is by far the most promising option for a hook for resending room events too, although as noted in #2526 (comment) and in @MayeulC's comment above, this solution is not without its downsides. Still, it'll be fine for a first pass.
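
For illustration, the shape of that hook might look something like the following sketch (class and method names are hypothetical, not Synapse's actual federation code):

```python
# Illustrative only: hypothetical names, not Synapse's real classes.
class FederationSenderStub:
    def wake_destination(self, destination: str) -> None:
        # Existing behaviour: retry pending device-list updates / to-device
        # messages for `destination`.
        # Proposed extension: also work out which rooms `destination` has
        # missed events in, and resend the latest event for each.
        print(f"waking {destination}")

class FederationReceiverStub:
    def __init__(self, sender: FederationSenderStub) -> None:
        self.sender = sender

    def on_incoming_transaction(self, origin: str, transaction: dict) -> None:
        # ... process the PDUs/EDUs in `transaction` ...
        # Hearing from `origin` at all implies it is back online, so poke the
        # sender to flush anything queued for it.
        self.sender.wake_destination(origin)
```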

(Incidentally, it looks to me as though matrix-org/dendrite#1077 is all about this first part of deciding when to send new data, rather than what to send. It also appears to be focussed somewhat on the P2P case, though maybe I'm missing something. We should check with the Dendrite folks - and indeed other HS developers - if they have a more complete solution to this problem though.)


So, the second part of the problem: once server A decides (by whatever mechanism) that server B is back in the game, how does it know what to send?

An idea:

Currently, synapse's outgoing federation logic iterates through the event stream in stream_ordering order, deciding where each event in that stream ought to be sent to. (see https://github.com/matrix-org/synapse/blob/release-v1.16.1/synapse/federation/sender/__init__.py#L153 and following). I propose that we maintain a table destination_rooms mapping (destination, room_id) to latest stream_ordering, which we update each time an event passes through this loop.

We also add to the current destinations table a last_successful_stream_ordering column which records the stream_ordering of the last event successfully sent to that destination - updated after each successful transaction transmission.

So then, in wake_destination, we compare the last_successful_stream_ordering of the awoken destination with that of each room in destination_rooms, and thereby get a list of rooms which need an update. For each such room, we send over the event with the relevant stream_ordering. This should trigger a backfill request from the other end.
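
A cut-down sketch of that bookkeeping, assuming an illustrative schema (not an actual Synapse migration):

```python
# Sketch of the destination_rooms / last_successful_stream_ordering proposal
# above; schema and function names are illustrative, not Synapse's.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE destinations (
        destination                     TEXT PRIMARY KEY,
        last_successful_stream_ordering INTEGER
    );
    CREATE TABLE destination_rooms (
        destination     TEXT,
        room_id         TEXT,
        stream_ordering INTEGER NOT NULL,   -- latest event we tried to send
        PRIMARY KEY (destination, room_id)
    );
""")

def note_outgoing_event(destination, room_id, stream_ordering):
    """Called from the outgoing-federation loop for each (event, destination)."""
    db.execute(
        "INSERT INTO destination_rooms (destination, room_id, stream_ordering) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (destination, room_id) DO UPDATE SET stream_ordering = ?",
        (destination, room_id, stream_ordering, stream_ordering))

def rooms_needing_catchup(destination):
    """On wake_destination: rooms whose latest attempted event is newer than the
    last event known to have reached this destination. Sending that one event
    per room should trigger a backfill from the other side."""
    return db.execute(
        "SELECT room_id, stream_ordering FROM destination_rooms "
        "WHERE destination = ? AND stream_ordering > COALESCE("
        "  (SELECT last_successful_stream_ordering FROM destinations "
        "   WHERE destination = ?), 0)",
        (destination, destination)).fetchall()
```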

Outstanding problems:

  • once we get an incoming request, we will clear the backoff, which might mean a regular outgoing transaction overwrites last_successful_stream_ordering before we get a chance to do the catchup. How can we avoid this?

@erikjohnston
Member

That all broadly makes sense. I think the interesting thing to note is that when catching up we're going to have to support doing so in batches (to stay under the event limit). That sort of implies that we need to keep ranges in the DB recording the gaps we still need to send to a remote server.

I think one way of doing that is to also have a table destination_event_gaps of destination, start_stream_ordering and end_stream_ordering, where each row is a known gap. Then:

  1. After successfully sending a transaction, we:
    a. update destination.last_successful_stream_ordering with the latest stream ordering in the transaction,
    b. update destination_event_gaps.end_stream_ordering with the first stream ordering in the transaction for any rows for that destination where end_stream_ordering is NULL.
  2. After a transaction fails to send and we drop it, we insert a row into destination_event_gaps for the destination (if one with a NULL end_stream_ordering does not already exist), setting start_stream_ordering to destination.last_successful_stream_ordering.

This way destination_event_gaps is a table that records all gaps, where a NULL end_stream_ordering means that the gap extends to "now". If a remote server is accepting transactions and has an entry in destination_event_gaps, then we can include events from the gap in the transaction, updating (or removing) the row after successfully sending the transaction.

On startup we could also go through all destinations and check destination_rooms to see whether any events failed to send before we last stopped, inserting corresponding rows into destination_event_gaps; these will then get picked up as part of the process for sending missed events.
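
A rough sketch of those transitions, assuming the destinations table from the earlier proposal plus the hypothetical destination_event_gaps table (illustrative rather than actual Synapse code):

```python
# Sketch of the gap bookkeeping described above. Assumes tables
# destinations(destination, last_successful_stream_ordering) and
# destination_event_gaps(destination, start_stream_ordering, end_stream_ordering);
# a NULL end_stream_ordering means "the gap extends to now".

def on_transaction_sent(db, destination, first_stream_ordering, last_stream_ordering):
    """Step 1: a transaction was successfully sent."""
    db.execute(
        "UPDATE destinations SET last_successful_stream_ordering = ? "
        "WHERE destination = ?",
        (last_stream_ordering, destination))
    # Any open-ended gap now ends just before this transaction.
    db.execute(
        "UPDATE destination_event_gaps SET end_stream_ordering = ? "
        "WHERE destination = ? AND end_stream_ordering IS NULL",
        (first_stream_ordering, destination))

def on_transaction_dropped(db, destination):
    """Step 2: a transaction failed and was dropped; open a gap if none is open."""
    (open_gaps,) = db.execute(
        "SELECT COUNT(*) FROM destination_event_gaps "
        "WHERE destination = ? AND end_stream_ordering IS NULL",
        (destination,)).fetchone()
    if open_gaps == 0:
        db.execute(
            "INSERT INTO destination_event_gaps "
            "(destination, start_stream_ordering, end_stream_ordering) "
            "SELECT destination, last_successful_stream_ordering, NULL "
            "FROM destinations WHERE destination = ?",
            (destination,))
```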

@richvdh
Member Author

richvdh commented Jul 13, 2020

> I think the interesting thing to note is that when catching up we're going to have to support doing so in batches (to stay under the event limit). That sort of implies that we need to keep ranges in the DB recording the gaps we still need to send to a remote server.

As discussed in #synapse-dev: there's an alternative solution to this, which seems simpler and I'd like to propose as an initial implementation.

Essentially, we make sure to catch up with the least-recently-updated rooms first. In other words, on each transaction attempt, we do SELECT room_id, stream_ordering FROM destination_rooms WHERE stream_ordering > <last_successful_stream_ordering> ORDER BY stream_ordering LIMIT 50, and only start popping events off the in-memory queue once that query returns no results.

A concern with this mechanism is that we might never be able to catch up with real time in this way, since it's much slower for a server to have to request missed data with get_missing_events than it is for it to be pushed that data in order. Nevertheless I think this is worth a go; if it proves unworkable we may have to do as Erik suggests to allow us to continue pushing new data in real time at the same time as we catch up with older rooms.
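
As a sketch, each pass that builds a transaction for a destination might look like this (names and signatures are illustrative, not Synapse's actual transaction queue code):

```python
# Illustrative sketch of "catch up oldest rooms first" as described above.
CATCHUP_QUERY = """
    SELECT room_id, stream_ordering FROM destination_rooms
    WHERE destination = ? AND stream_ordering > ?
    ORDER BY stream_ordering LIMIT 50
"""

def pick_events_for_transaction(db, destination, last_successful_stream_ordering,
                                pending_queue, load_event):
    """Return the events to put in the next transaction to `destination`."""
    rows = db.execute(
        CATCHUP_QUERY, (destination, last_successful_stream_ordering)).fetchall()
    if rows:
        # Still catching up: send only the newest known event per stale room;
        # the remote side is expected to fetch the rest via get_missing_events.
        return [load_event(room_id, so) for room_id, so in rows]
    # Fully caught up: resume draining the live in-memory queue as normal.
    return list(pending_queue)
```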

reivilibre added a commit to matrix-org/sygnal that referenced this issue Jul 14, 2020
@reivilibre
Contributor

One note that may be worth considering (but probably not for now)

> It must not interfere with the current mechanism by which matrix rooms can have "gaps" in their history. That is to say, if my homeserver goes offline for a month, I do not expect to be bombarded with an entire month's history of events when it comes back. What I do want is the last few events in each room I'm in, together with any invites I've missed.

There is a small issue here, in that a homeserver may miss events for which notifications should have been generated, e.g. mentions.

Perhaps this sounds silly for a month-long outage, but if we take the example of an outage of a few hours in a moderately busy room, it doesn't seem silly to me to call it a reliability problem if I am not notified when someone mentions me.

@richvdh
Member Author

richvdh commented Sep 23, 2020

I think this is fixed by #8272 and previous PRs.
