This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Federation catchup doesn't send to_device EDUs until the remote end has caught up #8691

Open
turt2live opened this issue Oct 30, 2020 · 5 comments
Labels
A-Federation O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@turt2live
Member

Description

When a remote server falls behind on federation, Synapse backs off and starts batching up requests. Usually this isn't too bad, as the remote end will only be maybe 1 or 2 transactions behind; however, more serious occurrences can put the server behind by hundreds of transactions or thousands of events.

Many of the messages could be encrypted, which means they'll potentially be accompanied by to_device EDUs that clients need in order to decrypt them. If the EDUs aren't sent as part of the catchup transactions, clients may be unable to decrypt messages, which makes users sad/angry.

Here's an example of this happening in real life:
image

For background on this graph: t2bot.io (the server in question) runs 2 federation readers, 1 of which (03) is dedicated to just handling matrix.org's traffic. The other (04) is left to handle any other random server which might exist in the wild.

In the graph, t2bot.io was behind on matrix.org's transactions and thus had a very spiky waveform due to the 50-PDU transactions having to be retried. When it did catch up, it was also met with all the EDUs it missed, creating a significant spike. Traffic after that is then normal.
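
As a rough illustration of the scale involved, the sketch below splits a backlog of PDUs into transactions of at most 50 PDUs, which is the per-transaction limit in the federation spec. The helper `chunk_backlog` and the backlog size are made up for this example; this is not Synapse's sender code, which may choose what to send during catchup quite differently.

```python
from typing import Iterator, List

# The Matrix federation spec caps a transaction at 50 PDUs.
MAX_PDUS_PER_TRANSACTION = 50


def chunk_backlog(
    pdus: List[str], size: int = MAX_PDUS_PER_TRANSACTION
) -> Iterator[List[str]]:
    """Hypothetical helper: yield successive batches of at most `size` PDUs."""
    for start in range(0, len(pdus), size):
        yield pdus[start:start + size]


# Made-up backlog of 5020 events for a destination that has fallen behind.
backlog = [f"$event{i}" for i in range(5020)]
transactions = list(chunk_backlog(backlog))

print(len(transactions))      # 101 transactions needed to drain the backlog
print(len(transactions[-1]))  # 20 PDUs in the final, partial transaction
```

A backlog of a few thousand events therefore already means on the order of a hundred transactions, each of which can fail and be retried.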

This has been observed to happen on several catchups already, and only noticed today (with Synapse 1.22.0) - it's unclear if this is an issue in prior versions of synapse, or is a matrix.org federation sender-specific issue.

Version information

  • Homeserver: t2bot.io

If not matrix.org:

  • Version: 1.22.0 (with minor, unrelated, patches)

  • Install method: pip

  • Platform: Ubuntu 20.04, bare metal
@erikjohnston
Member

Looks like we prioritise device list updates first: https://github.com/matrix-org/synapse/blob/develop/synapse/federation/sender/per_destination_queue.py#L255-L266, which feels like it is probably the wrong way round.
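
To illustrate why the ordering could matter, here is a minimal sketch, not Synapse's actual code. It assumes a per-transaction EDU budget (the federation spec caps a transaction at 100 EDUs) that is filled from pending queues in a fixed priority order. The function `fill_edu_budget`, the helper `count_to_device`, and the queue contents are all hypothetical.

```python
from typing import List, Sequence

# The Matrix federation spec caps a transaction at 100 EDUs.
MAX_EDUS_PER_TRANSACTION = 100


def fill_edu_budget(
    queues: Sequence[List[str]], limit: int = MAX_EDUS_PER_TRANSACTION
) -> List[str]:
    """Hypothetical helper: take EDUs from each queue, in priority order,
    until the per-transaction limit is reached."""
    picked: List[str] = []
    for queue in queues:
        remaining = limit - len(picked)
        if remaining <= 0:
            break
        picked.extend(queue[:remaining])
    return picked


def count_to_device(edus: List[str]) -> int:
    """Count the to_device EDUs in a batch (illustrative only)."""
    return sum(edu.startswith("m.direct_to_device") for edu in edus)


# Made-up backlog: many device list updates, a handful of to_device messages.
device_list_updates = [f"m.device_list_update:{i}" for i in range(150)]
to_device_messages = [f"m.direct_to_device:{i}" for i in range(5)]

# Device list updates first: the budget is used up before any to_device EDU fits.
device_lists_first = fill_edu_budget([device_list_updates, to_device_messages])
# to_device first: all five go out, and device list updates fill the rest.
to_device_first = fill_edu_budget([to_device_messages, device_list_updates])

print(count_to_device(device_lists_first))  # 0
print(count_to_device(to_device_first))     # 5
```

Under these assumptions, a large queue of device list updates keeps to_device messages out of every transaction until that queue drains, while the swapped order delivers them immediately.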

@richvdh
Member

richvdh commented Nov 30, 2020

this is a particular problem, because if your server spends a lot of time lagging behind, then you can end up receiving room events but never the e2e keys for those events.

@verymilan

> This has been observed to happen on several catchups already, and only noticed today (with Synapse 1.22.0) - it's unclear if this is an issue in prior versions of synapse, or is a matrix.org federation sender-specific issue.

it could totally be a coincidence, but i actually feel like this happened after a recent synapse update (#8838)

@richvdh
Member

richvdh commented Nov 30, 2020

It's almost certainly been exacerbated by the fixes to #2528.

@mtippmann

This nasty issue is still there. Our homeserver had 3 days of downtime, and it took about 4 days until messages from matrix.org could be decrypted. The messages came through instantly, but decryption was not possible due to missing keys from matrix.org.

We also run several federation_sender and generic worker instances, so it's unlikely that it's overload on our side due to the downtime. While debugging, we heard reports from other homeserver admins that they are familiar with such issues after longer downtimes.

Would be great if this could be fixed sometime.

@H-Shay H-Shay added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jan 14, 2022
@DMRobertson DMRobertson added O-Occasional Affects or can be seen by some users regularly or most users rarely and removed z-enhancement labels Aug 25, 2022

7 participants