This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Federation catchup doesn't send to_device EDUs until the remote end has caught up #8691
Labels
A-Federation
O-Occasional
Affects or can be seen by some users regularly or most users rarely
S-Minor
Blocks non-critical functionality, workarounds exist.
T-Defect
Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Description
When a remote server falls behind on federation, Synapse backs off and starts batching up requests. Usually this isn't too bad, as the remote end will only be one or two transactions behind; more serious occurrences, however, can put the server behind by hundreds of transactions or thousands of events.
Many of the messages could be encrypted, which means they may be accompanied by to_device EDUs that clients need in order to decrypt them. If the EDUs aren't sent as part of the catchup transactions, clients may be unable to decrypt messages, making users sad/angry.
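To illustrate the behaviour being asked for, here is a minimal sketch of a sender that drains queued EDUs alongside PDUs during catchup instead of holding them back until the backlog is cleared. It is not Synapse's implementation: `Transaction` and `build_catchup_transactions` are hypothetical names, and the queues are plain lists. The per-transaction limits of 50 PDUs and 100 EDUs are the ones from the Matrix federation spec.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Per-transaction limits from the Matrix server-server spec.
MAX_PDUS_PER_TXN = 50
MAX_EDUS_PER_TXN = 100


@dataclass
class Transaction:
    """Hypothetical stand-in for a federation transaction body."""
    pdus: List[dict] = field(default_factory=list)
    edus: List[dict] = field(default_factory=list)


def build_catchup_transactions(
    pending_pdus: List[dict], pending_edus: List[dict]
) -> Iterator[Transaction]:
    """Yield transactions that carry queued EDUs alongside queued PDUs.

    Assumption: both queues are already in send order. Each transaction
    takes up to 50 PDUs and up to 100 EDUs, so EDUs go out with the
    earliest catchup transactions rather than after the last one.
    """
    pdu_pos = 0
    edu_pos = 0
    while pdu_pos < len(pending_pdus) or edu_pos < len(pending_edus):
        txn = Transaction(
            pdus=pending_pdus[pdu_pos:pdu_pos + MAX_PDUS_PER_TXN],
            edus=pending_edus[edu_pos:edu_pos + MAX_EDUS_PER_TXN],
        )
        pdu_pos += len(txn.pdus)
        edu_pos += len(txn.edus)
        yield txn


if __name__ == "__main__":
    # Made-up backlog: 120 PDUs and 30 to_device EDUs.
    pdus = [{"event_id": f"$event{i}"} for i in range(120)]
    edus = [{"edu_type": "m.direct_to_device", "n": i} for i in range(30)]
    for n, txn in enumerate(build_catchup_transactions(pdus, edus)):
        print(n, len(txn.pdus), len(txn.edus))
```

With this shape, a backlog of 120 PDUs and 30 EDUs produces three transactions, and all 30 EDUs travel in the first one. A real sender would also need to handle retries and ordering between EDUs and the events they relate to, which this sketch ignores.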
Here's an example of this happening in real life:
For background on this graph: t2bot.io (the server in question) runs 2 federation readers, 1 of which (03) is dedicated to just handling matrix.org's traffic. The other (04) is left to handle any other random server which might exist in the wild.
In the graph, t2bot.io was behind on matrix.org's transactions and thus had a very spiky waveform due to the 50-PDU transactions having to be retried. When it did catch up, it was also met with all the EDUs it had missed, creating a significant spike. Traffic after that is normal.
This has been observed on several catchups already, but was only noticed today (with Synapse 1.22.0). It's unclear whether this is also an issue in prior versions of Synapse, or is specific to matrix.org's federation sender.
Version information
If not matrix.org:
Version: 1.22.0 (with minor, unrelated, patches)
Install method: pip