Federation catchup doesn't send to_device EDUs until the remote end has caught up #8691

turt2live · 2020-10-30T04:59:49Z

Description

When a remote server falls behind on federation, Synapse backs off and starts batching up requests. Usually this isn't too bad, as the remote end will only be maybe 1 or 2 transactions behind; however, more serious occurrences can put the server behind by hundreds of transactions or thousands of events.

Many of the messages could be encrypted, which means they'll be potentially accompanied by to_device EDUs in order to decrypt the messages on the client side. If the EDUs aren't sent as part of the catchup transactions, it's possible for the clients to not be able to decrypt messages and thus make users sad/angry.

Here's an example of this happening in real life:

For background on this graph: t2bot.io (the server in question) runs 2 federation readers, 1 of which (03) is dedicated to just handling matrix.org's traffic. The other (04) is left to handle any other random server which might exist in the wild.

In the graph, t2bot.io was behind on matrix.org's transactions and thus had a very spiky waveform due to the 50-PDU transactions having to be retried. When it did catch up, it was also met with all the EDUs it missed, creating a significant spike. Traffic after that is then normal.
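To make the delay concrete, here is a toy model of the behaviour described above. It is not Synapse code: `Transaction`, `build_catchup_transactions` and `first_txn_with_edus` are hypothetical names invented for this sketch, and it ignores retries, backoff and the per-transaction EDU cap. The only figure taken from the spec is the limit of 50 PDUs per federation transaction.

```python
from dataclasses import dataclass, field

MAX_PDUS_PER_TXN = 50  # per-transaction PDU limit in the Matrix federation spec


@dataclass
class Transaction:
    """Hypothetical stand-in for a federation transaction."""
    pdus: list = field(default_factory=list)
    edus: list = field(default_factory=list)


def build_catchup_transactions(pdus, to_device_edus, defer_edus):
    """Split a PDU backlog into transactions of at most 50 PDUs.

    defer_edus=True models the reported behaviour: to_device EDUs go out
    in an extra transaction only after the PDU backlog has drained.
    defer_edus=False attaches them to the first catch-up transaction.
    """
    txns = [
        Transaction(pdus=pdus[start:start + MAX_PDUS_PER_TXN])
        for start in range(0, len(pdus), MAX_PDUS_PER_TXN)
    ]
    if defer_edus or not txns:
        # EDUs travel in their own transaction at the very end.
        txns.append(Transaction(edus=list(to_device_edus)))
    else:
        txns[0].edus.extend(to_device_edus)
    return txns


def first_txn_with_edus(txns):
    """Return the index of the first transaction carrying any EDU."""
    for index, txn in enumerate(txns):
        if txn.edus:
            return index
    return None


backlog = list(range(1200))          # 1200 pending PDUs
keys = ["room_key_1", "room_key_2"]  # placeholder to_device payloads

deferred = build_catchup_transactions(backlog, keys, defer_edus=True)
eager = build_catchup_transactions(backlog, keys, defer_edus=False)
print(first_txn_with_edus(deferred), first_txn_with_edus(eager))
```

In this model, a 1200-PDU backlog means the keys arrive only after 24 successful PDU transactions when EDUs are deferred, while the eager variant delivers them with the first transaction. If every one of those transactions has to be retried, the gap between receiving an encrypted event and receiving its key grows accordingly.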

This has been observed to happen on several catchups already, and only noticed today (with Synapse 1.22.0) - it's unclear if this is an issue in prior versions of synapse, or is a matrix.org federation sender-specific issue.

Version information

Homeserver: t2bot.io

If not matrix.org:

Version: 1.22.0 (with minor, unrelated, patches)
Install method: pip

Platform: Ubuntu 20.04, bare metal

erikjohnston · 2020-10-30T11:02:15Z

Looks like we prioritise device list updates first: https://github.com/matrix-org/synapse/blob/develop/synapse/federation/sender/per_destination_queue.py#L255-L266, which feels like it is probably the wrong way round.
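As a rough illustration of why the ordering matters (this is a sketch, not the code at the linked lines): a federation transaction may carry at most 100 EDUs, so whichever EDU type is taken first can use up the whole budget. `pick_edus` and its parameters are hypothetical names for this example; the EDU type strings are the ones defined in the Matrix spec.

```python
MAX_EDUS_PER_TXN = 100  # per-transaction EDU limit in the Matrix federation spec


def pick_edus(device_list_updates, to_device_messages, to_device_first,
              limit=MAX_EDUS_PER_TXN):
    """Choose up to `limit` EDUs for one transaction.

    The type listed first gets first claim on the budget; the other type
    only receives whatever room is left over.
    """
    if to_device_first:
        ordered = to_device_messages + device_list_updates
    else:
        ordered = device_list_updates + to_device_messages
    return ordered[:limit]


def count_type(edus, edu_type):
    """Count how many selected EDUs have the given type."""
    return sum(1 for kind, _ in edus if kind == edu_type)


# A lagging destination with many queued device list updates and a few keys.
updates = [("m.device_list_update", i) for i in range(150)]
messages = [("m.direct_to_device", i) for i in range(10)]

updates_first = pick_edus(updates, messages, to_device_first=False)
keys_first = pick_edus(updates, messages, to_device_first=True)
print(count_type(updates_first, "m.direct_to_device"),
      count_type(keys_first, "m.direct_to_device"))
```

With 150 queued device list updates, taking them first leaves no room for any of the 10 to_device messages in this transaction, whereas taking to_device messages first sends all 10 and still fits 90 device list updates.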

richvdh · 2020-11-30T15:36:24Z

this is a particular problem, because if your server spends a lot of time lagging behind, then you can end up receiving room events but never the e2e keys for those events.

verymilan · 2020-11-30T18:10:35Z

> This has been observed to happen on several catchups already, and only noticed today (with Synapse 1.22.0) - it's unclear if this is an issue in prior versions of synapse, or is a matrix.org federation sender-specific issue.

it could totally be a coincidence, but i actually feel like this happened after a recent synapse update (#8838)

richvdh · 2020-11-30T18:14:06Z

It's almost certainly been exacerbated by the fixes to #2528.

mtippmann · 2022-01-14T13:46:41Z

This nasty issue is still there. Our homeserver had 3 days of downtime, and it took about 4 days until messages from matrix.org could be decrypted. The messages came through instantly, but decryption was not possible due to missing keys from matrix.org.

We also run several federation_sender and generic worker instances, so it's unlikely that it's overload on our side due to the downtime. While debugging, we heard reports from other homeserver admins that they are familiar with such issues after longer downtimes.

Would be great if this could be fixed sometime.

erikjohnston added enhancement A-Federation labels Oct 30, 2020

richvdh mentioned this issue Nov 30, 2020

Unable to decrypt messages from matrix.org #8838

Closed

H-Shay added S-Minor (Blocks non-critical functionality, workarounds exist.) T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues.) labels Jan 14, 2022

DMRobertson added O-Occasional (Affects or can be seen by some users regularly or most users rarely) and removed z-enhancement labels Aug 25, 2022

matrixbot mentioned this issue Dec 21, 2023

Federation catchup doesn't send to_device EDUs until the remote end has caught up element-hq/synapse#8691

Open
