Faster joins: smarter algorithm for picking a server to resync from #12999

richvdh · 2022-06-09T08:04:30Z

Per #12813 (comment): when we restart resyncing a room, it can take a while before we find an appropriate server.

Lines 1517 to 1520 in 7c6b220

    
    # TODO(faster_joins): we need some way of prioritising which homeservers in
    #   `other_destinations` to try first, otherwise we'll spend ages trying dead
    #   homeservers for large rooms.
    #   https://github.com/matrix-org/synapse/issues/12999

richvdh · 2022-10-04T12:57:31Z

We should prioritise the server that we did the join through initially.

DMRobertson · 2022-10-07T15:50:32Z

To check: is the given list of other_destinations acceptable---we just need to prioritise that list? Or do we need to change the callers to provide a wider list of other_destinations?

I think we have some table that tracks "we failed to make a federation request from this server recently"---destinations? So should exclude anyone in that?

Other thoughts:

* Prioritise servers that people have sent events from recently
* Prioritise the server that sent the m.room.create event(?)

MadLittleMods · 2022-10-07T19:15:50Z

> I think we have some table that tracks "we failed to make a federation request from this server recently"---destinations? So should exclude anyone in that?

This happens automatically for anything that uses synapse/http/matrixfederationclient.py, because it uses the RetryDestinationLimiter (retryutils limiter). It does use the destinations table under the hood 👍

You just need to catch the NotRetryingDestination exception and continue to the next destination to try.
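A minimal sketch of that catch-and-continue loop, for illustration only. The `NotRetryingDestination` class below is a local stand-in for the Synapse exception of the same name, and `try_each_destination` and `fake_fetch` are hypothetical names, not Synapse functions:

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")


class NotRetryingDestination(Exception):
    """Local stand-in for the Synapse exception of the same name."""


async def try_each_destination(
    destinations: Iterable[str],
    callback: Callable[[str], Awaitable[T]],
) -> T:
    """Call `callback` on each destination in turn until one succeeds."""
    for destination in destinations:
        try:
            return await callback(destination)
        except NotRetryingDestination:
            # This server is currently backed off: skip it and try the next.
            continue
    raise RuntimeError("No destination could handle the request")


async def fake_fetch(destination: str) -> str:
    # Hypothetical callback: pretend "dead.example" is in backoff.
    if destination == "dead.example":
        raise NotRetryingDestination(destination)
    return f"state from {destination}"


result = asyncio.run(
    try_each_destination(["dead.example", "alive.example"], fake_fetch)
)
```

Here the backed-off server is skipped and the request is served by the second entry in the list.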

Also of interest is the generic pattern we have for this, _try_destination_list(...):

synapse/synapse/federation/federation_client.py

Lines 759 to 767 in 1bf2832

    
    async def _try_destination_list(
        self,
        description: str,
        destinations: Iterable[str],
        callback: Callable[[str], Awaitable[T]],
        failover_errcodes: Optional[Container[str]] = None,
        failover_on_unknown_endpoint: bool = False,
    ) -> T:
        """Try an operation on a series of servers, until it succeeds

> we just need to prioritise that list?

One of the heuristics we have around prioritizing destinations to try is

synapse/synapse/storage/databases/main/roommember.py

Lines 970 to 972 in 1bf2832

    
    The heuristic of sorting by servers who have been in the room the
    longest is good because they're most likely to have anything we ask
    about.

But that doesn't exactly solve the problem if the goal is to avoid long-gone dead servers. Does our NotRetryingDestination tracking already solve this though?
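For illustration, the longest-in-room heuristic amounts to something like the sketch below. The input shape (pairs of user ID and a join ordering number) and the function name are assumptions made for this example, not what roommember.py does internally:

```python
from typing import Dict, Iterable, List, Tuple


def servers_by_time_in_room(joins: Iterable[Tuple[str, int]]) -> List[str]:
    """Order servers by their earliest join, oldest first.

    `joins` holds hypothetical (user_id, join_order) pairs, where a lower
    join_order means the user joined the room earlier.
    """
    earliest: Dict[str, int] = {}
    for user_id, order in joins:
        # The server name is everything after the first colon in the user ID.
        server = user_id.split(":", 1)[1]
        if server not in earliest or order < earliest[server]:
            earliest[server] = order
    return sorted(earliest, key=lambda server: earliest[server])


ordered = servers_by_time_in_room(
    [("@a:one.example", 5), ("@b:two.example", 2), ("@c:one.example", 1)]
)
```

Note that this ordering says nothing about whether a server is still reachable, which is the gap raised above.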

> We should prioritise the server that we did the join through initially.

This makes sense since it's a recent alive server that had the state to authorize the join 👍

richvdh · 2022-10-10T13:37:24Z

Note that the particular problem here comes when we resume resyncing after a restart, which currently causes us to lose our place in the list of potential destinations (initial_destination will be None in this case).

So one solution here is just to arrange for initial_destination to be populated.

> To check: is the given list of other_destinations acceptable---we just need to prioritise that list? Or do we need to change the callers to provide a wider list of other_destinations?

other_destinations remains our best guess of servers that might be able to handle the request, so I think the answer is "yes, we just need to prioritise that list".

> I think we have some table that tracks "we failed to make a federation request from this server recently"---destinations? So should exclude anyone in that?

As @MadLittleMods says, the destinations table is already factored in, at a lower layer of abstraction. However, it's a very lossy abstraction because it assumes that all requests and all failure modes are created equal, which really isn't the case (cf #8917). But that's something we should aim to improve in general, rather than for one particular use case. So, I'd just ignore destinations for now and, as @MadLittleMods says, just make sure we catch the relevant exceptions and move on.

> Other thoughts:
>
> * Prioritise servers that people have sent events from recently

Possibly, though really this sort of thing is meant to be handled via the destinations mechanism (successfully sending a transaction removes your failure flag from the destinations table).
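As a rough picture of that mechanism, here is a toy in-memory backoff tracker. It is a sketch under stated assumptions: the class name, the interval values, and the doubling rule are all hypothetical, and the real destinations table and retry limiter differ in detail:

```python
from typing import Dict, Tuple


class BackoffTracker:
    """Toy model: a failure starts or extends a backoff, a success clears it."""

    def __init__(self, min_interval_ms: int = 1000, multiplier: int = 2) -> None:
        self.min_interval_ms = min_interval_ms
        self.multiplier = multiplier
        # destination -> (timestamp of last failure, current backoff interval)
        self._state: Dict[str, Tuple[int, int]] = {}

    def record_failure(self, destination: str, now_ms: int) -> None:
        _, interval = self._state.get(destination, (0, 0))
        new_interval = max(self.min_interval_ms, interval * self.multiplier)
        self._state[destination] = (now_ms, new_interval)

    def record_success(self, destination: str) -> None:
        # A successful request removes the failure flag entirely.
        self._state.pop(destination, None)

    def is_backed_off(self, destination: str, now_ms: int) -> bool:
        if destination not in self._state:
            return False
        last_failure, interval = self._state[destination]
        return now_ms < last_failure + interval


tracker = BackoffTracker()
tracker.record_failure("flaky.example", now_ms=0)
```

A server that has recently exchanged traffic successfully therefore has no backoff entry and is tried as normal.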

> * Prioritise the server that sent the m.room.create event(?)

I'm not sure that the server that sent the m.room.create event is much more or less likely to be reachable than any of the other servers in the room.

So: yes, let's just keep using other_destinations, but prioritise:

* the server that we joined through
* Ideally: the server that we most recently did a successful resync from.
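A sketch of what that prioritisation could look like. The helper name and its behaviour of including preferred servers even when they are missing from other_destinations are assumptions for illustration:

```python
from typing import Iterable, List


def prioritise_destinations(
    other_destinations: Iterable[str],
    preferred: Iterable[str],
) -> List[str]:
    """Put `preferred` servers first (in the caller's order), then the rest.

    Duplicates are dropped while keeping the first occurrence, so a preferred
    server that also appears in `other_destinations` is only tried once.
    """
    return list(dict.fromkeys([*preferred, *other_destinations]))


ordered = prioritise_destinations(
    ["a.example", "b.example", "joined-via.example"],
    preferred=["joined-via.example"],
)
```

The relative order of the remaining servers is left untouched, so any existing heuristic applied to other_destinations still holds for the tail of the list.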

DMRobertson · 2022-10-10T15:57:05Z

Discussed with @richvdh. Actions here:

* Catch NotRetryingDestination during the resync process so that we'll try another server rather than fail immediately.
* When we first partial-join to a room, persist the server we join via in the partial_state_rooms table. Read from this to determine the server that we joined through. Prioritise that server.
* Rich doesn't believe we'll ever call /state as part of the resync process for any event other than the join we requested in the beginning. Therefore there's no point tracking or persisting the server we have most recently synced state from.
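The persistence step could be sketched as follows, using an in-memory SQLite database. The two-column schema, the `joined_via` column name, and the helper functions are simplifications assumed for this example. The real partial_state_rooms table and Synapse's storage layer are not shown here:

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema for illustration only.
conn.execute(
    "CREATE TABLE partial_state_rooms (room_id TEXT PRIMARY KEY, joined_via TEXT)"
)


def store_partial_join(room_id: str, joined_via: str) -> None:
    """Record which server actioned the partial join for this room."""
    conn.execute(
        "INSERT INTO partial_state_rooms (room_id, joined_via) VALUES (?, ?)",
        (room_id, joined_via),
    )
    conn.commit()


def get_joined_via(room_id: str) -> Optional[str]:
    """Return the server we joined through, or None if nothing is recorded."""
    row = conn.execute(
        "SELECT joined_via FROM partial_state_rooms WHERE room_id = ?",
        (room_id,),
    ).fetchone()
    return row[0] if row else None


store_partial_join("!room:example.org", "joined-via.example")
```

After a restart, the resync code would read this value back and put that server at the front of the destination list.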

richvdh added this to the Faster joins (further work) milestone Jun 9, 2022

richvdh mentioned this issue Jun 9, 2022

Faster room joins: Resume state re-syncing after a Synapse restart #12813

Merged

squahtx added the A-Federated-Join (joins over federation generally suck) and T-Enhancement (New features, changes in functionality, improvements in performance, or user-facing enhancements) labels Jun 9, 2022

richvdh modified the milestones: Q3 2022: Faster joins: fix major known bugs for monoliths, Q4 2022: faster joins: worker-mode and remaining work Jul 20, 2022

kittykat mentioned this issue Oct 3, 2022

Improve time to join a remote room #14030

Closed

16 tasks

DMRobertson self-assigned this Oct 7, 2022

DMRobertson mentioned this issue Oct 10, 2022

When restarting a partial join resync, prioritise the server which actioned a partial join #14126

Merged

DMRobertson closed this as completed in #14126 Oct 18, 2022
