Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix MSC3030 /timestamp_to_event returning outliers that it has no idea whether are near a gap or not #14215

Merged

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Oct 17, 2022

Fix MSC3030 /timestamp_to_event endpoint returning outliers that it has no idea whether are near a gap or not (and therefore unable to determine whether it's actually the closest event). The reason Synapse doesn't know whether an outlier is next to a gap is because our gap checks rely on entries in the event_edges, event_forward_extremeties, and event_backward_extremities tables which is not the case for outliers.

Also fixes MSC3030 Complement can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint test flake. Although this acted flakey in Complement, if sync_partial_state raced and beat us before /timestamp_to_event, then even if we retried the failing /context request it wouldn't work until we made this Synapse change. With this PR, Synapse will never return an outlier event so that test will always go and ask over federation.

Fix #13944

Why did this fail before? Why was it flakey?

Sleuthing the server logs on the CI failure, it looks like hs2:/timestamp_to_event found $NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU as an outlier event locally. Then when we went and asked for it via /context, since it's an outlier, it was filtered out of the results -> You don't have permission to access that event.

This is reproducible when sync_partial_state races and persists $NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU as an outlier before we evaluate get_event_for_timestamp(...). To consistently reproduce locally, just add a delay at the start of get_event_for_timestamp(...) so it always runs after sync_partial_state completes.

from twisted.internet import task as twisted_task
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d

In a run where it passes, on hs2, get_event_for_timestamp(...) finds a different event locally which is next to a gap and we request from a closer one from hs1 which gets backfilled. And since the backfilled event is not an outlier, it's returned as expected during /context.

With this PR, Synapse will never return an outlier event so that test will always go and ask over federation.

Dev notes

SYNAPSE_POSTGRES=1 SYNAPSE_TEST_LOG_LEVEL=INFO python -m twisted.trial tests.rest.client.test_rooms.TimestampLookupTestCase

Complement:

COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint
COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

@MadLittleMods MadLittleMods added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Oct 17, 2022
@MadLittleMods MadLittleMods changed the title Fix /timestamp_to_event returning outliers it has no idea whether are near a gap or not Fix MSC3030 /timestamp_to_event returning outliers it has no idea whether are near a gap or not Oct 17, 2022
Comment on lines 2149 to 2156
/**
* Make sure the event isn't an `outlier` because we have no way
* to later check whether it's next to a gap. `outliers` do not
* have entries in the `event_edges`, `event_forward_extremeties`,
* and `event_backward_extremities` tables to check against
* (used by `is_event_next_to_backward_gap` and `is_event_next_to_forward_gap`).
*/
AND outlier = ? /* False */
Copy link
Contributor Author

@MadLittleMods MadLittleMods Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only new thing in get_event_id_for_timestamp(...) is here. The rest is just moving stuff around with the f-string

@MadLittleMods MadLittleMods force-pushed the madlittlemods/13944-fix-msc3030-returning-outliers branch from 8071e83 to aebdb22 Compare October 18, 2022 05:27
@MadLittleMods MadLittleMods force-pushed the madlittlemods/13944-fix-msc3030-returning-outliers branch from aebdb22 to 3c009c7 Compare October 18, 2022 05:30
@MadLittleMods MadLittleMods marked this pull request as ready for review October 18, 2022 06:27
@MadLittleMods MadLittleMods requested a review from a team as a code owner October 18, 2022 06:27
@DMRobertson DMRobertson changed the title Fix MSC3030 /timestamp_to_event returning outliers it has no idea whether are near a gap or not Fix MSC3030 /timestamp_to_event returning outliers that it has no idea whether are near a gap or not Oct 18, 2022
Copy link
Contributor

@H-Shay H-Shay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems sane to me.

@MadLittleMods MadLittleMods merged commit fa8616e into develop Oct 19, 2022
@MadLittleMods MadLittleMods deleted the madlittlemods/13944-fix-msc3030-returning-outliers branch October 19, 2022 00:46
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @H-Shay 🦏

@erikjohnston
Copy link
Member

Fix MSC3030 /timestamp_to_event endpoint returning outliers that it has no idea whether are near a gap or not (and therefore unable to determine whether it's actually the closest event). The reason Synapse doesn't know whether an outlier is next to a gap is because our gap checks rely on entries in the event_edges, event_forward_extremeties, and event_backward_extremities tables which is not the case for outliers.

Slight side note: ignoring outliers is a good thing to do here. Though we might (in future) want to do more here to actually check if we think we have events around that timestamp (rather than a gaping hole). Consider the case where the server was down for a few weeks (or left and rejoined the room), in that period we won't have any events obviously, but queries to this endpoint will currently return events from one of those two "chunks" of DAG either side of the hole. Instead, the server should try and detect this case and hit out over federation instead. This is hard right now, as we don't really track the "chunks" of DAG we have, so makes it hard to detect when we have holes.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Complement can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint is flakey(?)
4 participants