Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

run_background_updates doesn't give up on faulty background updates, opting instead to fill the disk with StopIteration exceptions #12780

Closed
reivilibre opened this issue May 18, 2022 · 0 comments · Fixed by #12781

Comments

@reivilibre
Copy link
Contributor

This part of run_background_updates:

while self.enabled:
try:
result = await self.do_next_background_update(sleep)
except Exception:
logger.exception("Error doing update")
else:
if result:
logger.info(
"No more background updates to do."
" Unscheduling background update task."
)
self._all_done = True
return None

will repeatedly run a faulty background update until the disk fills up from the log spam.

For a simple reproduction case, apply this:

diff --git a/synapse/storage/background_updates.py b/synapse/storage/background_updates.py
index 3cd4a78f0..0db89d52b 100644
--- a/synapse/storage/background_updates.py
+++ b/synapse/storage/background_updates.py
@@ -393,7 +393,7 @@ class BackgroundUpdater:
         # We have a background update to run, otherwise we would have returned
         # early.
         assert self._current_background_update is not None
-        update_info = self._background_update_handlers[self._current_background_update]
+        update_info = self._background_update_handlers[self._current_background_update + 'x']
 
         async with self._get_context_manager_for_update(
             sleep=sleep,

then try and run a trial test.
You'll get reams and reams of:

Traceback (most recent call last):
  File "/home/rei/work/synapse/synapse/storage/background_updates.py", line 280, in run_background_updates
    result = await self.do_next_background_update(sleep)
  File "/home/rei/work/synapse/synapse/storage/background_updates.py", line 396, in do_next_background_update
    update_info = self._background_update_handlers[self._current_background_update + 'x']
KeyError: 'add_rooms_room_version_columnx'
2022-05-18 11:25:06,177 - synapse.storage.background_updates - 282 - ERROR - run_bg_updates - Error doing update
Traceback (most recent call last):
  File "/home/rei/.cache/pypoetry/virtualenvs/matrix-synapse-0QZlT1cR-py3.10/lib/python3.10/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
StopIteration: [{'update_name': 'add_rooms_room_version_column', 'depends_on': None}, {'update_name': 'current_state_events_membership', 'depends_on': None}, {'update_name': 'delete_old_current_state_events', 'depends_on': 'current_state_events_membership'}, {'update_name': 'devices_last_seen', 'depends_on': None}, {'update_name': 'event_store_labels', 'depends_on': None}, {'update_name': 'insert_room_retention', 'depends_on': None}, {'update_name': 'populate_stats_process_users', 'depends_on': 'populate_stats_process_rooms'}, {'update_name': 'redactions_have_censored_ts_idx', 'depends_on': None}, {'update_name': 'redactions_received_ts', 'depends_on': None}, {'update_name': 'remove_tombstoned_rooms_from_directory', 'depends_on': None}, {'update_name': 'room_membership_forgotten_idx', 'depends_on': None}, {'update_name': 'state_groups_room_id_idx', 'depends_on': None}, {'update_name': 'users_set_deactivated_flag', 'depends_on': None}, {'update_name': 'remove_dup_outbound_pokes', 'depends_on': None}, {'update_name': 'populate_stats_process_rooms', 'depends_on': None}, {'update_name': 'users_have_local_media', 'depends_on': None}, {'update_name': 'e2e_cross_signing_keys_idx', 'depends_on': None}, {'update_name': 'user_external_ids_user_id_idx', 'depends_on': None}, {'update_name': 'rejected_events_metadata', 'depends_on': None}, {'update_name': 'chain_cover', 'depends_on': 'rejected_events_metadata'}, {'update_name': 'remove_deactivated_pushers', 'depends_on': None}, {'update_name': 'remove_stale_pushers', 'depends_on': None}, {'update_name': 'purged_chain_cover', 'depends_on': None}, {'update_name': 'populate_rooms_creator_column', 'depends_on': None}, {'update_name': 'remove_deleted_email_pushers', 'depends_on': None}, {'update_name': 'presence_stream_not_offline_index', 'depends_on': None}, {'update_name': 'remove_hidden_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'local_group_updates_index', 'depends_on': None}, {'update_name': 'remove_deleted_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'event_arbitrary_relations', 'depends_on': None}, {'update_name': 'remove_dead_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'delete_account_data_for_deactivated_users', 'depends_on': None}, {'update_name': 'event_search_sqlite_delete_non_strings', 'depends_on': None}, {'update_name': 'state_group_edges_unique_idx', 'depends_on': None}]

During handling of the above exception, another exception occurred:

(repeated for a very long time..)

My PR #12687 triggers this but only in SyTest and only in CI, for whatever reason. The problem is that filling up the disk is a dreadful failure mode for CI, as that prevents you from getting access to the logs. (The best approach I've found is to run a script in parallel that will kill SyTest after a fixed time.)

It's not clear to me that spinning on an error is the correct behaviour. Shouldn't we just bubble it up? Perhaps after a fixed number of retries (I'm not sure why it retries, but for the sake of it I could be tempted to only bail if there have been e.g. 5 back-to-back failures.)?

I don't really know what the StopIteration is about; I suspect it's related to Twisted's _inlineCallbacks though.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant