My HS has accumulated thousands of unreferenced state groups #3364

richvdh · 2018-06-07T00:06:50Z

... which are filling up my disk :(

To check if you are also affected, run this query:

select count(*) from state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;

If the count is in the thousands, you are affected by this issue. Otherwise, you are not.
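For anyone who wants to see what this query matches before running it against a live database, here is a minimal sketch using Python's sqlite3 and a toy schema. The three tables are reduced to the columns the query touches, and the room ID, event IDs and helper names (build_demo, count_unreferenced) are made up for illustration; this is not Synapse's real schema or code.

```python
import sqlite3

# Toy stand-in for three Synapse tables. Only the columns the query
# touches are modelled (assumption: the real tables have more columns).
SCHEMA = """
CREATE TABLE state_groups (id INTEGER PRIMARY KEY, room_id TEXT);
CREATE TABLE event_to_state_groups (event_id TEXT, state_group INTEGER);
CREATE TABLE state_group_edges (state_group INTEGER, prev_state_group INTEGER);
"""

# Same query as above: a state group is "unreferenced" when no event points
# at it and no other state group uses it as its delta parent.
UNREFERENCED_COUNT = """
SELECT count(*) FROM state_groups sg
    LEFT JOIN event_to_state_groups esg ON esg.state_group = sg.id
    LEFT JOIN state_group_edges e ON e.prev_state_group = sg.id
WHERE esg.state_group IS NULL AND e.prev_state_group IS NULL
"""


def count_unreferenced(conn):
    """Run the check query and return the number of unreferenced groups."""
    return conn.execute(UNREFERENCED_COUNT).fetchone()[0]


def build_demo():
    """Build an in-memory database with four hypothetical state groups."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO state_groups VALUES (?, ?)",
        [(1, "!room"), (2, "!room"), (3, "!room"), (4, "!room")],
    )
    # SG1 and SG3 are each the state of some event.
    conn.executemany(
        "INSERT INTO event_to_state_groups VALUES (?, ?)",
        [("$a", 1), ("$b", 3)],
    )
    # SG2 and SG3 are deltas on SG1; SG4 is a delta on SG3.
    # SG2 and SG4 have no events and no children.
    conn.executemany(
        "INSERT INTO state_group_edges VALUES (?, ?)",
        [(2, 1), (3, 1), (4, 3)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    print(count_unreferenced(build_demo()))
```

In this toy data only SG2 and SG4 are counted: SG1 and SG3 are both referenced by an event, and each is also the delta parent of another group.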

richvdh · 2018-06-07T01:28:15Z

Worse, when I purge history, the unreferenced state groups start turning into non-delta state groups, which compounds the problem.

richvdh · 2018-06-07T09:20:17Z

redundant state groups:

create temporary table unreferenced_state_groups as 
select sg.id, sg.room_id from
    state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;

(empirically most of them seem to be coming from HQ)

krombel · 2018-06-07T17:02:16Z

I seem to have them as well. I created that table on my system as well and get the following response:

synapse=# SELECT COUNT(*) FROM state_groups_to_drop;
 count
-------
  2272
(1 row)

Just to note: I have not run any purge commands yet

richvdh · 2018-07-30T18:30:56Z

#3625 might be related

richvdh · 2018-09-03T14:49:57Z

neil I don't think this can be a p2; it's a real blocker on cleaning up disk space

richvdh · 2018-09-04T22:07:22Z

I've done a little bit of digging into why this happens. Other than #3625, another cause (which probably bites matrix.org heavily, but others less so) is #3791.

richvdh · 2018-09-10T11:20:10Z

Another occasion that (I think) this happens is when we have a fork in the DAG, with different state on the two sides of the fork, and the next event (which heals the fork) is itself another state event. We create a new state group when we state-resolve the two sides for the fork (which is important for caching state res), but that SG is never actually (directly) used because we then create another SG to include the updated state.

sargon · 2018-09-16T12:14:53Z

We also have a fairly large, disk-filling database (~45 GB) with ~40 users. We started purging history monthly some time ago, so the DB should contain only the last 365 days of data, with a slop of 30 days. I was curious how many tuples in state_groups_state would be affected in our database, so I extended your temporary table query a little:

create temporary table unreferenced_state_groups as 
select sg.id, sg.room_id, count(sgs.*) as cgs_cnt from
    state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
    left join state_groups_state sgs on sgs.state_group = sg.id
where esg.state_group is null and e.prev_state_group is null
group by sg.id;
select sum(cgs_cnt) from unreferenced_state_groups;

This resulted in 1.388.475 affected tuples, which is almost nothing compared with the 84.141.600 tuples in the table. So this is definitely a thing, but my guess is that we have other waste in that database. Or is this a "normal/to be expected" size?

richvdh · 2018-09-17T11:28:21Z

@sargon:

> Which resulted in 1.388.475 affected tuples, which is kind of nothing in contrast to 84.141.600 tuples in the table.

Those 1.3M tuples will just be the deltas from the previous state groups, probably only one or two rows per state group. The problem comes when a state group is removed: any other state group which references it has to be converted from delta storage to absolute storage, i.e. we have to store every single state event for the room for each of those state groups.

Suppose we have three state groups in a room, 1, 2, and 3. 1 is the first state group, and 2 and 3 are both stored as deltas from 1:

  1
 /  \
2    3

SG1 and SG3 are both used for a number of events in the room, but as per this bug, SG2 is unused. Now we purge some events from this room. SG1 and SG3 are detected as unused and deleted. However, SG2 is losing its parent, so needs "de-deltaing".

Multiply this effect by 1.3M, and you have a real problem.
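
To make the blow-up concrete, here is a toy model of delta storage in Python. It is only a sketch: state groups are held in a dict rather than in state_groups_state, the helpers (resolve, delete_group, stored_rows, build_room) are hypothetical, and the sizes (1000 state events, 50 unused delta groups hanging off SG1) are invented numbers.

```python
# Each state group maps to (rows, prev): `rows` is what is physically stored
# for the group, `prev` is the group it is a delta from (None = absolute).


def resolve(groups, sg_id):
    """Return the full state of a group: parent state overlaid with its rows."""
    rows, prev = groups[sg_id]
    state = dict(resolve(groups, prev)) if prev is not None else {}
    state.update(rows)
    return state


def delete_group(groups, sg_id):
    """Delete a group, first 'de-deltaing' every child that points at it."""
    for child, (_rows, prev) in list(groups.items()):
        if prev == sg_id:
            # The child loses its parent, so it must store its full state.
            groups[child] = (resolve(groups, child), None)
    del groups[sg_id]


def stored_rows(groups):
    """Total number of rows physically stored across all groups."""
    return sum(len(rows) for rows, _prev in groups.values())


def build_room(n_state=1000, n_unused=50):
    """SG1 is absolute; SG3 is a used delta; SG2, SG4, ... are unused deltas."""
    base = {
        ("m.room.member", f"@user{i}:example.org"): f"$join{i}"
        for i in range(n_state)
    }
    groups = {1: (base, None)}
    groups[3] = ({("m.room.topic", ""): "$topic"}, 1)
    unused = [2] + list(range(4, 4 + n_unused - 1))
    for sg in unused:
        # Each unused group changes one existing membership entry.
        groups[sg] = ({("m.room.member", "@user0:example.org"): f"$leave{sg}"}, 1)
    return groups


if __name__ == "__main__":
    room = build_room()
    print("rows before purge:", stored_rows(room))
    delete_group(room, 3)  # purge removes the groups that events used
    delete_group(room, 1)
    print("rows after purge:", stored_rows(room))
```

With these made-up numbers the room stores 1051 rows before the purge and 50000 afterwards, because each of the 50 unused groups now carries a full copy of the room state.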

ghost · 2019-06-27T23:52:14Z

hi,

i believe i'm facing the same problem described a year earlier in this issue: the whole database weighs 14 GB (only 7 registered users, no huge rooms joined...)

# SELECT pg_size_pretty( pg_database_size('matrix_prod') );
 pg_size_pretty 
----------------
 14 GB
(1 row)

here are the biggest tables:

matrix_prod=# select schemaname as table_schema,
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as data_size,
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid))
      as external_size
from pg_catalog.pg_statio_user_tables
order by pg_total_relation_size(relid) desc,
         pg_relation_size(relid) desc
limit 10;
 table_schema |        table_name         | total_size | data_size | external_size 
--------------+---------------------------+------------+-----------+---------------
 public       | state_groups_state        | 4724 MB    | 3134 MB   | 1590 MB
 public       | event_json                | 2857 MB    | 2502 MB   | 354 MB
 public       | received_transactions     | 1221 MB    | 697 MB    | 524 MB
 public       | stream_ordering_to_exterm | 1193 MB    | 672 MB    | 520 MB
 public       | event_auth                | 907 MB     | 633 MB    | 274 MB
 public       | events                    | 811 MB     | 366 MB    | 445 MB
 public       | event_edges               | 746 MB     | 261 MB    | 485 MB
 public       | room_memberships          | 527 MB     | 284 MB    | 243 MB
 public       | event_reference_hashes    | 429 MB     | 200 MB    | 229 MB
 public       | state_events              | 312 MB     | 221 MB    | 91 MB
(10 rows)

isn't there something to do? it's labeled P1 and i think truly critical.

-- edit 10 days later
DB weight is now 16 GB 😩

sargon · 2019-11-08T09:17:37Z

Coming back to this topic.
We hit the magical ~100 GB table size last week. I got it under control by running compress-state on every room, which took us only 4 days. After a VACUUM FULL the size shrank to around 22 GB (only the state_groups_state table).
As far as I know, Synapse has been patched to clean up the loose ends during history purging, so that part is contained. But ...

I just ran the queries from above and they still find unreferenced state groups (~10k). Since my knowledge of the database schema is close to nonexistent, could you please provide a query to safely get rid of those rows?

grinapo · 2020-03-20T17:48:47Z

(Sidenote: irc bridged rooms are far beyond MatrixHQ now, with m.room.member events all over the place.)

richvdh · 2020-05-01T14:36:55Z

Another factor in this is that, as of #6320, we now create a new state group for any new state event which is submitted via the C-S API, even if that event is not accepted.

richvdh · 2020-05-01T15:58:22Z

The long and the short of this is that I think we need a script which will gradually walk the state_groups table, looking for redundant state groups and removing them.

(it would also be nice to stop some of the state groups being created in the first place, but that's a bit harder.)
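
A rough sketch of what such a script could look like, written against a toy SQLite schema rather than a real Synapse database. The batch size, the function names (purge_pass, purge_until_stable) and the reduced table definitions are assumptions for illustration. As discussed further down this thread, deleting state groups is only safe while Synapse is shut down.

```python
import sqlite3

# Reduced, hypothetical versions of the four tables involved.
SCHEMA = """
CREATE TABLE state_groups (id INTEGER PRIMARY KEY, room_id TEXT);
CREATE TABLE event_to_state_groups (event_id TEXT, state_group INTEGER);
CREATE TABLE state_group_edges (state_group INTEGER, prev_state_group INTEGER);
CREATE TABLE state_groups_state (
    state_group INTEGER, type TEXT, state_key TEXT, event_id TEXT
);
"""

# Unreferenced groups within one id window: no event uses them and no other
# group is stored as a delta from them.
FIND_BATCH = """
SELECT sg.id FROM state_groups sg
    LEFT JOIN event_to_state_groups esg ON esg.state_group = sg.id
    LEFT JOIN state_group_edges e ON e.prev_state_group = sg.id
WHERE sg.id > ? AND sg.id <= ?
  AND esg.state_group IS NULL AND e.prev_state_group IS NULL
"""


def purge_pass(conn, batch_size=1000):
    """Walk state_groups once in id order; return how many groups were deleted."""
    low, max_id = conn.execute(
        "SELECT coalesce(min(id), 1) - 1, coalesce(max(id), 0) FROM state_groups"
    ).fetchone()
    deleted = 0
    while low < max_id:
        high = low + batch_size
        ids = [row[0] for row in conn.execute(FIND_BATCH, (low, high))]
        for sg in ids:
            conn.execute("DELETE FROM state_groups_state WHERE state_group = ?", (sg,))
            conn.execute("DELETE FROM state_group_edges WHERE state_group = ?", (sg,))
            conn.execute("DELETE FROM state_groups WHERE id = ?", (sg,))
        conn.commit()  # one small transaction per batch
        deleted += len(ids)
        low = high
    return deleted


def purge_until_stable(conn, batch_size=1000):
    """Repeat passes until a pass deletes nothing; return the total deleted."""
    total = 0
    while True:
        n = purge_pass(conn, batch_size)
        if n == 0:
            return total
        total += n
```

The repeat-until-stable loop is there because deleting an unreferenced group also removes its edge row, which can leave its parent unreferenced in turn. Only leaf groups are ever deleted, so nothing needs de-deltaing.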

carlbordum · 2022-04-20T19:21:16Z

cactus.chat is heavily affected by this, so it is probably caused by bots/bridges/appservices. Our homeserver is strange, because there are no users, only guests and mostly room.membership events.

synapse=# select count(*) from state_groups sg
synapse-#     left join event_to_state_groups esg on esg.state_group=sg.id
synapse-#     left join state_group_edges e on e.prev_state_group=sg.id
synapse-# where esg.state_group is null and e.prev_state_group is null;
  count  
---------
 6504021

BBaoVanC · 2022-04-20T22:56:39Z

Guests count as users

f0x52 · 2022-10-05T21:20:41Z

I have about 1.5 million unreferenced state groups right now, is there a recommended way to deal with them?

richvdh · 2022-10-06T15:09:20Z

The recommended way to remove unreferenced state groups is via https://github.com/erikjohnston/synapse-find-unreferenced-state-groups

aaronraimist · 2022-10-06T17:01:04Z

The README for that tool still says "Do not blindly delete all the state groups that are returned by this tool" though.

richvdh · 2022-10-06T17:14:38Z

> The README for that tool still says "Do not blindly delete all the state groups that are returned by this tool" though.

Indeed. Shut down synapse first. Or omit the last, say, 100 results from that tool.

I didn't say it was a good solution to the problem. Just that it's a way to deal with it.

intelfx · 2022-10-06T17:17:24Z

So, as long as synapse is not running during the whole cleanup process, the output of the tool can be used blindly?

richvdh · 2022-10-06T17:20:30Z

> So, as long as synapse is not running during the whole cleanup process, the output of the tool can be used blindly?

Yes.

richvdh · 2023-09-25T08:24:59Z

Once again: please keep the conversation on topic. General grumbling about the size of the database is off-topic; as is anything that is improved by https://github.com/matrix-org/rust-synapse-compress-state. This issue is specifically about an accumulation of unreferenced state groups.

inistor · 2023-12-07T15:58:29Z

Could you please confirm that, as of Synapse v1.97.0 the https://github.com/erikjohnston/synapse-find-unreferenced-state-groups tool is still the recommended way to purge unreferenced state groups?

neilisfragile assigned erikjohnston Jun 8, 2018

neilisfragile added z-p2 (Deprecated Label) z-minor (Deprecated Label) labels Jun 22, 2018

richvdh added a commit that referenced this issue Jul 23, 2018

Handle delta_ids being None in _update_context_for_auth_events

c1f80ef

it's easier to create the new state group as a delta from the existing one. (There's an outside chance this will help with #3364)

richvdh added the A-Disk-Space things which fill up the disk label Aug 8, 2018

richvdh added z-bug (Deprecated Label) p1 z-major (Deprecated Label) and removed z-minor (Deprecated Label) z-p2 (Deprecated Label) labels Sep 3, 2018

sargon mentioned this issue Mar 27, 2020

State groups relation schema #7156

Closed

erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Aug 23, 2021

jffrancob mentioned this issue Mar 3, 2022

Initial sync is taking a long time to finish #12158

Closed

richvdh mentioned this issue May 23, 2022

state_groups (& state_groups_state & state_group_edges) are not fully purged alongside the rooms #12821

Open

richvdh mentioned this issue Jul 29, 2022

Sudden CPU spike to 100% and disk space decreasing fast - history purges #13361

Open

dklimpel mentioned this issue Aug 8, 2022

Deleting events causes database corruption #13476

Closed

MadLittleMods added the A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db label Aug 24, 2022

rubo77 mentioned this issue Oct 5, 2022

Remove unreferenced state groups matrix-org/rust-synapse-compress-state#105

Closed

aaronraimist added a commit to aaronraimist/synapse-find-unreferenced-state-groups that referenced this issue Oct 6, 2022

Clarify when it is okay to delete state groups

cb9ccad

Based on matrix-org/synapse#3364 (comment)

aaronraimist mentioned this issue Oct 6, 2022

Clarify when it is okay to delete state groups erikjohnston/synapse-find-unreferenced-state-groups#8

Open

richvdh mentioned this issue Feb 6, 2023

We (think that we) leak state groups during partial state resyncs #15000

Open

MomentQYC mentioned this issue Apr 26, 2023

PgSQL table state_groups_state is too large #15493

Closed

pmaier1 added the roadmap label Jun 2, 2023

matrixbot mentioned this issue Dec 21, 2023

My HS has accumulated thousands of unreferenced state groups element-hq/synapse#3364

Open
