
Admin API - Users' message event usage statistics #11871

Open
buffless-matt opened this issue Feb 1, 2022 · 10 comments
Labels: A-Admin-API, T-Enhancement

buffless-matt (Contributor) commented Feb 1, 2022

Are there any plans to expand the pre-existing admin API offerings for statistics?

E.g. while the pre-existing "Users' media usage statistics" offering covers media, it doesn't cover message events. I'm thinking about this through a lens of:

  • Keeping track of users and/or rooms that consume the most resources (e.g. top 10 users/rooms).
  • Yes, a media datum will typically be larger than a message event, but if there are users and/or rooms emitting large quantities of message events (e.g. IoT bots), that message event volume could build up.

Edit (2022-03-24): It appears as though some functionality relating to this had existed.

clokep added the T-Enhancement and A-Admin-API labels on Feb 1, 2022
clokep (Member) commented Feb 1, 2022

@buffless-matt I think we would accept PRs for expanding the information available about users, especially as I can see how those could be used to try to find abusive accounts. There are currently no plans to add such a feature, though.

buffless-matt (Contributor, Author) commented

I'm planning on making two PRs for this:

  1. Similar to the media end-point mentioned (above), but for events sent by users.
  2. Grouping by room, to quantify events received via federation (i.e. from other homeservers).

I'm not sure how to calculate a fair approximation of event storage, but I've come up with the following queries (respectively):

-- 1. Events sent by users
SELECT p.displayname AS displayname,
       COUNT(ej.json) AS events_count,
       SUM(octet_length(ej.json)) AS events_length,
       e.sender AS user_id
FROM events e
JOIN event_json ej ON (e.event_id = ej.event_id)
JOIN users u ON (e.sender = u.name)
LEFT JOIN profiles p ON (u.name = '@' || p.user_id || ':' || ?HOSTNAME?)
GROUP BY p.displayname,
         e.sender;

-- 2. Events received via federation (i.e. from other homeservers), grouped by room
SELECT e.room_id AS room_id,
       r.name AS room_name,
       COUNT(ej.json) AS events_count,
       SUM(octet_length(ej.json)) AS events_length
FROM events e
JOIN event_json ej ON (e.event_id = ej.event_id)
JOIN room_stats_state r ON (e.room_id = r.room_id)
LEFT JOIN users u ON (u.name = e.sender)
WHERE u.name IS NULL
GROUP BY e.room_id,
         r.name;

Note: I'm aware that the JSON string byte-length calculation will need to branch (at run time) based on the storage engine used (e.g. SQLite will need something like length(CAST(field AS BLOB))).
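
For illustration, the two dialect-specific expressions might look something like this (a sketch only, using event_json.json as in the queries above):

-- PostgreSQL: octet_length() returns the byte length of the JSON text directly
SELECT SUM(octet_length(json)) FROM event_json;

-- SQLite: length() counts characters on TEXT values, so cast to BLOB
-- first to count bytes instead
SELECT SUM(length(CAST(json AS BLOB))) FROM event_json;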

Could I get some feedback on the above please?

reivilibre (Contributor) commented Mar 17, 2022

Hi Matt,

This statistic (count of events, and count of event bytes per-user) used to exist, but I think it got removed because it wasn't used. I don't know if the background job that incrementally tracked these things still exists for some other metrics or whether it got ripped out too.

I will just note that these queries will be very slow on realistic homeservers (and I'm not sure what joining to profiles is gaining you). For the first case, I come up with these:

-- Count + Bytes
SELECT COUNT(ej.json) AS events_count, 
       SUM(octet_length(ej.json)) AS events_length, 
       e.sender AS user_id
FROM events e
JOIN event_json ej USING (event_id)
LEFT JOIN users u ON (e.sender = u.name)
GROUP BY e.sender;


-- Count (try your best to keep it as an index-only scan)
SELECT COUNT(e.sender) AS events_count, 
       e.sender AS user_id
FROM events e
JOIN users u ON (e.sender = u.name)
GROUP BY e.sender;

Note that events have a ~65 kiB size limit so you may not need to fret about the exact consumed size too much.
(the count-only query is much much faster [few seconds rather than minutes on my homeserver] as it doesn't need to touch the event_json table and I'm not sure it even needs to touch events, since it might be an index-only query, but I haven't spent long enough to check).
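
If someone wants to check, inspecting the query plan should settle whether it stays index-only (PostgreSQL syntax; just a sketch):

-- Look for "Index Only Scan" (rather than "Seq Scan") in the plan output
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(e.sender) AS events_count,
       e.sender AS user_id
FROM events e
JOIN users u ON (e.sender = u.name)
GROUP BY e.sender;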

buffless-matt (Contributor, Author) commented

Hi @reivilibre,

Thanks for the feedback!

> This statistic (count of events, and count of event bytes per-user) used to exist, but I think it got removed because it wasn't used. I don't know if the background job that incrementally tracked these things still exists for some other metrics or whether it got ripped out too.

Interesting, I'll dig around through the history and see what I can find.

> (and I'm not sure what joining to profiles is gaining you)

I was aiming to provide output similar to the pre-existing media stats end-point (and hence followed a similar implementation approach).

> Note that events have a ~65 kiB size limit so you may not need to fret about the exact consumed size too much.
> (the count-only query is much much faster [few seconds rather than minutes on my homeserver] as it doesn't need to touch the event_json table and I'm not sure it even needs to touch events, since it might be an index-only query, but I haven't spent long enough to check).

This is a good point, thanks! I agree that the count should be enough, so I'll aim for index-only queries (assuming there isn't already something else buried in the code that I can use, which you alluded to above).

buffless-matt (Contributor, Author) commented

> > This statistic (count of events, and count of event bytes per-user) used to exist, but I think it got removed because it wasn't used. I don't know if the background job that incrementally tracked these things still exists for some other metrics or whether it got ripped out too.
>
> Interesting, I'll dig around through the history and see what I can find.

I presume #9602 is what you were referring to and it looks to me (from the linked PR) that the tracking got ripped out too. Is that right?

> ...so I'll aim for index-only queries...

Unfortunately there is no index on the sender column of the events table, so even with no joins (but GROUP BY on sender), a heap scan is required.
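
For reference, a quick way to confirm this on PostgreSQL (a sketch):

-- List any indexes on the events table that cover the sender column
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'events'
  AND indexdef LIKE '%sender%';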

Assuming we're happy to go down this road, is an index on sender feasible, or should we look at going back to incremental tracking?

P.S. I've linked an EMS PR to this issue (to hopefully share some context on EMS use-cases for these proposed admin end-points).

reivilibre (Contributor) commented

> I presume #9602 is what you were referring to and it looks to me (from #9721) that the tracking got ripped out too. Is that right?

Looks like you came to the right conclusion, yeah! The design of that code changed over time, so I forgot how it was actually implemented until looking at it just now; some metrics used to be bucketed into time slices, and total_event and total_event_bytes were among those. The code that tracks metrics in a time-sliced way was removed, so perhaps one answer here is to add it as a metric for which we always track the 'latest' state (rather than time slices).

> Assuming we're happy to go down this road, is an index on sender feasible, or should we look at going back to incremental tracking?

I think incrementally tracking it may be less expensive than adding an index... events is a massive table so I would normally be afraid of adding an index (due to disk space).

On my personal homeserver, I tried both a hash index and a B-tree index (I named it events_sender).

  • Context for librepush.net
    • Number of events rows: 3293302
    • pg_size_pretty(pg_total_relation_size('events')) before adding an index: 1793 MB
  • Hash index CREATE INDEX CONCURRENTLY events_sender ON events USING hash (sender);
    • pg_size_pretty(pg_total_relation_size('events')) after adding an index: 1936 MB
    • pg_size_pretty(pg_relation_size('events_sender')) (index size): 144 MB
  • B-tree index CREATE INDEX CONCURRENTLY events_sender ON events USING btree (sender);
    • pg_size_pretty(pg_total_relation_size('events')) after adding an index: 1820 MB
    • pg_size_pretty(pg_relation_size('events_sender')) (index size): 27 MB

I was quite surprised by how small these were, especially the B-tree, which I thought would have been larger than a hash index!
Overall it comes to about a 2% increase in used disk space.
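
The measurements above can be reproduced along these lines (PostgreSQL, using the B-tree variant):

-- Table + index footprint before
SELECT pg_size_pretty(pg_total_relation_size('events'));

-- Build the index without blocking writes
CREATE INDEX CONCURRENTLY events_sender ON events USING btree (sender);

-- Footprint after, and the size of the index on its own
SELECT pg_size_pretty(pg_total_relation_size('events'));
SELECT pg_size_pretty(pg_relation_size('events_sender'));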

Frankly, incremental tracking will still likely make more sense if you only care about the number of rows (though it is more complex to maintain), but I was surprised to see that this is workable.
I suppose this may be worth mentioning in the discussion later, since this ticket is up for discussion anyway.

callahad (Contributor) commented

We're not enthusiastic about adding more indexes to the events table, so we suspect incremental tracking in a separate metrics table would be more palatable.

reivilibre (Contributor) commented

room_stats_current would be a suitable place to track that, along with user_stats_current. There's a background process that follows the events stream to update these statistics; I think that's what you want to tap into (or at least the approach to take).
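
As a rough sketch of the shape this could take (a hypothetical total_events column; the real user_stats_current schema would need extending, and its other columns handled):

-- Hypothetical: bump a per-user event counter as the stats process
-- consumes a batch of events from the stream
INSERT INTO user_stats_current (user_id, total_events)
VALUES (?, ?)
ON CONFLICT (user_id)
DO UPDATE SET total_events = user_stats_current.total_events + excluded.total_events;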

buffless-matt (Contributor, Author) commented

> room_stats_current would be a suitable place to track that, along with user_stats_current. There's a background process that follows the events stream to update these statistics; I think that's what you want to tap into (or at least the approach to take).

Thank you. Which background process are you referring to by the way? Is it this one?

erikjohnston (Member) commented

> > room_stats_current would be a suitable place to track that, along with user_stats_current. There's a background process that follows the events stream to update these statistics; I think that's what you want to tap into (or at least the approach to take).
>
> Thank you. Which background process are you referring to by the way? Is it this one?

Yup, that is the place we update those stats.
