
Is Prometheus suitable for monitoring SRS media streams? #3141

Closed
qiantaossx opened this issue Aug 10, 2022 · 6 comments
Assignees: winlinvip
Labels: Discussion (Discussion or questions), TransByAI (Translated by AI/GPT), Won't fix (We won't fix it)

qiantaossx commented Aug 10, 2022

Our scenario is to collect per-stream statistics such as bitrate and fps in real time. We have deployed our own Prometheus instance, though it is limited by the storage capacity of a single machine, and we use Grafana for visualization.

In practice we found the data volume is too large, and Prometheus quickly hits performance bottlenecks. We would like to discuss whether Prometheus is only suitable for aggregate, system-wide metrics, and not for monitoring the status of each individual stream.

TRANS_BY_GPT3

@winlinvip winlinvip self-assigned this Aug 15, 2022
@winlinvip winlinvip added the Discussion Discussion or questions. label Aug 15, 2022
winlinvip (Member) commented Aug 15, 2022

Looking at the example in the Prometheus documentation, Use Labels:

To give you a better idea of the underlying numbers, let's look at node_exporter. node_exporter exposes metrics for every mounted filesystem. Every node will have in the tens of timeseries for, say, node_filesystem_avail. If you have 10,000 nodes, you will end up with roughly 100,000 timeseries for node_filesystem_avail, which is fine for Prometheus to handle.

If you were to now add quota per user, you would quickly reach a double digit number of millions with 10,000 users on 10,000 nodes. This is too much for the current implementation of Prometheus. Even with smaller numbers, there's an opportunity cost as you can't have other, potentially more useful metrics on this machine any more.

The point is: for the metric node_filesystem_avail, with ten thousand machines and roughly ten time series each, there are about one hundred thousand series in total, which Prometheus handles comfortably. But if you also collect a per-user quota for ten thousand users on ten thousand machines, that is on the order of one hundred million series, which is beyond its capacity.

In other words, use labels rather than creating a separate metric for each stream. A system typically has only a few dozen metrics, not hundreds or thousands.
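The arithmetic above can be checked directly (a sketch; the figure of 10 series per node is the example's rough average, not an exact number):

```python
# Back-of-the-envelope series counts for the node_exporter example above.
nodes = 10_000
filesystems_per_node = 10   # "in the tens" of series per node; 10 as a round figure
users = 10_000

series_filesystem = nodes * filesystems_per_node  # one series per (node, mountpoint)
series_user_quota = nodes * users                 # one series per (node, user)

print(series_filesystem)   # 100000 -- fine for Prometheus
print(series_user_quota)   # 100000000 -- far beyond a single Prometheus server
```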

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.

To categorize a metric, use labels. For example, instead of defining two metrics, http_responses_500_total and http_responses_403_total, define a single metric http_responses_total with a code label.
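A minimal sketch of this single-metric-with-label pattern, using a plain dict to stand in for a real Prometheus client (the function and variable names here are illustrative, not a Prometheus or SRS API):

```python
from collections import Counter

# One metric, http_responses_total, keyed by the value of its `code` label,
# instead of a separate metric per status code.
http_responses_total = Counter()

def observe_response(code: int) -> None:
    http_responses_total[str(code)] += 1

observe_response(200)
observe_response(200)
observe_response(500)
observe_response(403)

print(http_responses_total["500"])         # 1 -- one slice of the metric
print(sum(http_responses_total.values()))  # 4 -- the whole metric, aggregated
```

Because everything lives under one metric name, rules and graphs can aggregate across all codes or filter on one code, exactly as the quoted guideline describes.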

I'm not sure what performance issues you are seeing with Prometheus. How many machines do you have? How many routes/streams? How are the metrics defined? How are the labels defined?

TRANS_BY_GPT3

qiantaossx (Author) commented Aug 15, 2022

This scenario:

  1. Thousands of streams, collecting data every 10 seconds.
  2. Collecting metrics for each playing stream, while adding several different labels for each stream.
    • Label 1: Stream ID
    • Label 2: Start time of the stream
    • Label 3: End time of the stream
    • Label 4: Other metrics of the stream...
  3. Prometheus deployed on a single machine, storing data for 15 days, and displayed using Grafana.

In this scenario, when aggregating data using Grafana, there may be performance issues when matching and filtering through different labels. For example, querying all data for a specific stream using the stream ID label, querying all streams within a certain time period, querying all streams with poor network conditions, or querying streams with frequent reconnections from the streaming source.
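One likely culprit in the label scheme above: putting a stream's start/end time in a label means every restart of the same stream mints a brand-new time series, so cardinality grows without bound over the 15-day retention window. A sketch, with a set standing in for Prometheus's series index (metric and label names are illustrative):

```python
# A Prometheus series is identified by its metric name plus its full label set,
# so any label value that changes per session creates a new series.
series_index = set()

def scrape(stream_id: str, start_time: str) -> None:
    series_index.add(("stream_bitrate_kbps", stream_id, start_time))

# One logical stream, restarted three times -> three distinct series.
for start in ("2022-08-15T10:00", "2022-08-15T11:30", "2022-08-15T12:05"):
    scrape("stream-42", start)

print(len(series_index))  # 3
```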

TRANS_BY_GPT3

winlinvip (Member) commented Aug 16, 2022

Prometheus metrics are meant for aggregation; values such as start time and end time are not suitable for Prometheus. They belong in a logging system like ELK, or in an APM/trace system; after processing and filtering there, they can still be displayed through Grafana. For more details, see the article Metrics, tracing, and logging.

Generally speaking, Prometheus falls in the Metrics category: it is used for alerting and aggregates many data points, so the data kept in Prometheus should be relatively small. For example, to alert on stream failures, collect a stream error-count metric and aggregate it into healthy and unhealthy streams across the whole deployment.

Querying streams within a specific time period, or analyzing streams with poor network conditions, is a job for data-analysis tools such as ELK or APM. These tools are part of the operations system; you should not rely solely on alerting, Prometheus, or metrics for everything. Overloading them leads to high system load and slow query performance.
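The split described above can be sketched as follows: a low-cardinality counter for alerting, and full per-stream detail shipped to a log pipeline. All names here are illustrative, not the SRS exporter API:

```python
from collections import Counter
import json

stream_errors_total = Counter()  # metric: one series per error kind, not per stream
log_lines = []                   # stands in for a log shipper feeding ELK/APM

def on_stream_error(stream_id: str, kind: str, detail: str) -> None:
    stream_errors_total[kind] += 1              # aggregate -> Prometheus-style metric
    log_lines.append(json.dumps(                # full detail -> log/trace system
        {"stream": stream_id, "kind": kind, "detail": detail}))

on_stream_error("stream-1", "reconnect", "publisher dropped, retry #3")
on_stream_error("stream-2", "reconnect", "publisher dropped, retry #1")

print(stream_errors_total["reconnect"])  # 2 -- alert on this aggregate
print(len(log_lines))                    # 2 -- query these details in ELK/Grafana
```

This keeps Prometheus cardinality bounded by the number of error kinds, while per-stream questions ("which streams reconnected frequently?") go to the log system.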

Add me on WeChat to chat? We are currently designing the official SRS exporter and welcome your participation.

TRANS_BY_GPT3

@winlinvip winlinvip added the Won't fix We won't fix it. label Nov 3, 2022
winlinvip (Member) commented Dec 25, 2022

In general, unless you are at the scale of a hundred thousand streams or a million plays, Prometheus is fully capable.

SRS now supports a Prometheus Exporter, and we will keep adding new metrics. Please refer to #2899.

TRANS_BY_GPT3

bianxg commented Jul 17, 2023

Is there a conclusion yet?
https://github.com/bluenviron/mediamtx#metrics
This project exposes statistics for each stream; I don't know how many streams it can support.

TRANS_BY_GPT3

@winlinvip winlinvip changed the title 使用 prometheus 进行统计是否适合 SRS 的媒体流 Is Prometheus suitable for monitoring SRS media streams? Jul 28, 2023
@winlinvip winlinvip added the TransByAI Translated by AI/GPT. label Jul 28, 2023
winlinvip (Member) commented Oct 13, 2023

Update: For roughly 99% of use cases, which is to say virtually all scenarios, Prometheus can support stream-level monitoring data. SRS will gradually improve this going forward.

TRANS_BY_GPT4
