
Is Prometheus suitable for monitoring SRS media streams? #3141

Closed
qiantaossx opened this issue Aug 10, 2022 · 6 comments
Assignees: winlinvip
Labels: Discussion (Discussion or questions), TransByAI (Translated by AI/GPT), Won't fix (We won't fix it)

qiantaossx commented Aug 10, 2022

Our scenario is to collect per-stream statistics such as bitrate and fps in real time. We have deployed our own Prometheus instance, though it is limited by the storage capacity of a single machine, and we use Grafana for visualization.

In practice we found the data volume is too large, and Prometheus quickly hits performance bottlenecks. We would like to discuss whether Prometheus is only suitable for aggregate, system-wide metrics, and not for monitoring the status of each individual stream.

TRANS_BY_GPT3

@winlinvip winlinvip self-assigned this Aug 15, 2022
@winlinvip winlinvip added the Discussion Discussion or questions. label Aug 15, 2022
winlinvip (Member) commented Aug 15, 2022

Looking at the example in the Prometheus documentation, Use Labels:

To give you a better idea of the underlying numbers, let's look at node_exporter. node_exporter exposes metrics for every mounted filesystem. Every node will have in the tens of timeseries for, say, node_filesystem_avail. If you have 10,000 nodes, you will end up with roughly 100,000 timeseries for node_filesystem_avail, which is fine for Prometheus to handle.

If you were to now add quota per user, you would quickly reach a double digit number of millions with 10,000 users on 10,000 nodes. This is too much for the current implementation of Prometheus. Even with smaller numbers, there's an opportunity cost as you can't have other, potentially more useful metrics on this machine any more.

The point is: for the metric node_filesystem_avail, with ten thousand machines and roughly ten time series each, there are about one hundred thousand series in total, which Prometheus handles comfortably. But if you also collect a per-user quota for ten thousand users on ten thousand machines, that is on the order of one hundred million series, which is beyond its capacity.

In other words, use labels rather than creating a separate metric for each stream. A system typically has only a few dozen metrics, not hundreds or thousands.
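The arithmetic above can be checked directly (a sketch; the figure of 10 series per node is the example's rough average, not an exact number):

```python
# Back-of-the-envelope series counts for the node_exporter example above.
nodes = 10_000
filesystems_per_node = 10   # "in the tens" of series per node; 10 as a round figure
users = 10_000

series_filesystem = nodes * filesystems_per_node  # one series per (node, mountpoint)
series_user_quota = nodes * users                 # one series per (node, user)

print(series_filesystem)   # 100000 -- fine for Prometheus
print(series_user_quota)   # 100000000 -- far beyond a single Prometheus server
```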

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.

To categorize a metric, use labels. For example, instead of defining two metrics, http_responses_500_total and http_responses_403_total, define a single metric http_responses_total with a code label.
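A minimal sketch of this single-metric-with-label pattern, using a plain dict to stand in for a real Prometheus client (the function and variable names here are illustrative, not a Prometheus or SRS API):

```python
from collections import Counter

# One metric, http_responses_total, keyed by the value of its `code` label,
# instead of a separate metric per status code.
http_responses_total = Counter()

def observe_response(code: int) -> None:
    http_responses_total[str(code)] += 1

observe_response(200)
observe_response(200)
observe_response(500)
observe_response(403)

print(http_responses_total["500"])         # 1 -- one slice of the metric
print(sum(http_responses_total.values()))  # 4 -- the whole metric, aggregated
```

Because everything lives under one metric name, rules and graphs can aggregate across all codes or filter on one code, exactly as the quoted guideline describes.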

I'm not sure what performance issues you are seeing with Prometheus. How many machines do you have? How many routes/streams? How are the metrics defined? How are the labels defined?

TRANS_BY_GPT3

qiantaossx (Author) commented Aug 15, 2022

This scenario:

  1. Thousands of streams, collecting data every 10 seconds.
  2. Collecting metrics for each playing stream, while adding several different labels for each stream.
    • Label 1: Stream ID
    • Label 2: Start time of the stream
    • Label 3: End time of the stream
    • Label 4: Other metrics of the stream...
  3. Prometheus deployed on a single machine, storing data for 15 days, and displayed using Grafana.

In this scenario, when aggregating data using Grafana, there may be performance issues when matching and filtering through different labels. For example, querying all data for a specific stream using the stream ID label, querying all streams within a certain time period, querying all streams with poor network conditions, or querying streams with frequent reconnections from the streaming source.
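One likely culprit in the label scheme above: putting a stream's start/end time in a label means every restart of the same stream mints a brand-new time series, so cardinality grows without bound over the 15-day retention window. A sketch, with a set standing in for Prometheus's series index (metric and label names are illustrative):

```python
# A Prometheus series is identified by its metric name plus its full label set,
# so any label value that changes per session creates a new series.
series_index = set()

def scrape(stream_id: str, start_time: str) -> None:
    series_index.add(("stream_bitrate_kbps", stream_id, start_time))

# One logical stream, restarted three times -> three distinct series.
for start in ("2022-08-15T10:00", "2022-08-15T11:30", "2022-08-15T12:05"):
    scrape("stream-42", start)

print(len(series_index))  # 3
```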

TRANS_BY_GPT3

winlinvip (Member) commented Aug 16, 2022

Prometheus metrics are meant for aggregation; values such as start time and end time are not suitable for Prometheus. They belong in a logging system like ELK, or in an APM/trace system; after processing and filtering there, they can still be displayed through Grafana. For more details, see the article Metrics, tracing, and logging.

Generally speaking, Prometheus falls in the Metrics category: it is used for alerting and aggregates many data points, so the data kept in Prometheus should be relatively small. For example, to alert on stream failures, collect a stream error-count metric and aggregate it into healthy and unhealthy streams across the whole deployment.

Querying streams within a specific time period, or analyzing streams with poor network conditions, is a job for data-analysis tools such as ELK or APM. These tools are part of the operations system; you should not rely solely on alerting, Prometheus, or metrics for everything. Overloading them leads to high system load and slow query performance.
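The split described above can be sketched as follows: a low-cardinality counter for alerting, and full per-stream detail shipped to a log pipeline. All names here are illustrative, not the SRS exporter API:

```python
from collections import Counter
import json

stream_errors_total = Counter()  # metric: one series per error kind, not per stream
log_lines = []                   # stands in for a log shipper feeding ELK/APM

def on_stream_error(stream_id: str, kind: str, detail: str) -> None:
    stream_errors_total[kind] += 1              # aggregate -> Prometheus-style metric
    log_lines.append(json.dumps(                # full detail -> log/trace system
        {"stream": stream_id, "kind": kind, "detail": detail}))

on_stream_error("stream-1", "reconnect", "publisher dropped, retry #3")
on_stream_error("stream-2", "reconnect", "publisher dropped, retry #1")

print(stream_errors_total["reconnect"])  # 2 -- alert on this aggregate
print(len(log_lines))                    # 2 -- query these details in ELK/Grafana
```

This keeps Prometheus cardinality bounded by the number of error kinds, while per-stream questions ("which streams reconnected frequently?") go to the log system.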

Add me on WeChat to chat? We are currently designing the official SRS exporter and welcome your participation.

TRANS_BY_GPT3

@winlinvip winlinvip added the Won't fix We won't fix it. label Nov 3, 2022
winlinvip (Member) commented Dec 25, 2022

In general, unless you are at the scale of a hundred thousand streams or a million plays, Prometheus is fully capable.

SRS now supports a Prometheus Exporter, and we will keep adding new metrics. Please refer to #2899.

TRANS_BY_GPT3

bianxg commented Jul 17, 2023

Is there a conclusion yet?
https://github.com/bluenviron/mediamtx#metrics
This project exposes statistics for each stream; I don't know how many streams it can support.

TRANS_BY_GPT3

@winlinvip winlinvip changed the title 使用 prometheus 进行统计是否适合 SRS 的媒体流 Is Prometheus suitable for monitoring SRS media streams? Jul 28, 2023
@winlinvip winlinvip added the TransByAI Translated by AI/GPT. label Jul 28, 2023
winlinvip (Member) commented Oct 13, 2023

Update: For roughly 99% of use cases, which is to say virtually all scenarios, Prometheus can support stream-level monitoring data. SRS will gradually improve this going forward.

TRANS_BY_GPT4
