Writing system metrics conventions into the specification #818

Closed
jmacd opened this issue Aug 17, 2020 · 11 comments · Fixed by #937

jmacd commented Aug 17, 2020

What are you trying to achieve?

Trying to finalize host and runtime metrics instrumentation plugins for OTel-Go.

In theory this instrumentation should emit metrics like those the collector would if it were reporting on the same host or container. I've been reviewing the hostmetrics receiver and noticed that it has "process.cpu.time" while OTEP 119 uses "system.cpu.time" in its examples. (CC: @aabmass)

I would specify that "process.cpu.time" reflects the process itself and "system.cpu.time" reflects the host system. This is consistent with the current OTel Collector hostmetrics; does that sound right?

The guidelines from OTEP 119 should be copied into the metrics semantic conventions section of the specification.

Additional context.

open-telemetry/oteps#119 discusses conventional metric naming guidelines.

https://github.com/open-telemetry/opentelemetry-collector/tree/master/receiver/hostmetricsreceiver/internal/scraper/processscraper is the hostmetrics code that generates CPU timing metrics.

jmacd added the spec:metrics (Related to the specification/metrics directory) label on Aug 17, 2020

aabmass commented Aug 17, 2020

I would specify that "process.cpu.time" reflects the process itself and "system.cpu.time" reflects the host system. This is consistent with the current OTel Collector hostmetrics; does that sound right?

Yes, I need to add this when copying over the OTEP. @jmacd can you assign this to me?


aabmass commented Aug 17, 2020

@james-bebbington also requested a process count metric in open-telemetry/oteps#119 (comment):

I believe this PR is ready to be merged but when writing this up for the specs repo, it would be good to add a convention for process counts (with "state" = running / inactive)

jmacd added the priority:p1 (Highest priority level) and release:required-for-ga (Must be resolved before GA release, or nice to have before GA) labels on Aug 19, 2020

aabmass commented Aug 26, 2020

Following up on #819, I think all of the UpDownSumObserver instruments I put in OTEP 119 should be changed to ValueObserver.

The metrics API points out the similarity/confusion between UpDownSumObserver and ValueObserver with the example of measuring queue size:

Therefore, considering the choice between ValueObserver and UpDownSumObserver, the recommendation is to choose the instrument with the more-appropriate default aggregation. If you are observing a queue size across a group of machines and the only thing you want to know is the aggregate queue size, use SumObserver because it produces a sum, not a distribution. If you are observing a queue size across a group of machines and you are interested in knowing the distribution of queue sizes across those machines, use ValueObserver.

For most of those metrics, I'd imagine a distribution (summary) would be more valuable than an aggregate sum across machines.
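
To make the difference in default aggregations concrete, here is a minimal sketch with made-up per-machine queue sizes (plain Go arithmetic, not actual OTel-Go API calls): the SumObserver-style default collapses the group into one total, while the ValueObserver-style default keeps the distribution.

```go
package main

import "fmt"

// Hypothetical per-machine queue-size observations for one collection interval.
var queueSizes = map[string]int64{
	"machine-a": 12,
	"machine-b": 3,
	"machine-c": 45,
}

func main() {
	// SumObserver-style default aggregation: one aggregate total across the group.
	var total int64
	for _, size := range queueSizes {
		total += size
	}
	fmt.Println("aggregate queue size:", total) // 60

	// ValueObserver-style default aggregation: keep the distribution
	// (min/max here as a stand-in for a full summary).
	minSize, maxSize := int64(1<<62), int64(0)
	for _, size := range queueSizes {
		if size < minSize {
			minSize = size
		}
		if size > maxSize {
			maxSize = size
		}
	}
	fmt.Println("queue size distribution: min", minSize, "max", maxSize) // min 3 max 45
}
```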


aabmass commented Aug 26, 2020

Now I'm not sure if the same thing applies to the SumObserver instruments. E.g. for system.cpu.time, would you rather see the summed cpu time or a distribution of cpu time?


jmacd commented Aug 27, 2020

Usually when I look at system.cpu.usage (the OTEP 119 name, a.k.a. system.cpu.time, the OTel-Collector name), I want to see the rate in terms of CPUs per second. That way, the number can be compared with the available CPUs. A distribution of CPU usage deltas would have to be divided by each measurement's interval (i.e., end_time - start_time), I think, to make it useful as a distribution. That is, in a sense, what SumObserver and UpDownSumObserver signify: that you can compute a meaningful rate.

We've sometimes referred to this as a structural difference between ValueRecorder and SumObserver. For a SumObserver (Adding Structure), we know that the differences between M and N and between (M+Offset) and (N+Offset) are equivalent, because additive measurements are linear (e.g., the difference between 1000s and 2000s of CPU time is the same as between 2000s and 3000s). This applies to .usage metrics, which I still think should be SumObserver when monotonic.

OTOH, a .utilization metric does not have the same property. The differences between M and N and between (M+Offset) and (N+Offset) are NOT necessarily equivalent (e.g., the difference between 0.1 and 0.2 is not the same as between 0.7 and 0.8).
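
As a small sketch of the rate computation described above, with hypothetical numbers (cumulative CPU seconds observed 60 seconds apart on an 8-CPU host); this is plain Go arithmetic, not OTel API code:

```go
package main

import "fmt"

// Two hypothetical observations of a cumulative system.cpu.time sum, in seconds,
// taken 60 seconds apart on an 8-CPU host.
const (
	intervalSeconds = 60.0
	cpuTimeStart    = 1000.0 // cumulative CPU seconds at the start of the interval
	cpuTimeEnd      = 1240.0 // cumulative CPU seconds at the end of the interval
	numCPUs         = 8.0
)

func main() {
	// Because the sum is additive, the delta over the interval is meaningful on its own,
	// and dividing by the interval gives a rate in "CPUs" comparable with numCPUs.
	rate := (cpuTimeEnd - cpuTimeStart) / intervalSeconds
	fmt.Printf("CPU usage rate: %.1f of %.0f CPUs\n", rate, numCPUs) // 4.0 of 8 CPUs

	// A .utilization value is already a ratio of used to available; the difference
	// between two utilization points (0.1 -> 0.2 vs. 0.7 -> 0.8) is not something
	// you can meaningfully add across hosts, which is why a distribution fits it better.
	utilization := rate / numCPUs
	fmt.Printf("CPU utilization: %.2f\n", utilization) // 0.50
}
```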

Following up on #819, I think all of the UpDownSumObserver I put in OTEP 119 should be changed to ValueObserver.

I looked over the UpDownSumObserver metrics there, and:

  • Agree that all .utilization metrics should become ValueObserver
  • Still think .usage metrics should use SumObserver or UpDownSumObserver
  • No strong opinion for system.network.connections.


aabmass commented Aug 28, 2020

OTOH, a .utilization metric does not have the same property. The differences between M and N and between (M+Offset) and (N+Offset) are NOT necessarily equivalent (e.g., the difference between 0.1 and 0.2 is not the same as between 0.7 and 0.8).

I'm not sure if I follow how utilization metrics wouldn't be linear in the same way, since they are just value ratios. E.g. "there was a 10% increase in X". I do agree that they should be ValueObserver though, since they don't really represent a sum.

No strong opinion for system.network.connections.

@bogdandrutu, I'm not sure if I understood what makes this a sum. To measure it, you would just be observing the number of connections once each interval and exporting that value; where is the summation?


aabmass commented Aug 28, 2020

@jmacd after the SIG meeting, I also want to clarify why OTEP 119 calls it system.cpu.time and not usage. Here's what I put in the Semantic Conventions section of the OTEP:

  • usage - an instrument that measures an amount used out of a known total amount should be called entity.usage. For example, system.filesystem.usage for the amount of disk space used. A measure of the amount of an unlimited resource consumed is differentiated from usage. This may be time, data, etc.
    ...
  • time - an instrument that measures passage of time should be called entity.time. For example, system.cpu.time with varying values of label state for idle, user, etc.

The same thing goes for metrics like system.disk.operations, where you can't "use" a certain number of available operations; it's unbounded. The other complication is that for something like disk you have system.disk.{io,operations,time} -- which one would you want to call "usage"?

I am happy to simplify it for the spec if anyone has a strong opinion against this convention. I'm not sure if these distinctions are even valuable.

bogdandrutu commented

@bogdandrutu, I'm not sure if I understood what makes this a sum. To measure it, you would just be observing the number of connections once each interval and exporting that value; where is the summation?

The number of connections is a "sum": the entity that calculates it does a "sum" of +/- 1 each time a connection is opened or closed.

bogdandrutu commented

In other words, if you were to directly instrument the Open/Close connection calls, you would use an UpDownCounter, not a ValueRecorder.
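
A minimal sketch of that bookkeeping, using a plain atomic counter rather than the actual OTel-Go API (the function names and values here are illustrative only): Add(+1) on open and Add(-1) on close is what makes the observed value a sum.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// openConnections stands in for the value an UpDownCounter (or an
// UpDownSumObserver callback) would report for system.network.connections.
var openConnections int64

// With a synchronous UpDownCounter you would instrument the open/close call
// sites directly: Add(+1) on open, Add(-1) on close. This sketch keeps the
// same bookkeeping with a plain atomic counter.
func onConnectionOpened() { atomic.AddInt64(&openConnections, 1) }
func onConnectionClosed() { atomic.AddInt64(&openConnections, -1) }

func main() {
	onConnectionOpened()
	onConnectionOpened()
	onConnectionClosed()

	// An asynchronous observer would simply report the current value each
	// interval; the "sum" is the accumulated +1/-1 history behind that value.
	fmt.Println("system.network.connections:", atomic.LoadInt64(&openConnections)) // 1
}
```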


kjordy commented Sep 1, 2020

@aabmass, regarding the naming when measuring time, I think that .time is clearer in denoting that it's a measurement of the number of seconds passed, whereas .utilization is the ratio compared to the total time. It's useful to see CPU as a rate in terms of CPUs per second (the same thing jmacd said above), and I think "CPU usage" and "CPU utilization" are both commonly used to name that rate metric. The Semantic Conventions section of the OTEP defines utilization as the ratio, so I think it would be consistent for metrics that measure time to use .time and .utilization.

I'm unsure about the effects of using a sum aggregation on .usage. The metrics proto says these use a Sum type, which "represents the type of a numeric int scalar metric that is calculated as a sum of all reported measurements over a time interval." But for things like filesystem.usage, I would want to see the output as a gauge. Is my interpretation of the conversion from the instrument to the metric type correct?


jmacd commented Sep 3, 2020

I also want to clarify why OTEP 119 calls it system.cpu.time and not usage

@aabmass I've come to understand this issue and definitely agree that .time is right when measuring a sum of duration measurements. I somehow had misunderstood and thought OTEP 119 suggested .usage and the collector was in disagreement by using .time.

"represents the type of a numeric int scalar metric that is calculated as a sum of all reported measurements over a time interval."

@kjordy the thing we are trying to define here somehow avoids talking about cumulative vs. delta measurements, and that's an area where the language can probably be improved in several places. The Sum aggregation for a series of delta measurements is computed with addition, whereas the Sum aggregation for a series of cumulative measurements is simply the point values themselves.
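
A tiny illustration of that distinction, with made-up values for the same underlying counter reported under the two temporalities:

```go
package main

import "fmt"

func main() {
	// Delta temporality: each report covers only its own interval, so the Sum
	// aggregation over the series is computed with addition.
	deltas := []float64{10, 25, 5}
	var sumOfDeltas float64
	for _, d := range deltas {
		sumOfDeltas += d
	}
	fmt.Println("sum over delta reports:", sumOfDeltas) // 40

	// Cumulative temporality: each report already carries the running total since
	// start, so each point *is* the Sum; no addition across reports is needed.
	cumulative := []float64{10, 35, 40}
	fmt.Println("sum from cumulative reports:", cumulative[len(cumulative)-1]) // 40
}
```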

@aabmass I've struggled with how to explain this idea of how to choose between Adding instruments, where measurements are "things we add", and Grouping instruments, where measurements are "things we average". One of the best resources I've found to explain these differences is "On the Theory of Scales of Measurement" (S. S. Stevens, 1946), which is old, findable, and worth reading.

The reason @bogdandrutu states that connections should be sums is that they are things where each unit is the same countable amount--they are "interval" measurements in the terminology of that paper, whereas utilization is a "ratio" measurement. Each unit in a sum of connections is the same as another, whereas we can't say that about ratios.

Everyone: I'd like to improve the API specification on this matter, welcome suggestions & PRs. 😀

@aabmass I think we're at 💯 on moving OTEP 119 into the specification, where .utilization is a ValueObserver, .time is a SumObserver, and .usage is usually an UpDownSumObserver.
