Support for probability sampling at-a-glance #2224

jmacd · 2021-12-14T22:05:46Z

What are you trying to achieve?

Probability sampling PR #2047 is waiting for reviews. I would like to help the community support this proposal.

#2047 adds support for building Span-to-Metrics pipelines; it is narrowly focused on how we count spans that have been probabilistically sampled. It does not attempt to solve all challenges for OpenTelemetry trace users that want sampling, so it may be a source of confusion that so much effort has been spent to specify a partial solution.

Why has this been proposed?

The current OpenTelemetry specification has a "TODO" written in it, along with a "WARNING". The TODO and the WARNING were written because the TraceIdRatioBased sampler specification is incomplete. The TraceIdRatioBased sampler was envisioned as a solution for consistent sampling, but it never delivered that.

We have a narrowly focused PR that stops short of fixing the TODO. Instead, it introduces new Sampler definitions that could in the future be swapped in to complete the TODO. The next step for the Sampling SIG is to work on configurable sampling and remote-configurable sampling. To that end, #2047 is only a building block that delivers a fix for the TODO/WARNING. #2047 is a technical replacement for TraceIdRatioBased, ParentBased, AlwaysOn, and AlwaysOff samplers. These four samplers meet the same requirements and support Span-to-Metrics.

Where is the user guide for Sampling?

#2047 is a specification to enable interoperability between Samplers and Span-to-metrics systems. It is no more a user-guide than the existing SDK specification of the built-in Samplers is. Although the lack of a user guide can be seen as a failing, the OTel specification has already failed to explain how it wants users to configure sampling. See also #1597 and #2113.

In a nutshell, how does this proposal work?

Today's W3C trace context does not specify that any of the 128 bits in a TraceId are true, uniform-distributed random bits. Lacking this certainty, the options for correct probability sampling were:

consider hashing the TraceId; however we found hashing is expensive, and a portable hashing algorithm of sufficient quality is not widely available
add additional randomness to the context.

#2047 takes the second approach, adds a new field known as "r-value" that carries a small amount of randomness in the OpenTelemetry tracestate. The amount of randomness is tailored to the number of supported sampling probabilities, which are limited to powers-of-two under this proposal. Thus r-value is contained in 6-bits of information.

#2047 uses the term "Adjusted count", which is how much we should add to the span count for each span when received. A field known as "p-value" in the tracestate conveys the context's adjusted count. Adjusted count equals 2**p under this proposal. Sampling probability is the inverse of adjusted count (i.e., probability == 2**-p).

By including r-value and p-value in the tracestate, these two values automatically propagate through the context and are recorded on every span. Because tracestate is automatically recorded, typically only a root Sampler needs to be changed to add Span-to-metrics support to a system. Trace consumers only need to recognize p-value to correctly count spans.

What alternatives have been considered?

As it stands, the cost to users who enable Span-to-metrics is not nearly as low as it could be. The options should be evaluated in terms of how many bytes are added to each context when sampled vs. not sampled. The current proposal requires r-value for both sampled and unsampled contexts. While p-value is only required for sampled contexts, this hardly matters given that r-value forces the introduction of tracestate for all contexts.

Restricting the choice of sampling probability to powers of two seems unnecessarily restrictive. An alternative, while more expensive, would eliminate the use of a base-2 logarithm so that adjusted count could be any integer in [1, 2**62], for example.

How can this be improved, working with W3C?

The W3C traceparent has unused trace flag bits that could be used without increasing the size of a context or requiring alternative header syntax. Suppose that traceparent had an option to specify at least 62 of the 128 TraceId bits be true random bits; "r-value" can be calculated from these TraceIDs, in which case tracestate is only needed for p-value--meaning, r-value would go away and tracestate would only be needed when sampled.

How can this be extended to support non-power of two probabilities?

If a proposal to change W3C traceparent were adopted, so that the r-value of #2047 were calculated from TraceID instead of encoded in the tracestate, then other schemes can be identified with consistent sampling behavior that could be mixed in the future.

Consider a hypothetical "c-value", an alternative to p-value that directly states the adjusted count. Probability equals 1/c. This can be consistently computed from 62 random bits so that the two schemes are equivalent at power-of-two probabilities. In #2047, r-value equals the number of leading bits in the 62-bit unsigned integer D and the sampling equation is sampled <=> p <= LeadingZeros(D). The alternative c-value defines a 62-bit threshold equal to 2**62 / c with the sampling equation sampled <=> D < 2**62 / c. (These schemes are equivalent when c = 2**p, I'm leaving out details about zero adjusted count.)

While this "c-value" proposal leads to a more human-readable tracestate, it is also more difficult to specify and currently substantially more expensive to transmit in the unsampled case. Considering the effort and reward, it makes sense to leave this option until after W3C supports built-in randomness. At that point, it will make sense to use either p-value with power-of-two probabilities or c-value for non-power-of-two, integer-reciprocal probabilities.

Conclusions?

With the above in mind (i.e., potential traceparent randomness, potential non-power-of-two support),#2047 should be seen as a compromise to let us demonstrate demand for Span-to-Metrics pipelines to use as evidence for adding traceparent randomness. Power-of-two probability sampling is relatively easy to specify and implement, at the same time it leaves a low commitment behind as trace consumers will only need to recognize p-value going forward (i.e., not r-value). #2047 asks us to agree that p-value will always be a compact way to represent power-of-two adjusted counts, whether or not we admit the hypothetical c-value or other schemes in the future.

The text was updated successfully, but these errors were encountered:

william-tran · 2021-12-16T16:21:05Z

Thank you for the context, this is really helpful.

bdarfler · 2021-12-16T18:43:12Z

This is great @jmacd! I'll take a look at lifting some key pieces back into #2047 soon so it's all in one place for the casual reader.

jmacd · 2022-01-07T16:32:21Z

Summarizing the state of #2047, especially for key reviewers @bogdandrutu @yurishkuro @open-telemetry/specs-trace-approvers:

The issue description above is still accurate. The proposal is a deliberate compromise meant to let us begin adoption of probability sampling while appealing to W3C to improve the traceparent for our needs.

This proposal introduces two experimental samplers that are meant as "drop-in" replacements for the built-in TraceIDRatioBased and ParentBased sampler, except that we have discovered obstacles (see #2179).

Please be aware that by accepting this proposal, we definitely leave open these possibilities for the future:

non-power-of-two sampling probabilities
alternate ways to state span adjusted counts
built-in support for probability sampling
always-on probability sampling

I expect few users will actually consume the samplers defined in this specification directly. The reason these experimental samplers will be useful immediately is that when we drop these in to a remote-configurable Sampler such as Jaeger's (and provided the composition rules specified here are implemented), the resulting Sampler will include probability sampling automatically, thus will support Span-to-metrics pipelines out of the box.

IOW support for #2047 means Jaeger remote sampling will immediately begin appealing to and could soon work for vendors with Span-to-metrics pipelines.

jmacd added the spec:trace Related to the specification/trace directory label Dec 14, 2021

github-actions bot assigned SergeyKanzhelev Dec 14, 2021

jmacd mentioned this issue Jan 5, 2022

Probability sampling in tracestate specification #2047

Merged

carlosalberto closed this as completed in #2047 Jan 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for probability sampling at-a-glance #2224

Support for probability sampling at-a-glance #2224

jmacd commented Dec 14, 2021

william-tran commented Dec 16, 2021

bdarfler commented Dec 16, 2021

jmacd commented Jan 7, 2022 •

edited

Loading

Support for probability sampling at-a-glance #2224

Support for probability sampling at-a-glance #2224

Comments

jmacd commented Dec 14, 2021

william-tran commented Dec 16, 2021

bdarfler commented Dec 16, 2021

jmacd commented Jan 7, 2022 • edited Loading

jmacd commented Jan 7, 2022 •

edited

Loading