# Non-power-of-two Probability Sampling using 56 random TraceID bits

## Motivation

The existing, experimental [specification for probability sampling using TraceState](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md)
supporting Span-to-Metrics pipelines is limited to power-of-two
probabilities and is designed to work without making assumptions about
TraceID randomness.

> **Reviewer comment:** There is a lot of good information in the
> existing experimental spec (e.g., on what consistent probability
> sampling is, why it is needed, what "adjusted count" means, what an
> adjusted count of "0" means, etc.). Do you plan to bring some of that
> content forward here? I want to make sure we can still
> preserve/standardize the parts that remain applicable under this
> proposal, even if/when that spec gets deprecated.
>
> **Author reply:** I was planning to start with the existing
> specification (i.e., https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md)
> and modify it, so it should start with more background. I appreciate
> your pointing out that this information is lost in this OTEP.

Head sampling requires the use of TraceState to propagate context from
the parent for recording in child spans, in support of Span-to-Metrics
pipelines. Tail sampling does not require context propagation
support, but it has many similar requirements:

1. Sampling should be "consistent", so that independent collection
paths make identical sampling decisions.
2. Spans should be countable in a Span-to-Metrics pipeline, which
requires knowing the "adjusted count" for each span directly from
the data.

This OTEP makes use of the [draft-standard W3C tracecontext `random`
flag](https://w3c.github.io/trace-context/#random-trace-id-flag),
which is an indicator that 7 bytes of true randomness are available
for probability sampler decisions.

This OTEP proposes a specification supporting 56-bit precision tail
sampling. This is seen as particularly important for the
implementation of probabilistic tail samplers (e.g., in the
OpenTelemetry Collector), as explained below.

## Explanation

The existing, experimental TraceState probability sampling
specification relies on two variables known as **r-value** and
**p-value**. The r-value carries the source of randomness and the
p-value carries the effective sampling probability. The preceding
specification recommends the use of interpolation to achieve
non-power-of-two sampling probabilities.

This document proposes an alternative to that r-value, p-value
specification, one that is simpler to implement, can be used in both
head and tail samplers, and naturally supports non-power-of-two
sampling probabilities.

This proposal uses the 7 bytes of intrinsic randomness in the TraceID,
the ones (draft-) specified [in the W3C tracecontext `random`
flag](https://w3c.github.io/trace-context/#random-trace-id-flag). With
these bits, a simple threshold test is defined to allow sampling based
on TraceID randomness.

This document proposes extending the p-value, r-value mechanism with
support for a new indicator for non-power-of-two probability sampling
known as "t-value", where "t" is chosen because it signifies a
threshold. Tail-based sampling encoded by t-value can be combined
with p-value, in which case the adjusted count implied by t-value is
**multiplied** with the adjusted count implied by p-value because they
are independent mechanisms.

### Detailed design

Support for Span-to-Metrics pipelines requires knowing the "adjusted
count" of every collected span. This proposal defines the sampling
"threshold" as a 7-byte string used to make consistent sampling
decisions, as follows.

1. The last 7 bytes of the TraceID are interpreted as a 56-bit
   unsigned value in big-endian byte order.
2. If the unsigned value determined by the TraceID is less than
   the sampling threshold, the span is sampled; otherwise it is
   discarded.

To calculate the sampling threshold, we begin with an IEEE 754
standard double-precision floating point number. With 52 bits of
significand and a floating exponent, the probability value used to
calculate a threshold may be capable of representing more or less
precision than the sampler can execute.

> **Reviewer comment (@oertl, Jun 1, 2023):** "With 52-bits of
> significand..." Double-precision floating-point values have a 52-bit
> mantissa but are able to represent 53-bit significands (except for
> subnormal values). See https://cs.stackexchange.com/a/152267/102560.

We have many ways of encoding a floating point number as a string,
some of which result in loss of precision. This specification dictates
exactly how to calculate a sampling threshold from a floating point
number, and it is the sampling threshold that determines exactly the
effective sampling probability. The conversion between sampling
probability and threshold is not exactly reversible, so to determine
the sampling probability exactly from an encoded t-value, first
compute the exact sampling threshold, then use the threshold to derive
the exact sampling probability.

> **Reviewer comment:** Nit: This is the first reference to t-value in
> this document, but t-value hasn't been introduced yet. Update:
> Above, I have proposed a short high-level introduction to t-value.
> [Overall, my general feedback is that it would be good to first
> explain the 10,000-foot view of the new proposal before this section,
> which dives too deeply into the low-level details of the exact
> calculation approach.]

From the exact sampling probability, we are able to compute (subject
to machine precision) the adjusted count of each span. For example,
given a sampling probability encoded as "0.1", we first compute the
nearest base-2 floating point, which is exactly 0x1.999999999999ap-04,
or approximately 0.10000000000000000555. The exact quantity in this
example, 0x1.999999999999ap-04, is multiplied by `2^56` and rounded to
an unsigned integer (7205759403792794). This specification says that
to carry out sampling probability "0.1", we should keep exactly the
7205759403792794 smallest unsigned values of the 56-bit random TraceID
bits.
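A minimal sketch of these conversions in Go; the helper names are
hypothetical, and the rounding follows the nearest-integer rule
described above:

```go
package sampling

import "math"

// probabilityToThreshold multiplies the probability by 2^56 and rounds
// to the nearest unsigned integer. For probability 0.1 this yields
// 7205759403792794.
func probabilityToThreshold(prob float64) uint64 {
	return uint64(math.Round(prob * 0x1p56))
}

// thresholdToProbability recovers the effective sampling probability
// implied by a threshold; the exact value is threshold / 2^56, and the
// returned float64 is its nearest representable value.
func thresholdToProbability(threshold uint64) float64 {
	return float64(threshold) / 0x1p56
}
```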

## T-value encoding for adjusted counts

> **Reviewer comment:** It will be good to define the mutation and
> propagation rules for t-value, e.g., something along the lines of:
>
> - If a participant is doing parent-based sampling, it should
>   propagate the t-value from its parent.
> - If a participant is doing consistent probability sampling using its
>   own sampling rate, it should mutate the t-value to set the new
>   adjusted count / sampling rate.
>
> **Author reply:** Not quite answering your question, but I've
> prototyped open-telemetry/opentelemetry-collector-contrib#22058 with
> a different sort of answer. This case refers to span data records,
> where there are multiple collectors in a pipeline. The first
> collector may sample at 1/10; when a subsequent collector samples at
> 1/20, the t-value of the selected spans will be updated. If the
> subsequent collector samples at 1/2, however, it is being less
> selective than the first collector, so it should not modify the
> t-value. That is to say, t-value adjusted counts should not fall and
> t-value probabilities should not rise. See the logic here:
> https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/22058/files#diff-33f10350e2875f926dd2be6fc4c6bb88cfd8043cf6ac6d100295cf654771d90dR210-R219
>
> **Reviewer comment:** I think there's a problem with such sampling
> behavior. Assume the previous collector in the chain sampled all
> traces with errors at probability 1, and all remaining traces at
> 1/100. If the next collector in the chain is configured with 1/10, it
> will not touch the healthy traces, but it will decimate the traces
> with errors. So any stratified sampling logic must be known and
> repeated by all collectors in the pipeline. Even if we prohibit
> stratified sampling, to set a collector's sampling probability in any
> meaningful way we have to know the minimum sampling probability of
> all the preceding collectors.


The example used sampling probability "0.1", which is a concisely
rounded value but not exactly a power of two. The use of decimal
floating point in this case conceals the fact that there is an integer
reciprocal, and when there is an integer reciprocal there are good
reasons to preserve it. Rather than encoding "0.1", it is appealing
to encode the adjusted count (i.e., "10") because it conveys exactly
the user's intention.

This suggests that the t-value encoding be designed to accept either
the sampling probability or the adjusted count, depending on how the
sampling probability was derived. Thus, the proposed t-value shall be
parsed as a floating point or integer number using any POSIX-supported
printf format specifier. Values in the range [0x1p-56, 0x1p+56] are
valid. Values in the range [0x1p-56, 1] are interpreted as a sampling
probability, while values in the range [1, 0x1p+56] are interpreted as
an adjusted count. Adjusted count values must be integers, while
sampling probability values can be arbitrary floating point values.
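A minimal sketch of this parsing rule in Go; the function name and
error handling are hypothetical. Go's `strconv.ParseFloat` accepts
both decimal and hexadecimal floating point forms (e.g., "0.1", "10",
"0x1p-20"):

```go
package sampling

import (
	"fmt"
	"math"
	"strconv"
)

// parseTValue returns the adjusted count implied by a t-value string.
func parseTValue(tvalue string) (float64, error) {
	v, err := strconv.ParseFloat(tvalue, 64)
	if err != nil {
		return 0, err
	}
	if v < 0x1p-56 || v > 0x1p+56 {
		return 0, fmt.Errorf("t-value out of range: %s", tvalue)
	}
	if v > 1 {
		// Values above 1 are adjusted counts and must be integers.
		if v != math.Trunc(v) {
			return 0, fmt.Errorf("adjusted count must be an integer: %s", tvalue)
		}
		return v, nil
	}
	// Values in [0x1p-56, 1] are sampling probabilities; the adjusted
	// count is the reciprocal.
	return 1 / v, nil
}
```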

Whether to encode the sampling probability or the adjusted count is a
choice. In both cases, the interpreted value translates into an exact
threshold, which determines the exact inclusion probability. From the
exact inclusion probability, we can determine the adjusted count to
use in a span-to-metrics pipeline. When the t-value is _stated_ as an
adjusted count (as opposed to a sampling probability), implementations
can use the integer value in a span-to-metrics pipeline. Otherwise,
implementations should use an adjusted count of 1 divided by the
sampling probability.

> **Reviewer comment:** This is a minor thing, but perhaps a section
> describing how to encode power-of-two sampling probabilities would be
> helpful. Since I am not 100% familiar with the POSIX-supported printf
> format, I wonder what would be the most efficient way. For example,
> if the sampling probability is 2^(-20) (corresponding to p=20), we
> could write t=0x1p-20 or t=1048576, but would t=0x1p+20 or even
> t=0x1p20 be allowed?
>
> The reason I ask is that power-of-two sampling probabilities are a
> natural discretization for me, since this is the only discretization
> that results in integer adjusted counts while the relative spacing is
> constant. Thus, I believe we will often see t-values that are powers
> of two. Therefore, it might be useful to define a more compact
> representation of the t-value when it is a power of two. Possibly it
> makes sense to keep the p-value?

## Where to store t-value in a Span and/or Log Record

Although prepared as a solution for tail sampling, the t-value
encoding scheme could also be used to convey Logs sampling. While
tail sampling does not require the use of trace state, which is
associated with context propagation, the trace state is a natural
place to store t-value because it should be interpreted along with
p-value, which resides in the trace state. However, if spans store
t-value in trace state, it is not clear how to convey logs sampling.

Here are ways to address this:

1. Store t-value in a new dedicated field in the Span or Log Record
   (as a string). (Author's preference.)
2. Store t-value as a Span or Log Record attribute (as a string).
   This may cause confusion because the attribute, which was not
   applied by a user, can change along the collection path even though
   the data has not changed.
3. Store t-value as an optional floating point field in the Span or
   Log Record. An optional field is required because we need a
   meaningful way to represent zero probability, for cases where spans
   are exported due to a non-probabilistic decision.
4. Create a new field in both Spans and Log Records as a dedicated
   field for storing t-values.

The benefit of using TraceState is that it is an extensible field,
made for multiple vendors to place arbitrary contents. It is not
clear whether use of tracestate to record collection-time decisions is
appropriate, or whether it is only meant for in-band context
propagation. If this use case is acceptable, the name Trace State
would become a legacy; in that case, a more signal-neutral name for
the field could be developed (e.g., "Collection State").

### 90% sampling

The following header

```
tracestate: ot=t:0.9
```

corresponds with 90% sampling.

### 1-in-3 sampling

The following header

```
tracestate: ot=t:3
```

corresponds with 1-in-3 sampling.

### 25% head sampling, 1-in-10 tail sampling

The following header

```
tracestate: ot=p:2;t:10
```

corresponds with 1-in-4 sampling at the head and 1-in-10 tail
sampling. The resulting span has adjusted count 40.
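A minimal sketch of this combination in Go; the function name is
hypothetical, and the multiplication rule follows from the two
mechanisms being independent:

```go
package sampling

import "math"

// combinedAdjustedCount multiplies the adjusted count implied by the
// p-value (a power-of-two head sampling exponent) with the adjusted
// count implied by the t-value. For p=2 and t=10 the result is
// 4 * 10 = 40.
func combinedAdjustedCount(pValue int, tAdjustedCount float64) float64 {
	return math.Exp2(float64(pValue)) * tAdjustedCount
}
```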

## Trade-offs and mitigations

Support for encoding t-value as either a probability or an adjusted
count is meant to give the user control over loss of precision. At
the same time, the encoding remains readable by humans.

Floating point numbers can be encoded exactly to avoid ambiguity, for
example, using hexadecimal floating point representation. Likewise,
adjusted counts can be encoded exactly as integers to convey the
user's intended sampling probability without floating point conversion
loss.
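For example, a minimal Go sketch of exact encoding; whether
hexadecimal t-values would be the recommended form is an assumption
here:

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Render the nearest double to 0.1 exactly in hexadecimal
	// floating point form, e.g. "0x1.999999999999ap-04".
	fmt.Println(strconv.FormatFloat(0.1, 'x', -1, 64))
	// An adjusted count encodes exactly as an integer: "10".
	fmt.Println(strconv.FormatUint(10, 10))
}
```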

## Prior art and alternatives

> **Reviewer comment:** Towards the end, we may want to call out that
> one benefit of the r-value based randomness was that it could be used
> to get consistent sampling across multiple traces (e.g., all traces
> started within a time window by a participant). It would be good to
> call out that it should be possible to support this in the future as
> a complement to the current proposal.
>
> **Reviewer comment:** If we decide to use arbitrary sampling
> probabilities, we should not use the current definition of the
> r-value. It makes no sense to have different discretizations for the
> r-value (powers of two) and for the t-value (56-bit values).
> Therefore, the r-value should rather be a 14-digit hex value that
> overrides the random bits of the TraceID, if present. This way we
> could also handle traces where the random flag is not set in the
> trace context. If the flag is not set and there is also no r-value,
> we could require consistent samplers to set the r-value by generating
> a 56-bit random value.

An earlier draft of this proposal was explored [here](https://github.com/jmacd/opentelemetry-collector-contrib/pull/2925).