Multi-policy sampling #2113

Open · yurishkuro opened this issue Nov 8, 2021 · 5 comments
yurishkuro (Member) commented Nov 8, 2021

The recent OTEPs from @jmacd showed how consistent sampling can be applied across multiple nodes using different sampling rates. I have another use case that is similar, but not the same. Our system supports sampling policies that can express not only the desired sampling rate, but also the desired verbosity; e.g., one policy may request that the trace collect CPU metrics, another may ask for function-level profiling (lots of internal spans). Those policies can sometimes be attached to the same node in the architecture. Right now we're evaluating them independently (with independent coin flips).

This raises two problems:

Problem 1. The adjusted_count or "weight" is not a property of the whole span, but of a specific profiler. E.g. the same span can represent basic metrics like latency/QPS/errors sampled 1-in-10, and also CPU metrics sampled 1-in-100. So the weights for these two sets of data will be 10 and 100 respectively.

Problem 2. How do we use consistent sampling approach and assign the correct weights to each profiler? For example, assume we have two policies active at the same node that activate different profilers with different probabilities:

  • policy A, p=1/10
    • basic profiler
  • policy B, p=1/100
    • basic profiler
    • CPU profiler

The weight for the CPU profiler is unambiguous, w=100, but what about the weight for the basic profiler? Is it w = 1/(0.1 + 0.01 - 0.1*0.01)?
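
For reference, under the current independent-coin-flip evaluation the basic profiler's activation probability follows from inclusion-exclusion. A minimal sketch of that arithmetic (illustrative only, not tied to any SDK):

```python
# Activation probability of the basic profiler when policies A and B
# flip independent coins (inclusion-exclusion): P(A or B) = pA + pB - pA*pB.
p_a = 1 / 10    # policy A (basic profiler)
p_b = 1 / 100   # policy B (basic + CPU profiler)

p_basic = p_a + p_b - p_a * p_b   # probability the basic profiler is active
p_cpu = p_b                       # only policy B activates the CPU profiler

print(f"basic: p={p_basic:.3f}, weight={1 / p_basic:.2f}")  # p=0.109, weight ~9.17
print(f"cpu:   p={p_cpu:.3f}, weight={1 / p_cpu:.2f}")      # p=0.010, weight 100.00
```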

@jmacd @oertl - curious to hear your thoughts

yurishkuro added the spec:trace (Related to the specification/trace directory) label on Nov 8, 2021

oertl commented Nov 9, 2021

If you restrict yourself to power-of-two sampling rates, like p = 1/8 = 1/2**3 for policy A and p = 1/64 = 1/2**6 for policy B, the R-value can be used to decide which policies are active. If we activate policy A whenever R >= 3, it will be active with a probability of 1/8. Similarly, policy B is activated whenever R >= 6 to achieve a probability of 1/64.

The probability that some profiler is active is given by the maximum activation probability of any policy that contains the given profiler. Therefore, the basic profiler is activated with a probability of max(1/8, 1/64) = 1/8 and the CPU profiler is activated with a probability of 1/64. The reciprocals of those probabilities would simply be the corresponding adjusted counts that need to be applied as weights to the basic and CPU data, respectively.
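
A minimal sketch of this scheme, assuming the r-value convention from the probability-sampling OTEPs (P(R >= s) = 2**-s); the policy and profiler names are illustrative:

```python
import random

def draw_r(max_r: int = 62) -> int:
    """Count successive heads of a fair coin, capped at max_r; P(R >= s) = 2**-s."""
    r = 0
    while r < max_r and random.random() < 0.5:
        r += 1
    return r

# Each policy: (name, s, profilers), sampled with probability 2**-s.
POLICIES = [
    ("A", 3, {"basic"}),          # p = 1/8
    ("B", 6, {"basic", "cpu"}),   # p = 1/64
]

r = draw_r()
active = set()
for _, s, profilers in POLICIES:
    if r >= s:                    # policy is active with probability 2**-s
        active |= profilers

# Adjusted count per profiler: reciprocal of the maximum activation
# probability over all policies containing it, i.e. 2**min(s).
adjusted_count = {
    prof: 2 ** min(s for _, s, profs in POLICIES if prof in profs)
    for prof in {"basic", "cpu"}
}
print(r, sorted(active), adjusted_count)  # basic -> 8, cpu -> 64
```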

yurishkuro (Member Author) commented

@oertl can I pick your brain on another somewhat similar scenario? Suppose I want to have different sampling rates per endpoint of a service. Since the endpoint partitions all possible requests into disjoint sets, there's no issue with using w=1/p for each trace, using the respective p assigned to that endpoint. But what if the dimension in question can have multiple values for a given trace, e.g. a user-flow-id passed down from the app? I can assign different probabilities to each value of user-flow-id, but when a trace arrives with a set of those values, how do I compute the weight? I initially thought it could be handled similarly to your example above with multiple policies (i.e. w=1/(max p_i)), but that doesn't seem to be correct if I later want to extrapolate some metric (e.g. request count) using the weight for just one of the values of that dimension. The only correct way I can think of is to consider the space partitioning by all possible combinations of values of this dimension, which would drastically increase the number of independent probabilities to keep track of.


oertl commented Feb 8, 2022

@yurishkuro If I understand correctly, you are thinking of a case where the individual spans of a trace do not have the same sampling probability. In this case, the trace may be incomplete (or partially sampled), and a weight for the whole trace cannot simply be determined. The weighting of the trace ultimately depends on the quantity of interest extracted from the trace. This is exactly what my paper is about: https://arxiv.org/abs/2107.07703. The nice thing about consistent sampling is that it avoids the combinatorial explosion you mentioned.

yurishkuro (Member Author) commented Feb 8, 2022

@oertl sorry, I didn't explain it well. I was still thinking of traditional sampling that happens in one place (at the root) and is then propagated with the trace, so it is consistent. But I want the probability of that decision to be based on attributes of the trace. The classic example of this is different sampling rates by endpoint, or by user type: free user vs. premium user. Basically, when the values of the attribute are mutually exclusive, there is no issue. But consider a user-flow-id attribute; say it can take the values catalog, checkout, and prime (like Amazon Prime). The attribute itself is a set<string>, and checkout can appear standalone or in combination with prime. The users of tracing want to set only 3 probabilities, one for each flow, not one for every possible combination of flows. The system could take the max probability when a combination of flows is present in the trace, but I don't think the math would then work out at query time, e.g. select sum(weight) from trace_roots where user_flow=X would not give an accurate extrapolated count of requests containing flow X.


oertl commented Feb 8, 2022

@yurishkuro I do not see a problem with taking the max probability. As long as each trace is finally weighted by the inverse of its chosen sampling probability, the estimate should be unbiased.

Here is an example for clarification:

Assume sampling probabilities p_{catalog}=1/2, p_{checkout}=1/4, and p_{prime}=1/8, and that the following traces with corresponding attributes have been sampled:

  • Trace_2 {catalog, checkout} was sampled with probability 1/2 (=max(p_{catalog}, p_{checkout}))
  • Trace_5 {prime, checkout} was sampled with probability 1/4
  • Trace_12 {prime} was sampled with probability 1/8
  • Trace_15 {catalog, prime} was sampled with probability 1/2
  • Trace_18 {checkout} was sampled with probability 1/4

The estimate of the number of traces containing checkout would be 2 + 4 + 4 = 10
The estimate of the number of traces containing prime would be 4 + 8 + 2 = 14
The estimate of the number of traces containing both prime and checkout would be 4
The estimate of the number of traces containing prime, catalog, and checkout would be 0
...
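
A short sketch reproducing the arithmetic above, weighting each sampled trace by the reciprocal of the maximum attribute probability (illustrative only):

```python
# Max-probability weighting: each sampled trace gets weight 1 / max(p_i)
# over its attribute values; estimates sum those weights.
p = {"catalog": 1 / 2, "checkout": 1 / 4, "prime": 1 / 8}

sampled = {
    "Trace_2":  {"catalog", "checkout"},
    "Trace_5":  {"prime", "checkout"},
    "Trace_12": {"prime"},
    "Trace_15": {"catalog", "prime"},
    "Trace_18": {"checkout"},
}

def weight(attrs):
    return 1 / max(p[a] for a in attrs)

def estimate(required):
    """Estimated number of traces whose attributes include all of `required`."""
    return sum(weight(attrs) for attrs in sampled.values() if required <= attrs)

print(estimate({"checkout"}))                      # 2 + 4 + 4 = 10
print(estimate({"prime"}))                         # 4 + 8 + 2 = 14
print(estimate({"prime", "checkout"}))             # 4
print(estimate({"prime", "catalog", "checkout"}))  # 0
```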
