-
Notifications
You must be signed in to change notification settings - Fork 887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Randomness requirements following W3C Trace Context level 2 #4162
base: main
Are you sure you want to change the base?
Conversation
fa9ec4c
to
71750cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See nit comment about text formatting, for diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
…ication into jmacd/sampling_new
Thanks Co-authored-by: Robert Pająk <pellared@hotmail.com> Co-authored-by: Kent Quirk <kentquirk@gmail.com>
…pecification into jmacd/sampling_new
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
…ication into jmacd/sampling_new
@open-telemetry/specs-trace-approvers @open-telemetry/specs-approvers @open-telemetry/technical-committee this PR has reached consensus in the Sampling SIG, we have multiple prototypes implemented, and we are looking for final approvals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Sampling Sig seems to agree that this is ready now.
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
…ication into jmacd/sampling_new
@open-telemetry/specs-trace-approvers @open-telemetry/specs-approvers @open-telemetry/technical-committee this PR has reached consensus in the Sampling SIG, we have multiple prototypes implemented, and we are looking for final approvals. |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. Seems to work around the missing traceflag propagation nicely.
One question I have for you that I would take back to the w3c: with this in place, is the randomness trace flag actually providing any value? Seems like tracestate is being set all the time anyway, so the flag isn't serving the original purpose it was meant to serve.
@@ -355,6 +356,17 @@ Additional `Propagator`s implementing vendor-specific protocols such as AWS | |||
X-Ray trace header protocol MUST NOT be maintained or distributed as part of | |||
the Core OpenTelemetry repositories. | |||
|
|||
### W3C Trace Context Requirements | |||
|
|||
A W3C Trace Context propagator MUST parse and set the `traceparent` and `tracestate` HTTP headers as specified in [W3C Trace Context Level 2](https://www.w3.org/TR/trace-context-2/). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to imply we always want to set tracestate regardless of the sampling strategy used. Is that the intention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised in 56d8e26. I want to say that both headers are validated, and that tracestate only propagates when it is not empty. Note however, that the larger proposal is for AlwaysOn samplers to set ot=th:0
on all spans that have 100% sampling; samplers that do not set a tracestate shouldn't cause tracestate to propatate.
specification/trace/api.md
Outdated
- [Sampled](https://www.w3.org/TR/trace-context-2/#sampled-flag) | ||
- [Random](https://www.w3.org/TR/trace-context-2/#random-trace-id-flag) | ||
|
||
`TraceState` carries vendor-specific trace identification data, represented as a list of key-value pairs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we (the w3c) are trying to move away from "vendor" terminology to "tracing system" or similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -466,6 +482,51 @@ The following configuration properties should be available when creating the sam | |||
[jaeger-remote-sampling-api]: https://www.jaegertracing.io/docs/1.41/apis/#remote-sampling-configuration-stable | |||
[jaeger-adaptive-sampling]: https://www.jaegertracing.io/docs/1.41/sampling/#adaptive-sampling | |||
|
|||
### Sampling Requirements | |||
|
|||
The [W3C Trace Context Level 2][W3CCONTEXTMAIN] Candidate Recommendation includes [a Random trace flag][W3CCONTEXTRANDOMFLAG] for indicating that the TraceID contains 56 random bits, specified for statistical purposes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone raised to my attention that 56 bits is actually a problem when using 64 bit floating points as the max safe integer is 2^53 - 1
. I've been meaning to raise this with the w3c. It's not really an issue here because you're following the CR correctly, but I thought it was worth pointing out. Does this SIG have a take on that particular issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you raised this with the w3c yet? Is this an issue because Javascript uses floats for all numbers? Seems like a pretty important patch to be made so that we are all consistent. I had recently been trying to track down why in Erlang we were using 2^63 - 1
, which was what some others were doing, but some were doing 2^64
.
It'd be great to not only be consistent across languages but to document why a value is chosen so we can answer these questions.
And Javscript is a pretty big user base :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dyladan Is this issue specific to JavaScript? If so, for such use cases (that need to look at the last 56 bits of the traceid as an integer), couldn't something like BigInt be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should not be a problem. I assume the JS libraries have some way to create the randomness required for TraceIDs. It is possible to treat the random strings used for sampling as either bytes or hexadecimal strings and use lexicographical comparison as opposed to numerical comparison.
In an older prototype of the OTel Collector Contrib Sampling package I made a side-by-side comparison of the numerical vs lexicographical method, and IMO the numerical methods are a bit easier to read and performed a little better, so I kept the numerical form and removed the lexicographical implementation. I am having trouble finding that implementation, but there are comments left in pkg/sampling/encoding_test.go
with a bit of history:
// There were two benchmarks used to choose the implementation for the
// Threshold type in this package. The results indicate that it is
// faster to compare a 56-bit number than to compare as 7 element
// []byte.
// The current implementation, using unsigned:
//
// BenchmarkThresholdCompareAsUint64-10 1000000000 0.4515 ns/op 0 B/op 0 allocs/op
//
// vs the tested and rejected, using bytes:
//
// BenchmarkThresholdCompareAsBytes-10 528679580 2.288 ns/op 0 B/op 0 allocs/op
The ability to perform lexicographical comparison is the reason the reviewers have insisted that we specify lower-case hexadecimal for the threshold and randomness fields (the same is true in W3C Trace Context, traceparent
requires lowercase hex too)-- you can perform use 100% string manipulations and comparison to implement this sampling logic.
specification/trace/sdk.md
Outdated
This flag indicates that [the least-significant ("rightmost") 7 bytes or 56 bits of the TraceID are random][W3CCONTEXTTRACEID]. | ||
|
||
Note the Random flag does not propagate through [Trace Context Level 1][W3CCONTEXTLEVEL1] implementations, which do not recognize the flag. | ||
Therefore, this flag is considered meaningful only when it is set on a root span context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flag should be considered meaningful any time it is received. Level 1 will only set it to 0
so if you see 1
, you know the sender is Level 2+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced this sentence with "When this flag is 1, it is considered meaningful. When this flag is 0, it may be due to a non-random TraceID or because a Trace Context Level 1 propagator was used.".
@@ -316,6 +324,14 @@ When asked to create a Span, the SDK MUST act as if doing the following in order | |||
`Span` is created without an SDK installed or as described in | |||
[wrapping a SpanContext in a Span](api.md#wrapping-a-spancontext-in-a-span). | |||
|
|||
#### Span flags | |||
|
|||
The OTLP representation for Span and Span Link include a 32-bit field declared as Span Flags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "includes"?
@@ -316,6 +324,14 @@ When asked to create a Span, the SDK MUST act as if doing the following in order | |||
`Span` is created without an SDK installed or as described in | |||
[wrapping a SpanContext in a Span](api.md#wrapping-a-spancontext-in-a-span). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For list item (2):
(Note that the [built-in `ParentBasedSampler`](#parentbased) can be used to use the sampling decision of the parent, translating a set SampledFlag to RECORD and an unset one to DROP)
Does the built-in ParentBasedSampler
, ..., translating a set SampledFlag to RECORD or RECORD_AND_SAMPLE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM, only a question about being more explicit in how SDKs can accept the randomness values set by API users.
@@ -87,6 +87,8 @@ formats is required. Implementing more than one format is optional. | |||
| [Built-in `SpanProcessor`s implement `ForceFlush` spec](specification/trace/sdk.md#forceflush-1) | | | + | | + | + | + | + | + | + | + | | | |||
| [Attribute Limits](specification/common/README.md#attribute-limits) | X | | + | | + | + | + | + | | | | | | |||
| Fetch InstrumentationScope from ReadableSpan | | | + | | + | | | + | | | | | | |||
| TraceID generator implements W3C Trace Context Level 2 randomness | | | | | | | | | | | | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be a couple of implementations before this is marked as required?
The current version of the specification supports two flags: | ||
|
||
- [Sampled](https://www.w3.org/TR/trace-context-2/#sampled-flag) | ||
- [Random](https://www.w3.org/TR/trace-context-2/#random-trace-id-flag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if this has been asked before, but are we comfortable recommending the draft? Or are we merging this only once it's marked as recommended?
|
||
#### TraceID randomness | ||
|
||
For root span contexts, the SDK SHOULD implement the TraceID randomness requirements of the [W3C Trace Context Level 2][W3CCONTEXTTRACEID] Candidate Recommendation when generating TraceID values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this answers my previous question on whether we are OK using the CR :-)
|
||
#### Random trace flag | ||
|
||
For root span contexts, the SDK SHOULD set the `Random` flag in the trace flags when it generates TraceIDs that meet the [W3C Trace Context Level 2 randomness requirements][W3CCONTEXTTRACEID]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of any, but is there a disadvantage of using level 2 vs. level 1? Would I ever want to use level 1 when I'm aware of level 2?
|
||
#### User-defined explicit trace randomness | ||
|
||
Trace SDKs MAY permit users to setup explicit randomness by entering it into the [`rv` sub-key of the OpenTelemetry TraceState][OTELRVALUE] of the context before creating a root span. This lets users have consistent sampling across traces. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this spec define how this is accomplished, so that it's consistent across SDKs?
hexdigit = DIGIT ; a-f | ||
``` | ||
|
||
The explicit randomness value is meant to be used instead of extracting randomness from TraceIDs, therefore it contains the same number of bits as a W3C Trace Context Level 2 recommends for TraceIDs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not belong to this doc, and I think I asked this already on another PR, but why would I do this? What are the use-cases behind this? I understand this is so that end-users can control the decisions, but do you have a specific use-case in mind?
Changes
Updates Trace SDK and Propagator specifications with
Part of #1413.
Part of #3602.
Product of the Sampling SIG members @kentquirk @kalyanaj @oertl @PeterF778 and myself.
CHANGELOG.md
spec-compliance-matrix.md