Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Randomness requirements following W3C Trace Context level 2 #4162

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Jul 25, 2024

Changes

Updates Trace SDK and Propagator specifications with

  • List W3C propagator requirements (e.g., should propagate tracestate)
  • Introduce W3C Trace Context Level 2 w/ the Random flag
  • Define explicit randomness feature from OTEP 235, OTEP 261
  • Trace SDK default ID generator should include 56 bits of randomness in the correct location
  • Trace SDK for root spans: either set the random flag to confirm the above, or use an explicit randomness value.

Part of #1413.
Part of #3602.

Product of the Sampling SIG members @kentquirk @kalyanaj @oertl @PeterF778 and myself.

@jmacd jmacd changed the title Draft rules for span context in Trace Context level 2 Randomness requirements for W3C Trace Context level 2 Jul 26, 2024
@jmacd jmacd changed the title Randomness requirements for W3C Trace Context level 2 Randomness requirements following W3C Trace Context level 2 Jul 26, 2024
@jmacd jmacd marked this pull request as ready for review July 29, 2024 23:24
@jmacd jmacd requested review from a team July 29, 2024 23:24
Copy link
Member

@marcalff marcalff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See nit comment about text formatting, for diff

specification/trace/sdk.md Outdated Show resolved Hide resolved
Copy link
Member

@marcalff marcalff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/context/api-propagators.md Outdated Show resolved Hide resolved
specification/trace/tracestate-handling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-handling.md Show resolved Hide resolved
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 15, 2024
@jmacd jmacd removed the Stale label Aug 15, 2024
@jmacd
Copy link
Contributor Author

jmacd commented Aug 15, 2024

@open-telemetry/specs-trace-approvers @open-telemetry/specs-approvers @open-telemetry/technical-committee this PR has reached consensus in the Sampling SIG, we have multiple prototypes implemented, and we are looking for final approvals.

Copy link
Member

@kentquirk kentquirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Sampling Sig seems to agree that this is ready now.

Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 28, 2024
@jmacd jmacd removed the Stale label Aug 29, 2024
@jmacd
Copy link
Contributor Author

jmacd commented Aug 29, 2024

@open-telemetry/specs-trace-approvers @open-telemetry/specs-approvers @open-telemetry/technical-committee this PR has reached consensus in the Sampling SIG, we have multiple prototypes implemented, and we are looking for final approvals.

Copy link

github-actions bot commented Sep 6, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 6, 2024
@jmacd jmacd removed the Stale label Sep 6, 2024
Copy link
Member

@dyladan dyladan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Seems to work around the missing traceflag propagation nicely.

One question I have for you that I would take back to the w3c: with this in place, is the randomness trace flag actually providing any value? Seems like tracestate is being set all the time anyway, so the flag isn't serving the original purpose it was meant to serve.

@@ -355,6 +356,17 @@ Additional `Propagator`s implementing vendor-specific protocols such as AWS
X-Ray trace header protocol MUST NOT be maintained or distributed as part of
the Core OpenTelemetry repositories.

### W3C Trace Context Requirements

A W3C Trace Context propagator MUST parse and set the `traceparent` and `tracestate` HTTP headers as specified in [W3C Trace Context Level 2](https://www.w3.org/TR/trace-context-2/).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to imply we always want to set tracestate regardless of the sampling strategy used. Is that the intention?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised in 56d8e26. I want to say that both headers are validated, and that tracestate only propagates when it is not empty. Note however, that the larger proposal is for AlwaysOn samplers to set ot=th:0 on all spans that have 100% sampling; samplers that do not set a tracestate shouldn't cause tracestate to propatate.

- [Sampled](https://www.w3.org/TR/trace-context-2/#sampled-flag)
- [Random](https://www.w3.org/TR/trace-context-2/#random-trace-id-flag)

`TraceState` carries vendor-specific trace identification data, represented as a list of key-value pairs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we (the w3c) are trying to move away from "vendor" terminology to "tracing system" or similar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -466,6 +482,51 @@ The following configuration properties should be available when creating the sam
[jaeger-remote-sampling-api]: https://www.jaegertracing.io/docs/1.41/apis/#remote-sampling-configuration-stable
[jaeger-adaptive-sampling]: https://www.jaegertracing.io/docs/1.41/sampling/#adaptive-sampling

### Sampling Requirements

The [W3C Trace Context Level 2][W3CCONTEXTMAIN] Candidate Recommendation includes [a Random trace flag][W3CCONTEXTRANDOMFLAG] for indicating that the TraceID contains 56 random bits, specified for statistical purposes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone raised to my attention that 56 bits is actually a problem when using 64 bit floating points as the max safe integer is 2^53 - 1. I've been meaning to raise this with the w3c. It's not really an issue here because you're following the CR correctly, but I thought it was worth pointing out. Does this SIG have a take on that particular issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you raised this with the w3c yet? Is this an issue because Javascript uses floats for all numbers? Seems like a pretty important patch to be made so that we are all consistent. I had recently been trying to track down why in Erlang we were using 2^63 - 1, which was what some others were doing, but some were doing 2^64.

It'd be great to not only be consistent across languages but to document why a value is chosen so we can answer these questions.

And Javscript is a pretty big user base :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dyladan Is this issue specific to JavaScript? If so, for such use cases (that need to look at the last 56 bits of the traceid as an integer), couldn't something like BigInt be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should not be a problem. I assume the JS libraries have some way to create the randomness required for TraceIDs. It is possible to treat the random strings used for sampling as either bytes or hexadecimal strings and use lexicographical comparison as opposed to numerical comparison.

In an older prototype of the OTel Collector Contrib Sampling package I made a side-by-side comparison of the numerical vs lexicographical method, and IMO the numerical methods are a bit easier to read and performed a little better, so I kept the numerical form and removed the lexicographical implementation. I am having trouble finding that implementation, but there are comments left in pkg/sampling/encoding_test.go with a bit of history:

// There were two benchmarks used to choose the implementation for the
// Threshold type in this package.  The results indicate that it is
// faster to compare a 56-bit number than to compare as 7 element
// []byte.

// The current implementation, using unsigned:
//
//	BenchmarkThresholdCompareAsUint64-10    	1000000000	         0.4515 ns/op	       0 B/op	       0 allocs/op
//
// vs the tested and rejected, using bytes:
//
//	BenchmarkThresholdCompareAsBytes-10     	528679580	         2.288 ns/op	       0 B/op	       0 allocs/op

The ability to perform lexicographical comparison is the reason the reviewers have insisted that we specify lower-case hexadecimal for the threshold and randomness fields (the same is true in W3C Trace Context, traceparent requires lowercase hex too)-- you can perform use 100% string manipulations and comparison to implement this sampling logic.

This flag indicates that [the least-significant ("rightmost") 7 bytes or 56 bits of the TraceID are random][W3CCONTEXTTRACEID].

Note the Random flag does not propagate through [Trace Context Level 1][W3CCONTEXTLEVEL1] implementations, which do not recognize the flag.
Therefore, this flag is considered meaningful only when it is set on a root span context.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag should be considered meaningful any time it is received. Level 1 will only set it to 0 so if you see 1, you know the sender is Level 2+

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced this sentence with "When this flag is 1, it is considered meaningful. When this flag is 0, it may be due to a non-random TraceID or because a Trace Context Level 1 propagator was used.".

@jpkrohling jpkrohling self-requested a review September 18, 2024 07:50
@@ -316,6 +324,14 @@ When asked to create a Span, the SDK MUST act as if doing the following in order
`Span` is created without an SDK installed or as described in
[wrapping a SpanContext in a Span](api.md#wrapping-a-spancontext-in-a-span).

#### Span flags

The OTLP representation for Span and Span Link include a 32-bit field declared as Span Flags.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "includes"?

@@ -316,6 +324,14 @@ When asked to create a Span, the SDK MUST act as if doing the following in order
`Span` is created without an SDK installed or as described in
[wrapping a SpanContext in a Span](api.md#wrapping-a-spancontext-in-a-span).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For list item (2):
(Note that the [built-in `ParentBasedSampler`](#parentbased) can be used to use the sampling decision of the parent, translating a set SampledFlag to RECORD and an unset one to DROP)

Does the built-in ParentBasedSampler, ..., translating a set SampledFlag to RECORD or RECORD_AND_SAMPLE?

Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM, only a question about being more explicit in how SDKs can accept the randomness values set by API users.

@@ -87,6 +87,8 @@ formats is required. Implementing more than one format is optional.
| [Built-in `SpanProcessor`s implement `ForceFlush` spec](specification/trace/sdk.md#forceflush-1) | | | + | | + | + | + | + | + | + | + | |
| [Attribute Limits](specification/common/README.md#attribute-limits) | X | | + | | + | + | + | + | | | | |
| Fetch InstrumentationScope from ReadableSpan | | | + | | + | | | + | | | | |
| TraceID generator implements W3C Trace Context Level 2 randomness | | | | | | | | | | | | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be a couple of implementations before this is marked as required?

The current version of the specification supports two flags:

- [Sampled](https://www.w3.org/TR/trace-context-2/#sampled-flag)
- [Random](https://www.w3.org/TR/trace-context-2/#random-trace-id-flag)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if this has been asked before, but are we comfortable recommending the draft? Or are we merging this only once it's marked as recommended?


#### TraceID randomness

For root span contexts, the SDK SHOULD implement the TraceID randomness requirements of the [W3C Trace Context Level 2][W3CCONTEXTTRACEID] Candidate Recommendation when generating TraceID values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this answers my previous question on whether we are OK using the CR :-)


#### Random trace flag

For root span contexts, the SDK SHOULD set the `Random` flag in the trace flags when it generates TraceIDs that meet the [W3C Trace Context Level 2 randomness requirements][W3CCONTEXTTRACEID].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of any, but is there a disadvantage of using level 2 vs. level 1? Would I ever want to use level 1 when I'm aware of level 2?


#### User-defined explicit trace randomness

Trace SDKs MAY permit users to setup explicit randomness by entering it into the [`rv` sub-key of the OpenTelemetry TraceState][OTELRVALUE] of the context before creating a root span. This lets users have consistent sampling across traces.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this spec define how this is accomplished, so that it's consistent across SDKs?

hexdigit = DIGIT ; a-f
```

The explicit randomness value is meant to be used instead of extracting randomness from TraceIDs, therefore it contains the same number of bits as a W3C Trace Context Level 2 recommends for TraceIDs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not belong to this doc, and I think I asked this already on another PR, but why would I do this? What are the use-cases behind this? I understand this is so that end-users can control the decisions, but do you have a specific use-case in mind?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.