Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry Extension not reliably adding user id and roles to span attributes #39563

Closed
vonatzigenc opened this issue Mar 19, 2024 · 24 comments · Fixed by #40466
Closed

OpenTelemetry Extension not reliably adding user id and roles to span attributes #39563

vonatzigenc opened this issue Mar 19, 2024 · 24 comments · Fixed by #40466
Assignees
Labels
area/tracing kind/bug Something isn't working
Milestone

Comments

@vonatzigenc
Copy link
Contributor

vonatzigenc commented Mar 19, 2024

Describe the bug

With the OpenTelemetry Extension, the property quarkus.otel.traces.eusp.enabled=true can be used to add the user id and the roles to the span attributes. (Based on Guide - Using OpenTelemetry - User data)

However, this is not reliable and there are spans where the attributes are missing.
I have found two different types of faulty behaviour:

  1. the attributes are set to an already closed span
  2. a ContextNotActiveException is thrown and logged.

In both cases, the attributes are missing on the exported span.

Since Quarkus 3.7.2 the second case occurs more often. (Most likely because of the change #38605)
With 3.7.1 there was no exception, but the attributes are written to the closed span.

The user data is read from the SecurityIdentity in io.quarkus.opentelemetry.runtime.exporter.otlp.EndUserSpanProcessor and set to the span.
The onStart method must non-blockin, which is why access to SecurityIdentity was implemented via a ManagedExecutor. See #34595 (comment) (Access to attributes of SecurityIdentity properties is blocking. See io.quarkus.security.runtime.SecurityIdentityProxy)
However, this can lead to the processing taking place too late.

Expected behavior

  • If a user exists, the two attributes are always set on all spans.
  • ContextNotActiveException does not occur

Actual behavior

With "slightly higher load" the attributes are partially missing

How to Reproduce?

Reproducer project: https://github.com/vonatzigenc/quarkus-otel-eusp-reproducer/tree/main

In reproducer:

  1. Execute test GreetingResourceTest
  2. Check in the logs:
    a) there are spans without the attributes (Is logged using a simple SpanExporter)
    b) (sometimes) ContextNotActiveException (Stacktrace is logged)
    c) (sometimes) an attempt was made to write attributes to closed spans (Log statement FINE [io.ope.sdk.tra.SdkSpan] (executor-thread-5) Calling setAttribute() on an ended Span.)

In general:

  1. Execute multiple calls of rest methods with an identity
  2. Check whether the attributes are set in all spans

Output of uname -a or ver

alpine 3.19.1

Output of java -version

openjdk version "21.0.1.0.101" 2024-01-16 LTS

Quarkus version or git rev

Tested with 3.8.2

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.5

Additional information

I tried to fix the error in the reproducer project.
However, I'm not sure if using CurrentIdentityAssociation instead of SecurityIdentity is legitimate.
(And whether the non-blocking variant is correct).
CurrentIdentityAssociationEndUserSpanProcessor.java

securityIdentityAssociation.getDeferredIdentity()

securityIdentityAssociation.getDeferredIdentity().subscribe().with(identity -> {
    span.setAllAttributes(
            identity.isAnonymous()
                    ? Attributes.empty()
                    : Attributes.of(
                    SemanticAttributes.ENDUSER_ID, identity.getPrincipal().getName(),
                    SemanticAttributes.ENDUSER_ROLE, identity.getRoles().toString()
            )
    );
});

instead of

managedExecutor.execute(
        () -> span.setAllAttributes(
                securityIdentity.isAnonymous()
                        ? Attributes.empty()
                        : Attributes.of(
                        SemanticAttributes.ENDUSER_ID, securityIdentity.getPrincipal().getName(),
                        SemanticAttributes.ENDUSER_ROLE, securityIdentity.getRoles().toString())
        )
);

If helpful I can create a PR with the customisation.

@vonatzigenc vonatzigenc added the kind/bug Something isn't working label Mar 19, 2024
@quarkus-bot
Copy link

quarkus-bot bot commented Mar 19, 2024

/cc @brunobat (opentelemetry), @radcortez (opentelemetry)

@brunobat
Copy link
Contributor

Thanks @vonatzigenc. This is a very complete report.
@michalvavrik, @sberyozkin What do you guys think about the suggested fix?

@michalvavrik
Copy link
Member

Hey,

I'm not sure if using CurrentIdentityAssociation instead of SecurityIdentity is legitimate.

In general, it is. Problem is that it will only solve your problems if the identity is already set there, which means that the span must be created after the Quarkus REST and similar started processing. I tried to lookup all cases where span can be created and I think it won't be true for every scenario. Bruno can prove me wrong....

ContextNotActiveException does not occur

Please note there has been changes around context activation in last released and I can't tell from top of my head in which micro it was fixed, but IMHO: current implementation activates CDI request context and access bean inside async method, which sometimes will be already after deactivation. It is wrong to expect there that the context will be active.

If helpful I can create a PR with the customisation.

I think your impl. is better, I don't like that it is unbounded to the request (impossible to solve inside SpanProcessor though), which means (in addition to resources) it will occasionally happen after the span ended? Also I don't believe it fixes all the issues.

However, this is not reliable and there are spans where the attributes are missing. I have found two different types of faulty behaviour

The feature can't be reliable written like this because it basically says whenever Span got created we will try to authenticate. But if @ActivateRequestContext actually activates the context, then it means that both CurrentIdentityAssociation and SecurityIdentity beans will return anonymous identity. No point of even trying to authenticate in such moment.

I can't tell how to solve this without digging more into how and when Spans can be created (all cases). Maybe quick fix could be:

  1. drop ctx activation
  2. use SpanProcessor#onEnd and just synchronously look if ctx active then get user from RoutingContext because if it is not there, it means the user wasn't authenticated.
  3. determine what to do if request wasn't authenticated, if it is desirable to authenticate just for sake of the Span information, then it might come down to what you are doing with subscribe to deferred identity, but it will be flaky

Proper fix should be to determine - if quarkus.otel.traces.eusp.enabled=true is set, it basically means proactive authenticate. In which case you can access the identity from the RoutingContext and add it to the span in cases when the RoutingContext is available (which on end should be in most if not all cases, not in gRPC etc. and there will be exceptions we need to handle). This way, you can avoid async stuff. If proative authentication enablement is not desirable, just add HTTP Security policy that authenticates and does nothing with it. This way, it will be accessible synchronously.

@vonatzigenc if you want to dig into this, please go ahead. Only thing is that I can't provide you more help without actually getting more familiar with the Quarkus OpenTelemetry. Hence my suggestion can't be reliable until I actually work on it.

@michalvavrik
Copy link
Member

Otherwise I can have a look in next weeks.

@michalvavrik
Copy link
Member

FYI I'd expect https://quarkus.io/version/main/guides/opentelemetry#quarkus-security-events to be very reliable if that helps to your use case.

@vonatzigenc
Copy link
Contributor Author

Thanks for the quick feedback and analysis.

if you want to dig into this, please go ahead.

I would also have to dig much deeper. (In routes and otel)

FYI I'd expect https://quarkus.io/version/main/guides/opentelemetry#quarkus-security-events to be very reliable if that helps to your use case.

The variant via the SecurityEvents looks promising.

The following code has already worked in a first test. (But I haven't looked at it more deeply yet)

void observeAuthenticationSuccess(@ObservesAsync AuthenticationSuccessEvent event) {
    // Access to span is based on io.quarkus.opentelemetry.runtime.tracing.security.SecurityEventUtil.addEvent(java.lang.String, io.opentelemetry.api.common.Attributes)
    Span span = Arc.container().select(Span.class).get();
    SecurityIdentity identity = event.getSecurityIdentity();
    span.setAllAttributes(
            identity.isAnonymous()
                    ? Attributes.empty()
                    : Attributes.of(
                    SemanticAttributes.ENDUSER_ID,
                    identity.getPrincipal().getName(),
                    SemanticAttributes.ENDUSER_ROLE,
                    identity.getRoles().toString()));
}

Would the idea be that this would be adapted in the extension? (i.e. as a replacement for the current EndUserSpanProcessor?
I.e. an enhancement for io.quarkus.opentelemetry.runtime.tracing.security.SecurityEventUtil or analogous to io.quarkus.opentelemetry.deployment.tracing.TracerProcessor#registerSecurityEventObserver?

@michalvavrik
Copy link
Member

I would also have to dig much deeper. (In routes and otel)

you can't rely on the Routingcontext, just one variant can when the Vert.x HTTP ext is present.

Would the idea be that this would be adapted in the extension? (i.e. as a replacement for the current EndUserSpanProcessor?

Not alone.

  • It is specific for Vert.x HTTP based extensions and it doesn't try to authenticate when the authentication is not required.
  • That is change to the current behavior that always try to authenticate (not authorize - allows anonymous identity)
  • AFAICT SpanProcessor adds identity to every created span, while that observer would only add the identity to current span

I.e. an enhancement for io.quarkus.opentelemetry.runtime.tracing.security.SecurityEventUtil or analogous to io.quarkus.opentelemetry.deployment.tracing.TracerProcessor#registerSecurityEventObserver?

No. It needs to be more complex. I can't tell without working on this and I can't tell all the caveats. I can do a review if you eventually create a PR. One thing I can tell is that current test coverage does not test important scenarios: with/without RBAC, with/without HTTP Perms, proactive enable/disabled, combinations of previous.

@geoand
Copy link
Contributor

geoand commented Mar 20, 2024

Looking at what we currently do for quarkus.otel.traces.eusp.enabled=true via EndUserSpanProcessor is just plain wrong - @michalvavrik has covered some of these points but I want to emphasize a few points:

  • Activating the request context (via @ActivateRequestContext) should never be done in this context.
  • The is a fundamental race condition where the the span attributes are set in an async fashion - this can never work correctly as the SpanProcessor#onStart will return immediately without the attributes being set when the API clearly expects to be set in a synchronous fashion

The way I see it, the proper solution is the following:

The span attributes should only be set in a synchronous fashion and that in turn means that these security attributes will only be set when the security information is present and can be determined in a synchronous fashion.

@michalvavrik
Copy link
Member

michalvavrik commented Mar 20, 2024

I agree with @geoand

The span attributes should only be set in a synchronous fashion and that in turn means that these security attributes will only be set when the security information is present and can be determined in a synchronous fashion.

Well, that's the point I wasn't sure about. The conclusion is fine if we document that to make this feature reliably reporting user if credentials are available, they will need proactive authentication enabled (because that's what is happening when lazy proactive auth is in place and SecurityIdentity proxy is used). That, together with required changes should do a trick for Quarkus REST, RESTEasy Classic and gRPC (on a same server).

FYI I also don't think this is good first issue @vonatzigenc. I am sure @brunobat and @sberyozkin would be happy to suggest many issues where you could help if you are interested in contributing. Just ping them if that is a case.

@geoand
Copy link
Contributor

geoand commented Mar 20, 2024

FYI I also don't think this is good first issue @vonatzigenc

It absolutely is not a good first issue :)

@brunobat
Copy link
Contributor

brunobat commented Mar 20, 2024

We need 2 different actions here.

  1. Mitigation. We need to catch that exception and make sure the span "isRecording()" before we try to set attributes. In case of exception, or not recording I think we should just log in debug mode. @vonatzigenc this would be a good first issue for you.
  2. Actual fix. It should be sync, however this can be a can of worms and make the feature very complex and intrusive. @michalvavrik do we really want proactive authentication once this property is enabled? What would be the side effects for the user?

On a side note, we might have a similar problem on the security events collection. We need to make sure the span isRecording before we try to add events to the span. In that case we should also log in debug.

@michalvavrik
Copy link
Member

michalvavrik commented Mar 20, 2024

@michalvavrik do we really want proactive authentication once this property is enabled? What would be the side effects for the user?

Probably no, I adjusted my expectations based on Georgios comment, we just need to document expectations. However as I mentioned, please note that when you inject the identity and call isAnonymous, it is authenticating and in case this is disabled proactive auth, then it is actual authentication (not handling anonymous identity). Which means this differs to what Georgios suggested (add the user info when it is already authenticated).

My suggestion is that I need to dig into this to understand it properly. I'll provide a fix within a week.

@michalvavrik michalvavrik self-assigned this Mar 20, 2024
@michalvavrik
Copy link
Member

On a side note, we might have a similar problem on the security events collection. We need to make sure the span isRecording before we try to add events to the span. In that case we should also log in debug.

I didn't know about isRecording, there is check isValid before the events are added there, but I need to get familiar with it and check. Thanks for a note.

@geoand
Copy link
Contributor

geoand commented Mar 20, 2024

ΙΜΗΟ, if we can't provide a proper fix along the lines described, we should remove the feature altogether as it's very broken currently.

@michalvavrik
Copy link
Member

ΙΜΗΟ, if we can't provide a proper fix along the lines described, we should remove the feature altogether as it's very broken currently.

+1

@brunobat
Copy link
Contributor

Even with a proper fix it's very likely that we will not be able to provide the authentication data in all cases, for all execution flows... And this is fine.
It's better to have this information in some spans than not having it at all.
We need to adjust perception and documentation accordingly, by providing a clear set of flows where authentication data will be provided. Saying it's best effort and set to "could not resolve" could be a solution.
Would this be acceptable?

Other option we can explore is trying to extract that inf from the security events, when available.

@michalvavrik
Copy link
Member

We need to adjust perception and documentation accordingly, by providing a clear set of flows where authentication data will be provided. Saying it's best effort and set to "could not resolve" could be a solution.
Would this be acceptable?

sounds to me, thanks

Other option we can explore is trying to extract that inf from the security events, when available.

This feature can be reliable, we just need to rewrite it and document which setup provides information always and which only on documented scenarios. I don't think we can extract it from security events because of context propagation would be expensive (using duplicated context local data) or because of cardinality (add it to every created span vs current span when authentication happened).

@geoand
Copy link
Contributor

geoand commented Mar 20, 2024

It's better to have this information in some spans than not having it at all.

This is your domain, so you know best. I'm just saying that this sounds super weird to me and I would not have opted to include such behavior in an analogous situation in some of the extensions I maintain.

@brunobat
Copy link
Contributor

Slept ove it... This will cause much frustration and questions, like it is now.
Will remove this feature later today.

@michalvavrik michalvavrik removed their assignment Mar 21, 2024
@michalvavrik
Copy link
Member

I've removed my assignment and will let it to you guys, but personally I don't think you should make conclusions without thorough analysis. Maybe you did it, I don't know.

I don't know until I can study Quarkus OTEL internals as I said whether it can be done for sure. My plan was to leverage fact that Vert.x HTTP route event is handled on duplicated context (mainstream scenario) so OTEL context storage is also stored on duplicated context. before HTTP authenticator starts to authenticate when proactive authentication is enabled, the OTEL context is already stored on the Vert.x duplicated context. I planned to investigate if when the identity is already in the RoutingContext, we can atttach the RoutingContexct to the OTEL context and only use the latest identity added to the context when Span ends. That said, we have various places where the identity can be augmented and I can't tell if Span cannot be ended before the latest roles or principal name was added. Anyway that would be a way to do stuff synchronously. There can be things that I don't see now which would block this plan.

@brunobat
Copy link
Contributor

Please do investigate and I still think we need the feature, even if it doesn't work for all flows. We just need to be more precise on the documentation and on how will be the user experience.
In the meantime it will be burried

@michalvavrik
Copy link
Member

Sounds good!

@michalvavrik
Copy link
Member

michalvavrik commented Mar 23, 2024

There is a thing I didn't know - https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#onendspan says that inside the io.opentelemetry.sdk.trace.SpanProcessor#onEnd you can't modify the Span. I'll try look for other options :-(

@michalvavrik
Copy link
Member

This is working - 8f7673d. I need to write some more tests and handle edge cases next week, but you can see that current test coverage is already big enough for knowing this concept is alright.

The fix however, won't be backportable as it relys on code that I changed so many times in past releases it would be merge hell.

@michalvavrik michalvavrik self-assigned this Mar 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tracing kind/bug Something isn't working
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants