Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Integrate e2e tracing for query vectors endpoint #1991

Merged
merged 5 commits into from
Apr 23, 2024

Conversation

sanketkedia
Copy link
Contributor

Description of changes

This PR integrates end-to-end tracing for the Query Service (specifically the query_vectors() RPC). The main highlights are:

  • Span inheritance by component - whenever a component is started the executor executes the run() function inside a child span with the current span as the parent. This works even across threads now for e.g. if the component was spawned in inherited mode but it invoked the receive msg handler on a different thread, etc.
  • Span propagation across components. For e.g. when the HNSW Query orchestrator submits the Pull logs request or brute force KNN request to the worker (via the dispatcher), the worker invokes the task handler inside a child span with parent span as the HNSW orchestrator. This is implemented by adding a new field to the Task struct - the span id of the parent and the worker then creates a child span with this id as the parent.

Test plan

  • [+] Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Check https://gist.github.com/sanketkedia/96f3b00a2037a0fb2e99c16c3a379e69 for the exact logs for query_vectors() rpc

Documentation Changes

None required

Copy link

github-actions bot commented Apr 9, 2024

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@@ -2,5 +2,8 @@ use worker::query_service_entrypoint;

#[tokio::main]
async fn main() {
tracing_subscriber::fmt()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, was this the fix for that issue? What is it by default.

@nicolasgere @beggers - how do you think we should handle the log level config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely less than info level since info level logs were not getting printed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you think we should handle the log level config?

Could you be more specific? Like, how should we configure it, how should we propagate it, how should we think about different logging levels, something else?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How we configure it and how we use the levels

@sanketkedia sanketkedia changed the title Integrate e2e tracing for query vectors endpoint [ENH] Integrate e2e tracing for query vectors endpoint Apr 13, 2024
Some(channel) => match channel.reply_to.send(task).await {
Some(channel) => match channel
.reply_to
.send(task, Some(Span::current().clone()))
Copy link
Collaborator

@HammadB HammadB Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity - is cloning spans the only option here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I gave this some thought. I was running into an issue where the dispatcher would pass in a span to the worker and by the time the worker executed with this span as a child, the parent span gets GC'd because the RAII guard that the dispatcher creates goes out of scope. So, we need some ref counting at the very least here.

I thought of trying out Arc then but the problem with that is the genesis span that I instrument at the beginning of the query_vectors RPC in query_service is created automatically by tracing library (since I instrument it with the #instrument wrapper) so I can't create an Arc for that. Hence, cloning.

@@ -51,7 +51,7 @@ where
{
async fn run(&self) {
let output = self.operator.run(&self.input).await;
let res = self.reply_channel.send(output).await;
let res = self.reply_channel.send(output, None).await;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to propagate a context here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also curious about this one. What information is missing if we do pass in the tracing context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought process here was that this sends a reply back to the original component that had asked to execute this task. And that component should already be executing inside a span (after all that is how it passed its context to the worker which is now sending a reply).

Maybe there are patterns other than this for which this needs a non None value but I can come back to those once I build more context of the system?

@@ -106,7 +106,7 @@ impl CompactOrchestrator {
};
let input = PullLogsInput::new(collection_id, 0, 100, None, Some(end_timestamp));
let task = wrap(operator, input, self_address);
match self.dispatcher.send(task).await {
match self.dispatcher.send(task, None).await {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we want to propagate here?

Copy link
Contributor Author

@sanketkedia sanketkedia Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes! I had just orchestrated one e2e flow of query_vectors and this code is not in that path hence haven't touched this yet. Also, there are plenty of other places where it is None for now but once I get more acquaintance with these code path, I'll add orchestration

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks!

@@ -51,7 +51,7 @@ where
{
async fn run(&self) {
let output = self.operator.run(&self.input).await;
let res = self.reply_channel.send(output).await;
let res = self.reply_channel.send(output, None).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also curious about this one. What information is missing if we do pass in the tracing context?

@@ -69,14 +70,24 @@ where
message = channel.recv() => {
match message {
Some(mut message) => {
message.handle(&mut self.handler,
&ComponentContext{
let parent_span: tracing::Span;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the code run if we turn off tracing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, does it add overhead even the tracing is not needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By turning off tracing, do you mean setting a level less than TRACE?

@Ishiihara
Copy link
Contributor

@sanketkedia Thank you for working on this! One meta question I have is that is there a way to only use macros provided by the tracing library to add tracing in our code?

@Ishiihara
Copy link
Contributor

@sanketkedia Additional feedback:

  • Can we wrap the code we want to instrument into function and use #[instrument] instead?
  • Can we just pass the parent span in the send function?
  • Can we introduce a config to turn off tracing if needed?

@sanketkedia
Copy link
Contributor Author

sanketkedia commented Apr 16, 2024

Can we wrap the code we want to instrument into function and use #[instrument] instead?

@Ishiihara trying to understand this comment. Do you mean that:

Instead of future.instrument(span).await(), I do

#[instrument]
async fn FutureWithInstrumentation() {
  return future.await()
}

If yes then I think that in order to be able to pass in a parent span to #[instrument], I will have to wrap the function in a struct like:

// A struct that owns a span handle.
struct MyStruct
{
    span: tracing::Span
}

impl MyStruct
{
    // Use the struct's `span` field as the parent span
    #[instrument(parent = &self.span, skip(self))]
    async fn FutureWithInstrumentation(&self, future) {
      return future.await();
    }
}

Taking one step back - do you see any benefits of organizing the code in this manner v/s a simple .instrument()?

@sanketkedia
Copy link
Contributor Author

Can we introduce a config to turn off tracing if needed?

Can you elaborate a bit on this? Do we want to stop emitting spans and events completely if tracing is disabled? Do we want to stop propagating the parent span to components?

If we emit the spans and events say only at TRACE level and by default set the level to < TRACE (like INFO or DEBUG) then they shouldn't get emitted even with the current changes. The only overhead that will remain will be cloning and propagation of parent spans. So maybe the config could be something like DisableSpanPropagation and if set to true we don't propagate spans? Let me know your thoughts on this

@sanketkedia
Copy link
Contributor Author

sanketkedia commented Apr 16, 2024

is there a way to only use macros provided by the tracing library to add tracing in our code?

We would have to use https://docs.rs/tracing/latest/tracing/trait.Instrument.html#method.instrument to instrument async code. I don't think there's any macro to accomplish this.

Taking a step back - I don't know enough about Rust. Is there a reason macros are preferred in rust?

@sanketkedia
Copy link
Contributor Author

Can we just pass the parent span in the send function?

That's what I am doing currently. E.g.

 self.dispatcher
       .send(task, Some(Span::current().clone()))
       .await

Maybe I did not understand your question?

@@ -24,7 +24,7 @@ frontendService:
authCredentialsProvider: 'value: ""'
authzProvider: 'value: ""'
authzConfigProvider: 'value: ""'
memberlistProviderImpl: 'value: "chromadb.segment.impl.distributed.segment_directory.MockMemberlistProvider"'
memberlistProviderImpl: 'value: "chromadb.segment.impl.distributed.segment_directory.CustomResourceMemberlistProvider"'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we revert this test_logservice.py should pass

@Ishiihara
Copy link
Contributor

Can we introduce a config to turn off tracing if needed?

Can you elaborate a bit on this? Do we want to stop emitting spans and events completely if tracing is disabled? Do we want to stop propagating the parent span to components?

If we emit the spans and events say only at TRACE level and by default set the level to < TRACE (like INFO or DEBUG) then they shouldn't get emitted even with the current changes. The only overhead that will remain will be cloning and propagation of parent spans. So maybe the config could be something like DisableSpanPropagation and if set to true we don't propagate spans? Let me know your thoughts on this

Yes. Ideally, we would like to have no overhead if we set the level to < TRACE. We can have a follow up PR to address this.

@Ishiihara
Copy link
Contributor

Can we wrap the code we want to instrument into function and use #[instrument] instead?

@Ishiihara trying to understand this comment. Do you mean that:

Instead of future.instrument(span).await(), I do

#[instrument]
async fn FutureWithInstrumentation() {
  return future.await()
}

If yes then I think that in order to be able to pass in a parent span to #[instrument], I will have to wrap the function in a struct like:

// A struct that owns a span handle.
struct MyStruct
{
    span: tracing::Span
}

impl MyStruct
{
    // Use the struct's `span` field as the parent span
    #[instrument(parent = &self.span, skip(self))]
    async fn FutureWithInstrumentation(&self, future) {
      return future.await();
    }
}

Taking one step back - do you see any benefits of organizing the code in this manner v/s a simple .instrument()?

I was thinking it is less intrusive to the implementation if we can use some micro or some generic way of doing tracing. From a high level perspective, tracing can be thought of as a decorator and should be less intrusive.

@Ishiihara
Copy link
Contributor

Can we just pass the parent span in the send function?

That's what I am doing currently. E.g.

 self.dispatcher
       .send(task, Some(Span::current().clone()))
       .await

Maybe I did not understand your question?

Thank you! I may miss this during the review.

@Ishiihara Ishiihara self-requested a review April 19, 2024 23:19
@Ishiihara
Copy link
Contributor

Not suggesting we do the same, but something can be a reference https://github.com/risingwavelabs/risingwave/blob/main/src/common/src/util/tracing.rs

@HammadB HammadB merged commit adf011c into chroma-core:main Apr 23, 2024
119 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants