[ENH] Integrate e2e tracing for query vectors endpoint #1991

sanketkedia · 2024-04-09T20:40:26Z

Description of changes

This PR integrates end-to-end tracing for the Query Service (specifically the query_vectors() RPC). The main highlights are:

Span inheritance by component: whenever a component is started, the executor runs its run() function inside a child span whose parent is the current span. This now works across threads too, for example when a component was spawned in inherited mode but its receive-message handler is invoked on a different thread.
Span propagation across components: for example, when the HNSW Query orchestrator submits the Pull logs request or the brute force KNN request to the worker (via the dispatcher), the worker invokes the task handler inside a child span whose parent is the HNSW orchestrator's span. This is implemented by adding a new field to the Task struct, the span id of the parent; the worker then creates a child span with this id as its parent.
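
As a rough illustration of the propagation described above, the sketch below models a task that carries its submitter's span id and a worker that opens a child span under it. Everything here is a hypothetical stand-in: SpanRecord, Task, submit and handle are illustrative names, and plain structs are used instead of the tracing crate, so this is not the PR's actual code.

```rust
// Illustrative model only: plain structs stand in for the tracing crate's spans.
use std::sync::atomic::{AtomicU64, Ordering};

// Source of unique span ids for this sketch.
static NEXT_ID: AtomicU64 = AtomicU64::new(1);

/// Hypothetical span record: an id plus the id of its parent, if any.
#[derive(Debug, Clone, PartialEq)]
pub struct SpanRecord {
    pub id: u64,
    pub parent: Option<u64>,
    pub name: &'static str,
}

impl SpanRecord {
    pub fn new(name: &'static str, parent: Option<u64>) -> Self {
        SpanRecord {
            id: NEXT_ID.fetch_add(1, Ordering::Relaxed),
            parent,
            name,
        }
    }
}

/// Hypothetical task: the payload plus the span id of whoever submitted it.
pub struct Task {
    pub payload: String,
    pub parent_span: Option<u64>,
}

/// Orchestrator side: stamp the task with the span it is currently running in.
pub fn submit(payload: &str, current: &SpanRecord) -> Task {
    Task {
        payload: payload.to_string(),
        parent_span: Some(current.id),
    }
}

/// Worker side: open a child span under the propagated parent.
/// A real handler would then run the task inside the returned span.
pub fn handle(task: &Task) -> SpanRecord {
    SpanRecord::new("task_handler", task.parent_span)
}
```

With this shape, a task submitted from an orchestrator span yields a handler span whose parent field equals the orchestrator's id, which is the parent/child link the trace output relies on.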

Test plan

[+] Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Check https://gist.github.com/sanketkedia/96f3b00a2037a0fb2e99c16c3a379e69 for the exact logs for query_vectors() rpc

Documentation Changes

None required

github-actions · 2024-04-09T20:40:38Z

HammadB · 2024-04-09T22:01:36Z

rust/worker/src/bin/query_service.rs

@@ -2,5 +2,8 @@ use worker::query_service_entrypoint;

 #[tokio::main]
 async fn main() {
+    tracing_subscriber::fmt()


Interesting, was this the fix for that issue? What is the level by default?

@nicolasgere @beggers - how do you think we should handle the log level config?

It is definitely less than info level, since info-level logs were not getting printed.

how do you think we should handle the log level config?

Could you be more specific? Like, how should we configure it, how should we propagate it, how should we think about different logging levels, something else?

How we configure it and how we use the levels
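
To make the two questions concrete, here is a minimal sketch of one possible answer: the maximum level comes from an optional config string, and an event is emitted only if it is no more verbose than that maximum. The LogLevel enum, max_level and enabled are hypothetical names for illustration, the Info default is an assumption, and none of this is the tracing_subscriber API.

```rust
use std::str::FromStr;

/// Verbosity levels, declared so that a more verbose level compares greater.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum LogLevel {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

impl FromStr for LogLevel {
    type Err = String;

    // Case-insensitive parse of a level name.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "error" => Ok(LogLevel::Error),
            "warn" => Ok(LogLevel::Warn),
            "info" => Ok(LogLevel::Info),
            "debug" => Ok(LogLevel::Debug),
            "trace" => Ok(LogLevel::Trace),
            other => Err(format!("unknown log level: {}", other)),
        }
    }
}

/// Resolve the maximum level from an optional config value.
/// Missing or unparsable values fall back to Info (an assumed default).
pub fn max_level(configured: Option<&str>) -> LogLevel {
    configured
        .and_then(|s| s.parse().ok())
        .unwrap_or(LogLevel::Info)
}

/// An event is emitted only if it is no more verbose than the maximum.
pub fn enabled(event: LogLevel, max: LogLevel) -> bool {
    event <= max
}
```

Under this scheme, spans and events emitted at Trace stay silent unless the configured maximum is raised to Trace.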

HammadB · 2024-04-15T19:53:37Z

rust/worker/src/execution/dispatcher.rs

-            Some(channel) => match channel.reply_to.send(task).await {
+            Some(channel) => match channel
+                .reply_to
+                .send(task, Some(Span::current().clone()))


out of curiosity - is cloning spans the only option here?

Yeah, I gave this some thought. I was running into an issue where the dispatcher would pass in a span to the worker and by the time the worker executed with this span as a child, the parent span gets GC'd because the RAII guard that the dispatcher creates goes out of scope. So, we need some ref counting at the very least here.

I then thought of trying Arc, but the problem is that the genesis span at the beginning of the query_vectors RPC in query_service is created automatically by the tracing library (since I instrument the function with the #[instrument] attribute), so I can't wrap it in an Arc. Hence, cloning.
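
The lifetime argument above can be sketched with a reference-counted handle: the span stays open as long as any clone of the handle is alive, even after the dispatcher's own handle is dropped. This is a std-only model with hypothetical names (SpanHandle, open_span, dispatch); it illustrates the ref-counting idea and is not how the tracing crate is implemented.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared state of a hypothetical span; marks itself closed when the last handle goes away.
struct Inner {
    closed: Arc<AtomicBool>,
}

impl Drop for Inner {
    fn drop(&mut self) {
        self.closed.store(true, Ordering::SeqCst);
    }
}

/// Cheap, clonable handle. Cloning bumps a reference count; it does not copy the span.
#[derive(Clone)]
pub struct SpanHandle {
    _inner: Arc<Inner>,
}

/// Open a span and return a handle plus a flag that flips to true once the span closes.
pub fn open_span() -> (SpanHandle, Arc<AtomicBool>) {
    let closed = Arc::new(AtomicBool::new(false));
    let handle = SpanHandle {
        _inner: Arc::new(Inner {
            closed: closed.clone(),
        }),
    };
    (handle, closed)
}

/// Dispatcher side: clone the handle for the worker, then drop the dispatcher's own handle.
pub fn dispatch(parent: SpanHandle) -> SpanHandle {
    let for_worker = parent.clone();
    drop(parent); // the dispatcher's handle is gone, but the span is still open
    for_worker
}
```

In this model the closed flag only flips once the worker's clone is dropped as well, which is the property the parent span needs while the child task is still running.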

HammadB · 2024-04-15T19:54:17Z

rust/worker/src/execution/operator.rs

@@ -51,7 +51,7 @@ where
 {
    async fn run(&self) {
        let output = self.operator.run(&self.input).await;
-        let res = self.reply_channel.send(output).await;
+        let res = self.reply_channel.send(output, None).await;


do we want to propagate a context here?

Also curious about this one. What information is missing if we do pass in the tracing context?

My thought process here was that this sends a reply back to the original component that had asked to execute this task. And that component should already be executing inside a span (after all that is how it passed its context to the worker which is now sending a reply).

Maybe there are other patterns for which this needs a non-None value, but I can come back to those once I have built more context on the system?

HammadB · 2024-04-15T19:54:48Z

rust/worker/src/execution/orchestration/compact.rs

@@ -106,7 +106,7 @@ impl CompactOrchestrator {
        };
        let input = PullLogsInput::new(collection_id, 0, 100, None, Some(end_timestamp));
        let task = wrap(operator, input, self_address);
-        match self.dispatcher.send(task).await {
+        match self.dispatcher.send(task, None).await {


would we want to propagate here?

Ah, yes! I had only instrumented one e2e flow, query_vectors, and this code is not on that path, so I haven't touched it yet. There are also plenty of other places where it is None for now; once I am better acquainted with these code paths, I'll add propagation there too.

HammadB

This is great, thanks!

Ishiihara · 2024-04-15T21:25:03Z

rust/worker/src/execution/operator.rs

@@ -51,7 +51,7 @@ where
 {
    async fn run(&self) {
        let output = self.operator.run(&self.input).await;
-        let res = self.reply_channel.send(output).await;
+        let res = self.reply_channel.send(output, None).await;


Also curious about this one. What information is missing if we do pass in the tracing context?

Ishiihara · 2024-04-15T21:45:43Z

rust/worker/src/system/executor.rs

@@ -69,14 +70,24 @@ where
                message = channel.recv() => {
                    match message {
                        Some(mut message) => {
-                            message.handle(&mut self.handler,
-                                &ComponentContext{
+                            let parent_span: tracing::Span;


Would the code run if we turn off tracing?

If not, does it add overhead even when tracing is not needed?

By turning off tracing, do you mean setting a level less than TRACE?

Ishiihara · 2024-04-15T21:59:20Z

@sanketkedia Thank you for working on this! One meta question: is there a way to add tracing to our code using only the macros provided by the tracing library?

Ishiihara · 2024-04-15T22:22:31Z

@sanketkedia Additional feedback:

Can we wrap the code we want to instrument into function and use #[instrument] instead?
Can we just pass the parent span in the send function?
Can we introduce a config to turn off tracing if needed?

sanketkedia · 2024-04-16T07:56:41Z

Can we wrap the code we want to instrument into function and use #[instrument] instead?

@Ishiihara trying to understand this comment. Do you mean that:

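Instead of future.instrument(span).await, I do

use std::future::Future;
use tracing::instrument;

// Wrap the future in an instrumented function instead of calling .instrument(span).
#[instrument(skip(future))]
async fn FutureWithInstrumentation<F: Future>(future: F) -> F::Output {
    future.await
}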

If yes then I think that in order to be able to pass in a parent span to #[instrument], I will have to wrap the function in a struct like:

// A struct that owns a span handle.
struct MyStruct {
    span: tracing::Span,
}

impl MyStruct {
    // Use the struct's `span` field as the parent span.
    // `future` is skipped because it is not recorded as a span field.
    #[instrument(parent = &self.span, skip(self, future))]
    async fn FutureWithInstrumentation<F: Future>(&self, future: F) -> F::Output {
        future.await
    }
}

Taking a step back: do you see any benefit to organizing the code this way versus a simple .instrument()?

sanketkedia · 2024-04-16T08:07:35Z

Can we introduce a config to turn off tracing if needed?

Can you elaborate a bit on this? Do we want to stop emitting spans and events completely if tracing is disabled? Do we want to stop propagating the parent span to components?

If we emit the spans and events say only at TRACE level and by default set the level to < TRACE (like INFO or DEBUG) then they shouldn't get emitted even with the current changes. The only overhead that will remain will be cloning and propagation of parent spans. So maybe the config could be something like DisableSpanPropagation and if set to true we don't propagate spans? Let me know your thoughts on this
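
A minimal sketch of the flag floated above, assuming a hypothetical TracingConfig struct and a stand-in SpanHandle type (the real code would pass a tracing::Span). The current span is supplied as a closure so that, when propagation is disabled, not even the clone happens.

```rust
/// Hypothetical config; the field mirrors the DisableSpanPropagation name suggested above.
#[derive(Debug, Clone, Copy, Default)]
pub struct TracingConfig {
    pub disable_span_propagation: bool,
}

/// Stand-in for a span handle; the real code would use tracing::Span.
#[derive(Debug, Clone, PartialEq)]
pub struct SpanHandle(pub u64);

/// Decide what to attach to an outgoing task.
/// The closure is only called when propagation is enabled,
/// so the cost of obtaining and cloning the span is skipped otherwise.
pub fn parent_for_task<F>(config: &TracingConfig, current: F) -> Option<SpanHandle>
where
    F: FnOnce() -> SpanHandle,
{
    if config.disable_span_propagation {
        None
    } else {
        Some(current())
    }
}
```

A call site could then read something like dispatcher.send(task, parent_for_task(&config, || current_span())), where current_span is again a hypothetical helper.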

sanketkedia · 2024-04-16T08:09:57Z

is there a way to only use macros provided by the tracing library to add tracing in our code?

We would have to use https://docs.rs/tracing/latest/tracing/trait.Instrument.html#method.instrument to instrument async code. I don't think there's any macro to accomplish this.

Taking a step back: I don't know enough about Rust. Is there a reason macros are preferred in Rust?

sanketkedia · 2024-04-16T08:13:44Z

Can we just pass the parent span in the send function?

That's what I am doing currently. E.g.

 self.dispatcher
       .send(task, Some(Span::current().clone()))
       .await

Maybe I did not understand your question?

HammadB · 2024-04-17T17:17:23Z

k8s/distributed-chroma/values.yaml

@@ -24,7 +24,7 @@ frontendService:
  authCredentialsProvider: 'value: ""'
  authzProvider: 'value: ""'
  authzConfigProvider: 'value: ""'
-  memberlistProviderImpl: 'value: "chromadb.segment.impl.distributed.segment_directory.MockMemberlistProvider"'
+  memberlistProviderImpl: 'value: "chromadb.segment.impl.distributed.segment_directory.CustomResourceMemberlistProvider"'


If we revert this, test_logservice.py should pass.

Ishiihara · 2024-04-19T23:13:21Z

Can we introduce a config to turn off tracing if needed?

Can you elaborate a bit on this? Do we want to stop emitting spans and events completely if tracing is disabled? Do we want to stop propagating the parent span to components?

If we emit the spans and events say only at TRACE level and by default set the level to < TRACE (like INFO or DEBUG) then they shouldn't get emitted even with the current changes. The only overhead that will remain will be cloning and propagation of parent spans. So maybe the config could be something like DisableSpanPropagation and if set to true we don't propagate spans? Let me know your thoughts on this

Yes. Ideally, we would like to have no overhead if we set the level to < TRACE. We can have a follow-up PR to address this.

Ishiihara · 2024-04-19T23:18:21Z

Can we wrap the code we want to instrument into function and use #[instrument] instead?

@Ishiihara trying to understand this comment. Do you mean that:

Instead of future.instrument(span).await, I do

use std::future::Future;
use tracing::instrument;

// Wrap the future in an instrumented function instead of calling .instrument(span).
#[instrument(skip(future))]
async fn FutureWithInstrumentation<F: Future>(future: F) -> F::Output {
    future.await
}

If yes then I think that in order to be able to pass in a parent span to #[instrument], I will have to wrap the function in a struct like:

// A struct that owns a span handle.
struct MyStruct {
    span: tracing::Span,
}

impl MyStruct {
    // Use the struct's `span` field as the parent span.
    // `future` is skipped because it is not recorded as a span field.
    #[instrument(parent = &self.span, skip(self, future))]
    async fn FutureWithInstrumentation<F: Future>(&self, future: F) -> F::Output {
        future.await
    }
}

Taking a step back: do you see any benefit to organizing the code this way versus a simple .instrument()?

I was thinking it would be less intrusive to the implementation if we could use some macro or some generic way of doing tracing. From a high-level perspective, tracing can be thought of as a decorator and should be less intrusive.

Ishiihara · 2024-04-19T23:18:52Z

Can we just pass the parent span in the send function?

That's what I am doing currently. E.g.

 self.dispatcher
       .send(task, Some(Span::current().clone()))
       .await

Maybe I did not understand your question?

Thank you! I may have missed this during the review.

Ishiihara · 2024-04-19T23:35:32Z

Not suggesting we do the same, but this could serve as a reference: https://github.com/risingwavelabs/risingwave/blob/main/src/common/src/util/tracing.rs

HammadB reviewed Apr 9, 2024

rust/worker/src/execution/operator.rs (outdated)

HammadB reviewed Apr 9, 2024

rust/worker/src/execution/operator.rs (outdated)

HammadB reviewed Apr 9, 2024

rust/worker/src/execution/worker_thread.rs (outdated)

sanketkedia changed the title ~~Integrate e2e tracing for query vectors endpoint~~ [ENH] Integrate e2e tracing for query vectors endpoint Apr 13, 2024

HammadB reviewed Apr 15, 2024

HammadB approved these changes Apr 15, 2024

sanketkedia force-pushed the integrate_tracing branch from 7ca1612 to 4a63b61 Compare April 15, 2024 21:42

Ishiihara reviewed Apr 15, 2024

HammadB reviewed Apr 17, 2024

skedia added 5 commits April 18, 2024 08:28

Integrate tracing for query vectors

a7398ed

Review comments

bb2f70e

Fix build

1d2b944

Revert member list change

49c4268

Change span and event level to trace

5d603c2

sanketkedia force-pushed the integrate_tracing branch from c21282d to 5d603c2 Compare April 18, 2024 16:02

Ishiihara self-requested a review April 19, 2024 23:19

Ishiihara approved these changes Apr 19, 2024

HammadB merged commit adf011c into chroma-core:main Apr 23, 2024
119 checks passed
