-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Integrate e2e tracing for query vectors endpoint #1991
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
@@ -2,5 +2,8 @@ use worker::query_service_entrypoint; | |||
|
|||
#[tokio::main] | |||
async fn main() { | |||
tracing_subscriber::fmt() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, was this the fix for that issue? What is it by default.
@nicolasgere @beggers - how do you think we should handle the log level config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is definitely less than info level since info level logs were not getting printed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do you think we should handle the log level config?
Could you be more specific? Like, how should we configure it, how should we propagate it, how should we think about different logging levels, something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How we configure it and how we use the levels
Some(channel) => match channel.reply_to.send(task).await { | ||
Some(channel) => match channel | ||
.reply_to | ||
.send(task, Some(Span::current().clone())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity - is cloning spans the only option here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I gave this some thought. I was running into an issue where the dispatcher would pass in a span to the worker and by the time the worker executed with this span as a child, the parent span gets GC'd because the RAII guard that the dispatcher creates goes out of scope. So, we need some ref counting at the very least here.
I thought of trying out Arc
then but the problem with that is the genesis span that I instrument at the beginning of the query_vectors RPC in query_service is created automatically by tracing library (since I instrument it with the #instrument
wrapper) so I can't create an Arc
for that. Hence, cloning.
@@ -51,7 +51,7 @@ where | |||
{ | |||
async fn run(&self) { | |||
let output = self.operator.run(&self.input).await; | |||
let res = self.reply_channel.send(output).await; | |||
let res = self.reply_channel.send(output, None).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to propagate a context here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also curious about this one. What information is missing if we do pass in the tracing context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought process here was that this sends a reply back to the original component that had asked to execute this task. And that component should already be executing inside a span (after all that is how it passed its context to the worker which is now sending a reply).
Maybe there are patterns other than this for which this needs a non None
value but I can come back to those once I build more context of the system?
@@ -106,7 +106,7 @@ impl CompactOrchestrator { | |||
}; | |||
let input = PullLogsInput::new(collection_id, 0, 100, None, Some(end_timestamp)); | |||
let task = wrap(operator, input, self_address); | |||
match self.dispatcher.send(task).await { | |||
match self.dispatcher.send(task, None).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we want to propagate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes! I had just orchestrated one e2e flow of query_vectors
and this code is not in that path hence haven't touched this yet. Also, there are plenty of other places where it is None
for now but once I get more acquaintance with these code path, I'll add orchestration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks!
7ca1612
to
4a63b61
Compare
@@ -51,7 +51,7 @@ where | |||
{ | |||
async fn run(&self) { | |||
let output = self.operator.run(&self.input).await; | |||
let res = self.reply_channel.send(output).await; | |||
let res = self.reply_channel.send(output, None).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also curious about this one. What information is missing if we do pass in the tracing context?
@@ -69,14 +70,24 @@ where | |||
message = channel.recv() => { | |||
match message { | |||
Some(mut message) => { | |||
message.handle(&mut self.handler, | |||
&ComponentContext{ | |||
let parent_span: tracing::Span; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the code run if we turn off tracing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not, does it add overhead even the tracing is not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By turning off tracing, do you mean setting a level less than TRACE
?
@sanketkedia Thank you for working on this! One meta question I have is that is there a way to only use macros provided by the tracing library to add tracing in our code? |
@sanketkedia Additional feedback:
|
@Ishiihara trying to understand this comment. Do you mean that: Instead of
If yes then I think that in order to be able to pass in a parent span to
Taking one step back - do you see any benefits of organizing the code in this manner v/s a simple |
Can you elaborate a bit on this? Do we want to stop emitting spans and events completely if tracing is disabled? Do we want to stop propagating the parent span to components? If we emit the spans and events say only at TRACE level and by default set the level to < TRACE (like INFO or DEBUG) then they shouldn't get emitted even with the current changes. The only overhead that will remain will be cloning and propagation of parent spans. So maybe the config could be something like |
We would have to use https://docs.rs/tracing/latest/tracing/trait.Instrument.html#method.instrument to instrument async code. I don't think there's any macro to accomplish this. Taking a step back - I don't know enough about Rust. Is there a reason macros are preferred in rust? |
That's what I am doing currently. E.g.
Maybe I did not understand your question? |
k8s/distributed-chroma/values.yaml
Outdated
@@ -24,7 +24,7 @@ frontendService: | |||
authCredentialsProvider: 'value: ""' | |||
authzProvider: 'value: ""' | |||
authzConfigProvider: 'value: ""' | |||
memberlistProviderImpl: 'value: "chromadb.segment.impl.distributed.segment_directory.MockMemberlistProvider"' | |||
memberlistProviderImpl: 'value: "chromadb.segment.impl.distributed.segment_directory.CustomResourceMemberlistProvider"' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we revert this test_logservice.py should pass
c21282d
to
5d603c2
Compare
Yes. Ideally, we would like to have no overhead if we set the level to < TRACE. We can have a follow up PR to address this. |
I was thinking it is less intrusive to the implementation if we can use some micro or some generic way of doing tracing. From a high level perspective, tracing can be thought of as a decorator and should be less intrusive. |
Thank you! I may miss this during the review. |
Not suggesting we do the same, but something can be a reference https://github.com/risingwavelabs/risingwave/blob/main/src/common/src/util/tracing.rs |
Description of changes
This PR integrates end-to-end tracing for the Query Service (specifically the
query_vectors()
RPC). The main highlights are:run()
function inside a child span with the current span as the parent. This works even across threads now for e.g. if the component was spawned ininherited
mode but it invoked the receive msg handler on a different thread, etc.Task
struct - the span id of the parent and the worker then creates a child span with this id as the parent.Test plan
pytest
for python,yarn test
for js,cargo test
for rustCheck https://gist.github.com/sanketkedia/96f3b00a2037a0fb2e99c16c3a379e69 for the exact logs for
query_vectors()
rpcDocumentation Changes
None required