Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

observability: annotate Session+SessionPool events #1207

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

odeke-em
Copy link
Contributor

This change adds annotations for session and session pool events to aid customers in debugging latency issues with session pool malevolence and also for maintainers to figure out which session pool type is the most appropriate.

Updates #1170

@odeke-em odeke-em requested review from a team as code owners September 19, 2024 03:48
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Sep 19, 2024
@product-auto-label product-auto-label bot added the api: spanner Issues related to the googleapis/python-spanner API. label Sep 19, 2024
@odeke-em odeke-em force-pushed the trace-annotation-SessionPools branch 3 times, most recently from 8d6f2b7 to 4bc937e Compare September 19, 2024 07:37
@odeke-em
Copy link
Contributor Author

Exhibit

On running examples/grpc_instrumentation_enabled.py we get back these traces
Screenshot 2024-09-18 at 9 39 29 PM
Screenshot 2024-09-18 at 9 39 47 PM
Screenshot 2024-09-18 at 9 40 01 PM
Screenshot 2024-09-18 at 9 40 25 PM

@odeke-em
Copy link
Contributor Author

cc @harshachinta

@harshachinta harshachinta added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 19, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 19, 2024
@harshachinta harshachinta added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 19, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 19, 2024
@odeke-em odeke-em force-pushed the trace-annotation-SessionPools branch 2 times, most recently from f71d05c to eb9dd5a Compare September 20, 2024 01:27
@harshachinta harshachinta added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 20, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 20, 2024
This change adds annotations for session and session pool
events to aid customers in debugging latency issues with
session pool malevolence and also for maintainers to
figure out which session pool type is the most appropriate.

Updates googleapis#1170
This allows us to avoid the flakes with OpenTelemetry's
inMemorySpanExporter that might not have received the
exported spans yet.
@odeke-em
Copy link
Contributor Author

Kindly cc-ing you @harshachinta.

@harshachinta harshachinta added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 23, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 23, 2024
google/cloud/spanner_v1/pool.py Outdated Show resolved Hide resolved
google/cloud/spanner_v1/pool.py Show resolved Hide resolved
google/cloud/spanner_v1/pool.py Show resolved Hide resolved
google/cloud/spanner_v1/pool.py Outdated Show resolved Hide resolved
google/cloud/spanner_v1/pool.py Show resolved Hide resolved
start_time = time.time()
current_span = get_current_span()
current_span.add_event(
"Waiting for a session to become available", {"kind": "pinging_pool"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not "Acquiring a session"? Lets maintain span consistency across all the pools.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because really it is waiting for a session to become available per

def get(self, timeout=None):
"""Check a session out from the pool.
:type timeout: int
:param timeout: seconds to block waiting for an available session
:rtype: :class:`~google.cloud.spanner_v1.session.Session`
:returns: an existing session from the pool, or a newly-created
session.
:raises: :exc:`queue.Empty` if the queue is empty.
"""
if timeout is None:
timeout = self.default_timeout
session = self._sessions.get(block=True, timeout=timeout)
if not session.exists():
session = self._database.session()
session.create()
return session

and in our document requirements there is a clear need to add that case
Screenshot 2024-10-05 at 10 43 59 PM

google/cloud/spanner_v1/session.py Outdated Show resolved Hide resolved
google/cloud/spanner_v1/session.py Outdated Show resolved Hide resolved
google/cloud/spanner_v1/session.py Outdated Show resolved Hide resolved
@odeke-em
Copy link
Contributor Author

odeke-em commented Oct 7, 2024

Kindly please take another look @harshachinta, feedback addressed, thank you for the code review!

raise
else:
span.set_status(Status(StatusCode.OK))


def set_span_error_and_record_exception(span, exc):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not used anywhere else, we can revert it back as it was.

resp = api.batch_create_sessions(
request=request,
metadata=metadata,
with trace_call("CloudSpanner.BatchCreateSessions", self) as span:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like trace_call is not imported here.

from google.cloud.spanner_v1._opentelemetry_tracing import trace_call

)

if requested_session_count > 0:
current_span.add_event(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current_span is not initialized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sheesh, sorry I had a bunch of benchmarks running locally and they interrupted my nox -s unit-3.8 tests so didn't see those but all fixed now.

@odeke-em odeke-em force-pushed the trace-annotation-SessionPools branch from 3c0c421 to 7cbbb0f Compare October 9, 2024 13:46
@harshachinta harshachinta added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 10, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 10, 2024


def get_current_span():
return trace.get_current_span()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid what happens here if opentelemetry is not installed.
trace will be None right? then all the places where this function will get invoked will get Null pointer exception.

)

if requested_session_count > 0:
span.add_event(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly span here could be None if opentelemetry is not installed (see

# Empty context manager. Users will have to check if the generated value is None or a span
).

In that case it can throw AttributeError: 'NoneType' object has no attribute 'add_event'

session = self._database.session()
session.create()
span_event_attributes["id"] = self._session_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here _session_id is not a propery of pool.
Test cases are already failing for this
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this should be session.session_id

session = self._sessions.get(block=True, timeout=timeout)
span_event_attributes["session.id"] = session.session_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here session.session_id can be empty, if session does not exist. Move this attribute addignment to after the if block below, as it ensures that session will be available.

session = self._database.session()
session.create()
span_event_attributes["id"] = self._session_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the attribute we are using for session id?
In some place above it was session.id, but here it is id. Can we follow the attribure naming consistently?


span_event_attributes["time.elapsed"] = time.time() - start_time
current_span.add_event("Acquired session", span_event_attributes)
Copy link
Contributor

@harshachinta harshachinta Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense to have "Acquired Session" and then "Using Session".

Also I feel "Acquired Session" event should exist when a session lies in pool. If there is no session in pool then we are creating session and in that case "Acquired Session" does not make sense.

So the span event hierachy would something be like this.

If session exists in the pool, then

  1. Acquiring Session
  2. Waiting for a session to become available
  3. Acquired Session
  4. Using Session

If no session in the pool

  1. Acquiring Session
  2. Waiting for a session to become available
  3. No session available
  4. Creating Session
  5. CloudSpanner.CreateSession span
  6. Using Session

Does that sound good?
@surbhigarg92 Do you have any comments on this structure?

@@ -422,6 +480,18 @@ def bind(self, database):
session_template=Session(creator_role=self.database_role),
)

requested_session_count = request.session_count
Copy link
Contributor

@harshachinta harshachinta Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you not followed the BatchCreateSession span here as commented earlier somewhere above?

raise ValueError("No session ID set by back-end")
err_message = "No session ID set by back-end"
current_span = get_current_span()
current_span.add_event(err_message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this error not get captured by parent span? Do we need to add it as event here?

if self._session_id is not None:
raise ValueError("Session ID already set by back-end")
err_message = "Session ID already set by back-end"
current_span.add_event(err_message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here also, will this not be captured by current_span?

@@ -174,6 +175,43 @@ def test_create_w_database_role(self):
"CloudSpanner.CreateSession", attributes=TestSession.BASE_ATTRIBUTES
)

def test_create_session_span_annotations(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add unit tests to verify spans for different pool class similar to what is done for create_session?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the googleapis/python-spanner API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants