
Add support for vulture sending long running traces #951

Merged (5 commits) Sep 22, 2021

Conversation

@zalegrala (Contributor) commented Sep 13, 2021

What this PR does:

Here we implement the ability for the vulture to send long-running traces, which will enable the vulture to verify the additional infrastructure components required to resolve a trace.

Which issue(s) this PR fixes:
Fixes #791

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zalegrala (Contributor, Author):

This will require #944

@zalegrala force-pushed the vultureLongWrite branch 3 times, most recently from 75a3c00 to 3c21291, on September 15, 2021 at 17:57
@zalegrala (Contributor, Author) commented Sep 15, 2021

This is working pretty well locally. There are two conditions that I currently know of that could result in false positives.

The first is that the number of bytes in a trace may exceed Tempo's configured limit, which is hard to know unless we query the API for the trace size. Perhaps some work on that API should precede this so we can work with JSON. In lieu of working on the API and reading the configured value from Tempo, I've reduced the chance that a trace will be extended with each future amendment. This seems like a reasonable approach, since with each extension of a trace the chance of extending it yet again continues to diminish.

The second is that when we go to read what we expect the result to be, we are effectively counting the number of writes that we expect have taken place. There is a margin of error where, if the timing worked out just right, the incorrect-result counter would rise falsely. I've added a dirty little sleep that should reduce the chance of that happening when it looks like we are within that margin of error. I don't feel great about it, but it probably works fine.

I'll leave it running locally for a while and see where we're at.
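A minimal sketch of the diminishing-extension idea described above, assuming a halving scheme and a hypothetical maybeExtend helper (neither is taken from the PR itself):

```go
package vulture // hypothetical package name for this sketch

import (
	"math"
	"math/rand"
)

// maybeExtend illustrates the diminishing-chance idea: the more times a
// trace has already been extended, the smaller the probability that it is
// extended again. The exact halving scheme is an assumption for illustration.
func maybeExtend(r *rand.Rand, extensions int) bool {
	// 1/2 chance of a first extension, 1/4 of a second, 1/8 of a third, ...
	return r.Float64() < 1.0/math.Pow(2, float64(extensions+1))
}
```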

@zalegrala force-pushed the vultureLongWrite branch 2 times, most recently from 11ff31c to 2688012, on September 15, 2021 at 20:09
@zalegrala marked this pull request as ready for review on September 15, 2021 at 21:38
zap.Int64("seed", info.timestamp.Unix()),
)

if maybe(info.r) {

Member:

i'd like to look at a better way to do this. perhaps we could generate a "total batches" int from the random number generator and store it in traceinfo. then we could decrement that value each time we emitted a batch?

Contributor Author:

Good call, I think that cleans up the readability a bit. I've made the update.
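A sketch of the total-batches approach under assumed names (the traceInfo fields, newTraceInfo, emitBatch, and the maxBatchesPerTrace value are all illustrative, not the PR's actual code):

```go
package vulture // hypothetical package name for this sketch

import (
	"math/rand"
	"time"
)

const maxBatchesPerTrace = 25 // placeholder cap, mirroring the 25-batch limit discussed below

// traceInfo stands in for the vulture's per-trace bookkeeping.
type traceInfo struct {
	timestamp        time.Time
	r                *rand.Rand
	totalBatches     int64
	batchesRemaining int64
}

// newTraceInfo seeds the RNG from the trace timestamp and draws the total
// batch count once, so the same seed always reproduces the same count.
func newTraceInfo(ts time.Time) *traceInfo {
	r := rand.New(rand.NewSource(ts.Unix()))
	total := r.Int63n(maxBatchesPerTrace) + 1 // at least one batch
	return &traceInfo{timestamp: ts, r: r, totalBatches: total, batchesRemaining: total}
}

// emitBatch reports whether another batch should be written and decrements
// the remaining count when it is.
func (t *traceInfo) emitBatch() bool {
	if t.batchesRemaining <= 0 {
		return false
	}
	t.batchesRemaining--
	return true
}
```

Because the count is drawn from the seeded generator, a reader that knows the seed can recompute the expected number of batches without tracking writer state.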

// If the last write has happened very recently, we'll wait a bit to make
// sure the write has taken place, and then add the batches that would have
// just been written.
if time.Since(lastWrite) < (1 * time.Second) {

Member:

this is pretty janky.

from a seed we should be able to determine when a trace was completed, and then either query the trace or skip it if we think it is not yet complete.

putting sleeps here is going to weirdly impact the timings of other loops.

Contributor Author:

Alright, I've washed off some of that jank. No more sleeps to construct the expected trace, and we only search/query when we expect that all writes have taken place.
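A sketch of how a seed-based completeness check might look, with assumed parameter names and an assumed fixed write interval (not the PR's actual function):

```go
package vulture // hypothetical package name for this sketch

import "time"

// writesComplete decides, from the trace's seed time, its total batch count,
// and the interval between long writes, whether every write should already
// have happened. The read path only queries Tempo once this returns true,
// so no sleeps are needed to line up with the writer.
func writesComplete(seed time.Time, totalBatches int64, writeInterval time.Duration, now time.Time) bool {
	// The final batch is expected at seed + (totalBatches-1) intervals.
	lastExpectedWrite := seed.Add(time.Duration(totalBatches-1) * writeInterval)
	return now.After(lastExpectedWrite)
}
```

The query loop can then skip any trace for which this returns false instead of sleeping, which avoids disturbing the timing of the other loops.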

// that we get the expected number of batches on a trace. A value larger
// than 25 here results in vulture writing traces that exceed the maximum
// trace size.
batchHighWaterMark int64 = 25

Member:

rename:

maxBatchesPerWrite
maxLongWritesPerTrace
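One way to read the suggestion is as two separate limits; a sketch under that assumption (the 25 comes from the original constant above, while the maxLongWritesPerTrace value is a placeholder):

```go
package vulture // hypothetical package name for this sketch

const (
	// maxBatchesPerWrite caps the batches emitted by a single write.
	// The 25 comes from the original batchHighWaterMark constant.
	maxBatchesPerWrite int64 = 25
	// maxLongWritesPerTrace caps how many long writes extend one trace.
	// The value here is a placeholder, not taken from the PR.
	maxLongWritesPerTrace int64 = 3
)
```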

@joe-elliott merged commit 2498d5b into grafana:main on Sep 22, 2021
@zalegrala deleted the vultureLongWrite branch on September 23, 2021 at 13:12