Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish relative to PushMeterRegistry initialization time and align StepMeter boundaries to that #3450

Closed

Conversation

lenin-jaganathan
Copy link
Contributor

@lenin-jaganathan lenin-jaganathan commented Oct 4, 2022

This PR tries to solve this issue (#2818). The start time is captured when the push meter registry is initialized. For Step based meters, this is passed to the Step Value and StepTupple to align them to the registry start time and not the step start time (see - #1218).

TODO:

  • rebase and target 1.8.x
  • investigate test failures with the alignToEpoch config flipped

@pivotal-cla
Copy link

@lenin-jaganathan Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

@pivotal-cla
Copy link

@lenin-jaganathan Thank you for signing the Contributor License Agreement!

@sonatype-lift
Copy link
Contributor

sonatype-lift bot commented Oct 4, 2022

⚠️ 10 God Classes were detected by Lift in this project. Visit the Lift web console for more details.

@jamesmartinpp

This comment was marked as resolved.

@shakuzen

This comment was marked as resolved.

@jamesmartinpp

This comment was marked as resolved.

@shakuzen shakuzen changed the title Publish at registry start time Publish relative to PushMeterRegistry initialization time and align StepMeter boundaries to that Oct 7, 2022
Copy link
Member

@shakuzen shakuzen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good except the initial scheduling delay doesn't look right to me, and I would like to add tests to verify that logic. I will try to improve the naming of some things. And I would like to switch this new behavior to be the default, with a config option to go back to the previous behavior. I will follow-up with commits for that. And I'll rebase against and target the PR at 1.8.x.
Thank you for the contribution.

@@ -77,12 +80,18 @@ public void start(ThreadFactory threadFactory) {
scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(threadFactory);
// time publication to happen just after StepValue finishes the step
long stepMillis = config.step().toMillis();
long initialDelayMillis = stepMillis - (clock.wallTime() % stepMillis) + 1;
long initialDelayMillis = config.publishAtStep() ? stepMillis - (clock.wallTime() % stepMillis) + 1
: (stepMillis - registryStartMillis) + 1;
Copy link
Member

@shakuzen shakuzen Oct 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look correct to me. Since the meters' step will be shifted by registryStartMillis shouldn't the initial delay be stepMillis + 1 to publish after one step interval, regardless of the epoch interval? Actually, it is more complicated than that for the case when the registry is stopped and started or if there is at least a millisecond elapsed between initialization and scheduling. For a generic start time, I think it would be:
(edit: updated) stepMillis - ((clock.wallTime() - registryStartMillis) % stepMillis) + 1
Except this also doesn't work if the registry is restarted in between an epoch step boundary and a meter step boundary - it will be stepMillis too long. Probably there is a simpler way to capture that in an expression that I'm overlooking, but I have a commit that works per my testing.

@shakuzen shakuzen self-requested a review October 7, 2022 18:11
@shakuzen
Copy link
Member

shakuzen commented Oct 7, 2022

After flipping the default, there is a test failure. I've run out of time to dig into it, and I'll be on vacation next week. I can come back to this once I'm back from vacation, but feel free to look into the failure in the meantime.

@lenin-jaganathan
Copy link
Contributor Author

@shakuzen If the default behavior is made to publish relative to the start time, a few tests will fail because of the way they were written. I will try to address those tests.

@lenin-jaganathan
Copy link
Contributor Author

lenin-jaganathan commented Oct 18, 2022

@shakuzen Addressed the test failures in SignalFx library.

@lenin-jaganathan
Copy link
Contributor Author

Added an additional commit to make the metrics timestamp synchronized across instances(report timestamp as the start of the current step) for SignalFx Meter Registry.

@shakuzen shakuzen linked an issue Nov 4, 2022 that may be closed by this pull request
@@ -104,7 +104,7 @@ else if ("https".equals(apiUri.getScheme())) {

@Override
protected void publish() {
final long timestamp = clock.wallTime();
final long timestamp = (clock.wallTime() / config.step().toMillis()) * config.step().toMillis();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be the epoch-based step boundary. Depending on the configuration, that will not be correct (such as the default configuration). I have a fix for it locally I will push.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking more about it, I'm wondering why we would change this. Before this change it is sending the timestamp as the time when publishing starts. Why would we change it to the start of the step with this change? That seems like a separate decision to make not directly related to the change of this pull request. I think I'll revert this change. Please open a separate issue or pull request to consider making this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, the reporting was done at the top of the step (for simplicity let's assume 1min step). And if I understand right all the instances will report almost similar timestamps (i,e start of the minute). With this change, the timestamp will be de-synchronized, it will no more be the start of the minute, what I tried to do was to make it to the top of the minute.

Because when sending data from multiple instances without instanceId (or a unique identifier for an instance), the back-end of SignalFx might be able to calculate what is step a particular metric is reported at. Having said that, I will follow-up on this with a different PR after confirming with someone from SignalFx confirm this.

@@ -457,13 +456,13 @@ void shouldNotExportCumulativeHistogramDataByDefault_Timer() {
mockClock.add(config.step().minus(Duration.ofMillis(1)));

assertThat(getDataPoints(registry, mockClock.wallTime())).hasSize(8)
.has(gaugePoint("my.timer.avg", 0.2525), atIndex(0)).has(counterPoint("my.timer.count", 2), atIndex(1))
.has(gaugePoint("my.timer.avg", 2.525), atIndex(0)).has(counterPoint("my.timer.count", 2), atIndex(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My review comment about investigating the cause of the failures meant understanding exactly how the behavior changed and make sure the change and test made sense. This appears to change the values so the test passes, but I don't think it's demonstrating what the test intended. We need to be clear about what changed and if that's acceptable. I'll write up a summary in another comment.

When investigating this, I found a bug in the assertion logic for hasValue which caused it to always return true for double values that would have an int value of 0 (like 0.2525). I'll fix that in a separate issue and rebase this pull request on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you pointed out hasValue had a bug all the time and the tests would have failed if it is right. And the other issue is with the TimeWindowFixedBoundaryHistogram and buffer length. That is the whole reason SignalFx Registry went for a Delta Histogram Implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the tests back to assert what they did before as I think that's what the intention of the tests was. I've had to change the test to record later than the beginning of the step to avoid the histogram counts being rotated out when checking the published data. That whole problem is a preexisting issue that was worked around in the tests by checking the published data at a specific millisecond before the histogram would rotate but after the step-based values (count/sum/avg) would return the desired new values. We can consider if the behavior of histogram rotation is best or not separately from this pull request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@lenin-jaganathan
Copy link
Contributor Author

@shakuzen Do you want me to address anything more on this PR?

* @return how many milliseconds publishing is offset from Unix Epoch-based step
* intervals
*/
protected long getPushOffsetFromEpochStepMillis() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the name of this method and added a JavaDoc in hopes of making things more clear.

* @return false if publishing should be scheduled relative to registry instantiation
* time. Default is {@code false} to avoid the documented resource exhaustion issue.
*/
default boolean isPushAlignedGlobally() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably spend endless time coming up with better and better names, but here's a third iteration where I've avoided using "epoch" in the name because I think it will be less clear what this does compared to focusing on the effect of pushes being aligned globally if you set this to true. Anyone who has feedback about the naming, feel free to share (before we release it, ideally).

I've updated the JavaDoc to explain the issue that motivated this configuration being introduced and the reason for its default.

@shakuzen
Copy link
Member

Do you want me to address anything more on this PR?

I don't think so. I'm asking for some review internally. Once that is done, we should be good to go. If you have any thoughts on naming as I've recently updated, please share. Or if you have thoughts on the below.

On how we'll apply this where, we'll keep the default for the configuration false when merging this to main for the next feature release (1.11), which makes the publishing happen not at the same time for every instance. Out of an abundance of caution, now I would like to introduce this in patch releases (1.8.x, 1.9.x, 1.10.x) but initially with the default to true so there is no behavior change unless a user overrides the default. If we get some feedback on usage of this in production, we could consider flipping the default in a patch release later, as the current behavior is very problematic in large deployments. Until we get some feedback on usage, I'm not comfortable making the change in a patch release, even with an option for users to opt out of it. Another thing we're considering is making the configuration option deprecated in 1.11 because as long as there is no problem with the behavior, I'd rather not have a configuration toggle for it going forward (we didn't have one until now because we wanted consistent behavior and there wasn't a need for users to be able to choose). At some point in the future, we could consider removing the configuration altogether.

ljaganathan and others added 6 commits November 17, 2022 16:22
The publishing should not happen relative to the epoch step with publishAtStep set to false. Rather it should happen relative to the meters' step, which is offset from the epoch steps by `registryStartMillis`.
Renames variables/methods with the intention to improve clarity and readability. Edits the description of the new config method JavaDoc.
We know the previous behavior causes issues with many instances running, which makes it not a good default. We will leave this configuration for now so that users have a way to opt out of the new behavior if it causes unforeseen issues for them.
Adds a warning to the JavaDoc about the side effect of aligning to the Epoch.
ljaganathan and others added 8 commits November 17, 2022 16:22
The previous test was flaky and relied on system-clock timings. This is replaced with a more direct test of calculating the delay by extracting the logic to a package-private method.
We can consider such a change separate from this changeset.
Record in the middle of the step to avoid histogram counts being rotated out by the time of the simulated publication in the affected tests.
This updates names to hopefully make more clear their usage, while also polishing the JavaDocs to give more background.
@lenin-jaganathan
Copy link
Contributor Author

Putting this in draft mode. Will close this PR in a week. For additional info see the linked issue for why this fix might not be desired.

@lenin-jaganathan lenin-jaganathan marked this pull request as draft April 6, 2023 03:34
@lenin-jaganathan
Copy link
Contributor Author

Fixed by #3750

@lenin-jaganathan lenin-jaganathan deleted the upstream_main branch April 29, 2023 06:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid global publishing alignment
4 participants