Fix closed writer panic by resetting buffers/writer correctly on error path #1379

Merged · 4 commits into grafana:main · Apr 19, 2022

Conversation

@annanay25 (Contributor) commented Apr 13, 2022

Signed-off-by: Annanay <annanayagarwal@gmail.com>

What this PR does:
Refer to a longer explanation in this comment: #1374 (comment)

There is a small section of code in our data writer where the compression writer is closed but not reset. If there are any errors, such as failures writing to disk or transient errors with Kubernetes PVCs, the function could return without resetting the compression writer. This PR moves the reset logic into a defer func so that it runs whenever the function returns.

I understand that working with defer funcs in this manner makes the logic trickier to follow, so I'm open to alternative implementations.

Edit: Offline discussions confirmed the fears mentioned above :), so I went ahead with a different strategy.
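A minimal Go sketch of the defer-based idea described above, assuming hypothetical names (`pageWriter`, `cutPage`, `newCompressor`); the real Tempo data writer differs, and the merged change ultimately took a different route than the defer shown here:

```go
// Hypothetical sketch, not Tempo's actual code: pages are compressed into an
// in-memory buffer and then flushed to the backing writer (disk).
package writer

import (
	"bytes"
	"fmt"
	"io"
)

type pageWriter struct {
	buf           *bytes.Buffer
	compressor    io.WriteCloser                 // wraps buf; e.g. a snappy/zstd writer
	newCompressor func(io.Writer) io.WriteCloser // rebuilds the compression writer
	out           io.Writer                      // the on-disk destination
}

// cutPage closes the compression writer and flushes the compressed page. The
// reset of the buffer and the compression writer sits in a defer so that it
// runs on every return path, including the error return after a failed write.
func (p *pageWriter) cutPage() (int, error) {
	defer func() {
		p.buf.Reset()
		p.compressor = p.newCompressor(p.buf)
	}()

	if err := p.compressor.Close(); err != nil {
		return 0, fmt.Errorf("error closing compression writer: %w", err)
	}

	n, err := p.out.Write(p.buf.Bytes())
	if err != nil {
		// Before the fix, returning here skipped the reset, leaving a closed
		// compressor and a dirty buffer for the next page.
		return 0, fmt.Errorf("error writing page to disk: %w", err)
	}
	return n, nil
}
```

With the reset registered in a defer before any early return, both the buffer and the compression writer are rebuilt whether the flush succeeds or fails, so a later page never writes into a closed compressor.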

cc @jvilhuber

Which issue(s) this PR fixes:
Fixes the red herring error message in #1374. But we still need to fix the fact that we drop traces without warning when there are issues writing to disk.

Checklist

  • N/A Tests updated
  • N/A Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Annanay <annanayagarwal@gmail.com>
@joe-elliott (Member) commented:

Is there an actual panic here? Are we just looking to guarantee we do all cleanup work even if marshalPageToWriter fails?

@annanay25 (Contributor, Author) commented:

Nope, there's no panic. We just want to make sure ingesters function normally after a transient failure like a full disk.

@joe-elliott (Member) commented Apr 13, 2022

Frankly, if the disk is full it might make sense to actually panic. This is a pretty dire failure mode for Tempo, as ingested data will be lost.

With this change how will the ingester behave if the disk is persistently full?

@annanay25 (Contributor, Author) commented Apr 15, 2022

Yes, I agree. If the disk is persistently full, the ingester will keep trying to write but fail on every attempt until the disk is replaced. I'm not sure whether it's worth bubbling that error up and failing on the write path, in which case it would trigger an ingester write alert. As of now, though, the failure to write to disk goes undetected unless someone checks the logs after noticing that no more blocks are being created.

Does it then make sense to panic if there's an error in these lines:
https://github.com/grafana/tempo/pull/1379/files#diff-5925637787bab0225dab0a4ab2bcf81bfdd3a6c841d051759b11eae011e37067R75-R78

I believe that should do the trick, WDYT?
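A hedged sketch of the alternative floated here, reusing the hypothetical `pageWriter` from the earlier sketch; the helper name and message are illustrative and are not taken from the linked diff:

```go
// flushPage is a hypothetical wrapper around cutPage from the earlier sketch;
// it illustrates failing loudly when a page cannot be written out.
func (p *pageWriter) flushPage() int {
	n, err := p.cutPage()
	if err != nil {
		// A full or failing disk means ingested data is being lost, so panicking
		// (or bubbling the error up into the write path) makes the failure
		// visible to alerting instead of leaving it buried in the logs.
		panic(fmt.Sprintf("failed to flush page to disk: %v", err))
	}
	return n
}
```

Whether to panic or to return the error into the write path is exactly the trade-off discussed above; the panic simply makes the failure impossible to miss.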

Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
Signed-off-by: Annanay <annanayagarwal@gmail.com>
@annanay25 annanay25 changed the title Fix closed writer panic by resetting buffers/writer in a defer func Fix closed writer panic by resetting buffers/writer correctly on error path Apr 19, 2022
@joe-elliott (Member) left a review comment:

Thanks for the updates! Looks good.

@annanay25 annanay25 merged commit 214ceac into grafana:main Apr 19, 2022