TestBallastMemory randomly fails still, even with the old fix #6535

Closed
boostchicken opened this issue Dec 6, 2021 · 9 comments · Fixed by #6691 or #8789
Assignees: jpkrohling
Labels: bug (Something isn't working), ci-cd (CI, CD, testing, build issues)

Comments

@boostchicken
Member

boostchicken commented Dec 6, 2021

Describe the bug
TestBallastMemory fails intermittently, and the output of the failure is meaningless. It does not always fail, but I just saw a pull request fail because the check we wrote to "give it some leeway" reported that it expected nothing higher than 88 while the value was 89, and that failed the whole PR.

Steps to reproduce
Just open a PR and run the testbed; it will eventually fail.
What did you expect to see?
I expect to see information on what we are really testing here and what the values mean. Yes, 89 is higher than 88, but I sincerely doubt that is a problem of memory usage.

We should probably look at whether the memory values deviate meaningfully across multiple runs of the same code; right now the test just seems to give an extra 10% leeway, which might hide real issues as well (see the sketch below). I'd like to discuss the test here a little: what it does, why it's there, and how we find a real solution to this issue.

If a test fails, it should convey a path to resolution. If a contributor saw that the memory ballast was impacted by their changes and this test failed, there is no detail on what one might do to remediate it.
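To make the idea above concrete, here is a minimal sketch of a deviation-based check, assuming we had RSS samples (in MiB) from several previous runs; the function and variable names are hypothetical and not part of the testbed:

```go
package main

import (
	"fmt"
	"math"
)

// withinDeviation is a hypothetical helper: instead of a fixed +10% allowance
// on a hard-coded expected value, it compares the latest RSS reading against
// the mean of previous runs and fails only when the reading is more than
// maxSigmas standard deviations away from that mean.
func withinDeviation(history []float64, latest, maxSigmas float64) (bool, string) {
	if len(history) == 0 {
		return true, "no history to compare against"
	}

	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))

	var sq float64
	for _, v := range history {
		sq += (v - mean) * (v - mean)
	}
	stddev := math.Sqrt(sq / float64(len(history)))
	if stddev == 0 {
		// All historical runs were identical: fall back to a small absolute margin.
		return math.Abs(latest-mean) <= 1.0,
			fmt.Sprintf("latest=%.1f MiB, mean=%.1f MiB (zero variance in history)", latest, mean)
	}

	sigmas := math.Abs(latest-mean) / stddev
	return sigmas <= maxSigmas, fmt.Sprintf(
		"latest=%.1f MiB, mean=%.1f MiB, stddev=%.1f MiB, deviation=%.2f sigmas (limit %.1f)",
		latest, mean, stddev, sigmas, maxSigmas)
}

func main() {
	// Illustrative numbers only, loosely based on the failure in this issue.
	history := []float64{84, 85, 86, 87, 88}
	ok, detail := withinDeviation(history, 89, 3)
	fmt.Println(ok, detail)
}
```

A failure message along these lines would also tell a contributor how far off the measurement actually is, rather than just "89 is not less than or equal to 88".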
What did you see instead?
Some extra memory was being used for some reason; very little, but more than we expected.
What version did you use?
Any build
What config did you use?
Any
Environment
OS: Any
Compiler (if manually compiled): Any
Additional context
Old issue here
open-telemetry/opentelemetry-collector#3233

@boostchicken boostchicken added the bug Something isn't working label Dec 6, 2021
@boostchicken
Member Author

P.S. Happy to be part of the rewrite of this.

@boostchicken
Member Author

The log from one of my runs:

=== RUN   TestBallastMemory
2021/12/06 14:17:47 Starting Agent (/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/bin/otelcontribcol_linux_amd64)
2021/12/06 14:17:47 Writing Agent log to /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/tests/results/TestBallastMemory/agent.log
2021/12/06 14:17:47 Agent running, pid=1858
2021/12/06 14:17:47 Stopped generator. Sent:         0 items
2021/12/06 14:17:47 Gracefully terminating Agent pid=1858, sending SIGTEM...
2021/12/06 14:17:47 Stopping process monitor.
2021/12/06 14:17:47 Agent process stopped, exit code=0
2021/12/06 14:17:47 Starting Agent (/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/bin/otelcontribcol_linux_amd64)
2021/12/06 14:17:47 Writing Agent log to /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/tests/results/TestBallastMemory/agent.log
2021/12/06 14:17:47 Agent running, pid=1865
    assertion_compare.go:342: 
        	Error Trace:	e2e_test.go:122
        	Error:      	"89" is not less than or equal to "88"
        	Test:       	TestBallastMemory
        	Messages:   	[]
2021/12/06 14:17:47 Stopped generator. Sent:         0 items
2021/12/06 14:17:47 Gracefully terminating Agent pid=1865, sending SIGTEM...
2021/12/06 14:17:47 Stopping process monitor.
2021/12/06 14:17:47 Agent process stopped, exit code=0
2021/12/06 14:17:47 Starting Agent (/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/bin/otelcontribcol_linux_amd64)
2021/12/06 14:17:47 Writing Agent log to /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/tests/results/TestBallastMemory/agent.log
2021/12/06 14:17:47 Agent running, pid=1872
2021/12/06 14:17:47 Stopped generator. Sent:         0 items
2021/12/06 14:17:47 Gracefully terminating Agent pid=1872, sending SIGTEM...
2021/12/06 14:17:47 Stopping process monitor.
2021/12/06 14:17:47 Agent process stopped, exit code=0
--- FAIL: TestBallastMemory (0.76s)

@jpkrohling
Member

Is the limit currently 80, with a 10% margin?

@jpkrohling jpkrohling added the ci-cd CI, CD, testing, build issues label Dec 6, 2021
@boostchicken
Member Author

Is the limit currently 80, with a 10% margin?

It seems to be the case; it applies a 1.1 multiplier (a sketch of that kind of check is below). I believe there is a better way here, though. What is the ballast and its purpose?
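For reference, this is roughly what a +10% allowance check looks like in a Go test using testify's assert package; the helper name and the numbers in the comment are illustrative, not the actual testbed code:

```go
package tests

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// checkRAMUsage is an illustrative helper, not the real testbed function:
// the expected maximum is multiplied by 1.1 to give a 10% allowance before
// the assertion runs, so an expected 80 MiB allows up to 88 MiB and a
// reading of 89 MiB fails the test.
func checkRAMUsage(t *testing.T, measuredMiB, expectedMaxMiB uint32) {
	allowedMiB := uint32(float64(expectedMaxMiB) * 1.1)
	assert.LessOrEqual(t, measuredMiB, allowedMiB,
		"RSS of %d MiB is above the allowed %d MiB (expected max %d MiB plus 10%% margin)",
		measuredMiB, allowedMiB, expectedMaxMiB)
}
```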

@jpkrohling
Member

Ballast is a general strategy for Go (and other languages?) to trick the GC so that it runs in a pattern closer to how the application behaves. This might help you understand it: https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i-learnt-to-stop-worrying-and-love-the-heap/

But if you mean our ballast extension, here it is: https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/ballastextension
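For anyone else reading along, here is a minimal sketch of the ballast trick described in the Twitch post, assuming an arbitrary 1 GiB ballast (the real ballastextension sizes it from its configuration):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Allocate a large, otherwise-unused slice. The pages are never touched,
	// so RSS stays low, but the live heap the GC sees is now ~1 GiB, which
	// pushes the next GC target (roughly live heap * (1 + GOGC/100)) far
	// above the real working set and makes the GC run much less often.
	ballast := make([]byte, 1<<30)

	// ... the actual application work would run here ...
	time.Sleep(time.Second)

	var stats runtime.MemStats
	runtime.ReadMemStats(&stats)
	fmt.Printf("next GC target: %d MiB\n", stats.NextGC/(1<<20))

	// Keep the ballast reachable for the lifetime of the process so the GC
	// never reclaims it.
	runtime.KeepAlive(ballast)
}
```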

@boostchicken
Member Author

Ahhh, ok, thank you for the education. I have read that before but never had a use for it; I can see how it's critical here. How do we test this better? If my PRs fail on it, what do I do?

@jpkrohling
Member

We've been simply bumping this from time to time, with the idea that we'd catch big (>10%) regressions from individual PRs. The price we pay is that regular tests start failing after some time, as our memory footprint naturally increases.

I'm not sure if there's a threshold where we would stop and investigate why we are consuming ~90MB of memory, as I'm not sure we should really be recommending the contrib distribution to anyone with its current size.

@boostchicken
Member Author

I agree, nobody should take the kitchen sink. There are distro builders for this very purpose, though. Or do we need to develop a way to help people pick and choose better within the contrib repo?

@jpkrohling
Member

There was a discussion in the past, perhaps even recorded as an issue, about building a web interface for the builder, a la https://code.quarkus.io, where people can pick and choose the components they want/need and get a binary as a result. We are still a bit far from that, though.

@jpkrohling jpkrohling self-assigned this Dec 10, 2021
@jpkrohling jpkrohling changed the title TestBallastMemory randomly fails still, even with the old fix. Also, the output and details of the failures are meaningless. TestBallastMemory randomly fails still, even with the old fix Dec 10, 2021
jpkrohling referenced this issue in jpkrohling/opentelemetry-collector-contrib Dec 10, 2021
This also improves the error message, making it clear that the usage is 10% higher than the max.

Fixes #6535

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
bogdandrutu pushed a commit that referenced this issue Dec 11, 2021
This also improves the error message, making it clear that the usage is 10% higher than the max.

Fixes #6535

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
PaurushGarg referenced this issue in open-o11y/opentelemetry-collector-contrib on Dec 11 and Dec 14, 2021 (same commit message as above)
povilasv referenced this issue in coralogix/opentelemetry-collector-contrib Dec 19, 2022
Extended the documentation of the component to include safe Shutdown expectations.

Co-authored-by: Alex Boten <alex@boten.ca>