Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Athens leaks defunct zombie git processes #1155

Closed
bradfitz opened this issue Mar 23, 2019 · 14 comments · Fixed by #1190
Closed

Athens leaks defunct zombie git processes #1155

bradfitz opened this issue Mar 23, 2019 · 14 comments · Fixed by #1190
Labels
bug Something isn't working

Comments

@bradfitz
Copy link

bradfitz commented Mar 23, 2019

Describe the bug

I've had an Athens (0.2) pod running in Kubernetes since 24 Jan 2019. It's leaked 3,675 zombie processes since then.

Error Message

    PID TTY      STAT   TIME COMMAND
3136069 ?        Ss     0:00 /bin/sh
3136083 ?        R+     0:00  \_ ps afx
3136084 ?        R+     0:00  \_ less
      1 ?        Ssl  1216:37 /bin/athens-proxy -config_file=/config/config.toml
  14066 ?        Z      0:00 [git-remote-http] 
  16603 ?        Z      0:00 [git-remote-http] 
  16817 ?        Z      0:00 [git-remote-http] 
  21024 ?        Z      0:00 [git-remote-http] 
  21078 ?        Z      0:00 [git-remote-http] 
  21368 ?        Z      0:00 [git-remote-http] 
  21419 ?        Z      0:00 [git-remote-http] 
  21737 ?        Z      0:00 [git-remote-http] 
  21778 ?        Z      0:00 [git-remote-http] 
  21826 ?        Z      0:00 [git-remote-http] 
  21862 ?        Z      0:00 [git-remote-http] 
  30244 ?        Z      0:00 [git-remote-http] 
  32825 ?        Z      0:00 [git-remote-http] 
  33181 ?        Z      0:00 [git-remote-http] 
  34421 ?        Z      0:00 [git-remote-http] 
  35672 ?        Z      0:00 [git-remote-http] 
  44237 ?        Z      0:00 [git-remote-http] 
  44265 ?        Z      0:00 [git] 
  47997 ?        Z      0:00 [git-remote-http] 
  48175 ?        Z      0:00 [git-remote-http] 
....

/ # ps afx | grep defunct | wc -l
3675

To Reproduce

Run Athens for a bit.

Expected behavior

Processes are waited upon and don't leave zombies.

Environment (please complete the following information):

  • OS: linux
  • Go version : whatever the official 0.2 container was built with
  • Proxy version : 0.2
  • Storage (fs/mongodb/s3 etc.) : fs
@marpio marpio added the bug Something isn't working label Mar 23, 2019
@marwan-at-work
Copy link
Contributor

marwan-at-work commented Mar 23, 2019

@bradfitz Athens does not use git directly. It shells out to go which uses git internally.
The only two commands that Athens uses which could end up using git are

  1. go mod download here: https://github.com/gomods/athens/blob/master/pkg/module/go_get_fetcher.go#L131
  2. go list -m here: https://github.com/gomods/athens/blob/master/pkg/download/upstream_lister.go#L52

In both cases, we run cmd.Run() which waits for the go command to exit. Is there any more cleaning up we'd have to do on the Athens side to ensure children processes are cleaned up?

Otherwise, I imagine this might be a bug in go modules itself?

@arschles
Copy link
Member

Also iirc Athens v0.2.0 is built with go version 1.11.x (I think 1.11.4 but not at my machine to confirm), in case that helps

@marwan-at-work
Copy link
Contributor

@arschles Athens v0.2.0 uses golang:1.11-alpine while Athens v0.3.1 uses golang:1.12-alpine here

If I'm looking at the code from 1.11 correctly, running git commands for modules comes from here which seems to use Run as well.

Hopefully I'm not looking at the right code because otherwise I'm not sure where the processes are being abandoned.

@bradfitz is there any way to see the actual command that ran those abandoned processes?

Thanks

@bradfitz
Copy link
Author

Actually I noticed that athens-proxy is pid 1 above.

I suspect this is just a packaging issue in your container.

You should probably wrap athens-proxy in https://github.com/krallin/tini (see golang/go#23705 where Go discovered this too for some of our services)

@marwan-at-work
Copy link
Contributor

@bradfitz

Disclaimer: I'm using this issue as a learning moment because it's the first time I hear about Docker/GKE not cleaning up properly finished processes. So thank you very much for taking the time to report the issue and help out :)

As for your last comment,

I suspect this is just a packaging issue in your container.

Do you mean we should just include tini? Or do you mean there's an actual mistake in our Dockerfile?

Here is the Dockerfile we use for reference: https://github.com/gomods/athens/blob/master/cmd/proxy/Dockerfile

According to Tini's author in What is the advantage of Tini?, if the code knows not to leave defunct processes then you don't need Tini. Tini is mainly for running processes that you don't have control over like Jenkins.

In our case we do have control over both Athens and Go. So my thinking is that Tini would work but it would not solve the underlying solution of some bad code somewhere not properly waiting on child processes. The only possibility I see is that git might run a child git command and doesn't wait for it, but that sounds a lil ridiculous.

Furthermore, I saw the following in the tini README:

NOTE: If you are using Docker 1.13 or greater, Tini is included in Docker itself. This includes all versions of Docker CE. To enable Tini, just pass the --init flag to docker run.

Can you just pass the --init flag when you run Athens in your environment?

Which makes me think that --init should always be defaulted to true on the Docker side no? Because if all it does is fix an issue that has nothing to do with the software we write, then why not just default it to true.

Lastly, the issue you linked seemed to be a GKE problem here: golang/go#23705 (comment)

Could this mean that GKE might have this issue fixed and all you need to do is update the cluster?

Lastly, I'm obviously very happy to include tini in our Dockerfile, but I'd love to know why and if there's a possibility of fixing some code somewhere, that'd be more ideal (we can include tini here for now, and hunt for the real problem down the line).

It also seems like tini wouldn't hurt to be included on the Athens Dockerfile in case someone else is using an old GKE cluster or an old Docker version...but it still blows my mind a tiny bit that we have to worry about this type of stuff on the application layer.

Thanks again!

@jbub
Copy link

jbub commented Mar 27, 2019

Hmm i think i saw those defunct git processes at some point with 0.2, havent seen that happening since i have updated to 0.3. It was maybe related to this census-instrumentation/opencensus-go#1046.

@jbub
Copy link

jbub commented Mar 27, 2019

Nevermind i just checked and can see 33 zombie git processes with gomods/athens:v0.3.1.

@arschles
Copy link
Member

arschles commented Apr 8, 2019

What do folks think about this problem? @bradfitz did you happen to update GKE and see if that fixed the problem? Otherwise, can we give you a custom image, which executes Athens with tini, to try in your cluster?

@marwan-at-work
Copy link
Contributor

@bradfitz v0.4.0 is released which has tini in the Dockerfile (docker pull gomods/athens:v0.4.0) -- feel free to check it out and please let us know if the issue still persists.

@atomi
Copy link

atomi commented Jun 1, 2019

I personally have done something similar to https://github.com/ramr/go-reaper in my own containers.
It may be an option here.

@arschles
Copy link
Member

arschles commented Jun 7, 2019

@atomi #1260 has a fix for this. Can you take a look and see if it's similar to what you were thinking?

@atomi
Copy link

atomi commented Jun 7, 2019

@arschles I don't have any issues with tini. It makes a lot of sense since Athens already has dependencies from alpine such as vcs and openssh packages.
I think using tini in this case is more in line with "keeping it simple." Apologies for adding extra noise here. You're all good. 👍

@arschles
Copy link
Member

@atomi that's all good! I'm really glad you brought up your go-reaper project 😄, I'll go check it out sometime. Cheers!

@atomi
Copy link

atomi commented Jun 11, 2019

@arschles It's not my project. I've just used it in the past before --init was available.

It' was an interesting solution at the time. Anyway, Cheers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants