Tests running on test-docker* machines get terminated mid-run #2888

Closed
smlambert opened this issue Jan 19, 2023 · 25 comments

Comments

@smlambert
Contributor

Please set the title to indicate the test name and machine name where known.

To make it easy for the infrastructure team to repeat and diagnose, please
answer the following questions:

Any other details:

05:00:47  LT  10:00:43.399 - Completed 26.7%. Number of tests started=18351 (+1960)
05:01:06  LT  10:01:03.425 - Completed 33.4%. Number of tests started=21426 (+3075)
05:01:24  LT  10:01:23.448 - Completed 40.1%. Number of tests started=24242 (+2816)
05:01:25  settings.mk:356: recipe for target 'extended.system-..' failed
05:01:25  make[1]: *** [extended.system-..] Terminated
05:01:25  makefile:49: recipe for target '_extended.system' failed
05:01:25  make: *** [_extended.system] Terminated
05:01:25  Terminated
05:01:25  /home/jenkins/workspace/Test_openjdk17_hs_extended.system_aarch64_linux/aqa-tests/TKG/../TKG/settings.mk:356: recipe for target 'extended.system-system' failed
05:01:25  make[2]: *** [extended.system-system] Terminated
05:01:25  /home/jenkins/workspace/Test_openjdk17_hs_extended.system_aarch64_linux/aqa-tests/TKG/../TKG/settings.mk:356: recipe for target 'extended.system-otherLoadTest' failed
05:01:25  make[3]: *** [extended.system-otherLoadTest] Terminated
05:01:25  autoGen.mk:54: recipe for target 'MiniMix_5m_0' failed
05:01:25  make[4]: *** [MiniMix_5m_0] Terminated

There were several cases seen during release triage, I will add more examples to this issue shortly.

@smlambert
Contributor Author

@andrew-m-leonard
Contributor

@smlambert I suspect the begin/end process cleanup logic. I've run some process queries on docker container node test-docker-ubuntu1804-armv8l-4 and it shows 2 Jenkins Agents visible, which would seem to imply this docker container can see processes in the other containers on the same host, dockerhost-equinix-ubuntu2004-armv8-1?

Agents from CONTAINER test-docker-ubuntu1804-armv8l-4:
12:hugetlb:/docker/a96fd12a 40 S ? 01:38:18 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/a96fd12a 2768648 S ? 00:36:42 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache

Agents from HOST dockerhost-equinix-ubuntu2004-armv8-1:
12:hugetlb:/docker/a96fd12a 198369 S ? 00:36:46 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/a0717563 1310315 S ? 01:36:18 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/a96fd12a 1310331 S ? 01:38:20 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/a0717563 1310332 S ? 01:34:01 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/1bfd2a03 1310338 S ? 01:29:20 /usr/lib/jvm/jdk17/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/412c7c34 1310350 S ? 01:37:29 /usr/lib/jvm/jdk17/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/db948949 1310395 S ? 01:33:24 /usr/lib/jvm/jdk17/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/6e5b2bb3 1310422 S ? 01:35:19 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache

@andrew-m-leonard
Contributor

andrew-m-leonard commented Jan 20, 2023

What looks odd above is that some containers appear to have 2 Jenkins Agents. These 2 containers:
docker/a96fd12a
docker/a0717563

Whereas these containers only have 1 jenkins Agent:
docker/1bfd2a03
docker/412c7c34
docker/db948949
docker/6e5b2bb3

If Jenkins schedules 2 tasks within the same container, they could end up terminating each other's processes?

@andrew-m-leonard
Contributor

Using this command from Scripting Console: println "ps -o cgroup,pid,state,tname,time,command -u jenkins".execute().text
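
A minimal Python sketch of how the output of that ps command could be checked for containers hosting more than one agent. It assumes the column order cgroup, pid, state, tname, time, command from the command above, and that an agent JVM can be recognised by remoting.jar in its command line. The function names and the abbreviated sample rows are illustrative and not part of any existing tooling.

```python
from collections import defaultdict

# Rows modelled on the HOST listing earlier in this thread, with the java
# command lines abbreviated for readability.
SAMPLE = """\
12:hugetlb:/docker/a96fd12a 198369 S ? 00:36:46 /usr/lib/jvm/jdk8/bin/java -Xmx512m -jar remoting.jar
12:hugetlb:/docker/a0717563 1310315 S ? 01:36:18 /usr/lib/jvm/jdk8/bin/java -Xmx512m -jar remoting.jar
12:hugetlb:/docker/a96fd12a 1310331 S ? 01:38:20 /usr/lib/jvm/jdk8/bin/java -Xmx512m -jar remoting.jar
12:hugetlb:/docker/a0717563 1310332 S ? 01:34:01 /usr/lib/jvm/jdk8/bin/java -Xmx512m -jar remoting.jar
12:hugetlb:/docker/1bfd2a03 1310338 S ? 01:29:20 /usr/lib/jvm/jdk17/bin/java -Xmx512m -jar remoting.jar
12:hugetlb:/docker/412c7c34 1310350 S ? 01:37:29 /usr/lib/jvm/jdk17/bin/java -Xmx512m -jar remoting.jar
"""


def agents_per_container(ps_output):
    """Map docker container ID -> list of PIDs of agent JVMs seen in ps output."""
    agents = defaultdict(list)
    for line in ps_output.splitlines():
        # Split into at most 6 fields so the command keeps its spaces.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        cgroup, pid, _state, _tty, _time, command = fields
        if not pid.isdigit():
            continue  # skip a header row or anything malformed
        if "/docker/" not in cgroup:
            continue  # not running inside a docker cgroup
        # Assumption: count the agent JVM itself, not its "sh -c" wrapper.
        if "remoting.jar" not in command or command.startswith("sh -c"):
            continue
        container = cgroup.rsplit("/docker/", 1)[1]
        agents[container].append(int(pid))
    return dict(agents)


def shared_containers(ps_output):
    """Return only the containers that host more than one agent."""
    return {c: pids for c, pids in agents_per_container(ps_output).items()
            if len(pids) > 1}


if __name__ == "__main__":
    for container, pids in shared_containers(SAMPLE).items():
        print(container, pids)
```

Run against the sample rows, this flags the same two containers called out above (a96fd12a and a0717563) and leaves the single-agent containers alone.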

@smlambert
Contributor Author

https://ci.adoptopenjdk.net/job/Test_openjdk19_hs_special.functional_aarch64_linux/46/console

07:13:19  compile:
07:13:19       [echo] Ant version is Apache Ant(TM) version 1.10.5 compiled on July 10 2018
07:13:19       [echo] ============COMPILER SETTINGS============
07:13:19       [echo] ===fork:				yes
07:13:19       [echo] ===executable:			/home/jenkins/workspace/Test_openjdk19_hs_special.functional_aarch64_linux/openjdkbinary/j2sdk-image/bin/javac
07:13:19       [echo] ===debug:				on
07:13:19       [echo] ===destdir:				/home/jenkins/workspace/Test_openjdk19_hs_special.functional_aarch64_linux/aqa-tests/TKG/../../jvmtest/functional/MBCS_Tests/new_jp_era
07:13:19      [javac] Compiling 1 source file to /home/jenkins/workspace/Test_openjdk19_hs_special.functional_aarch64_linux/aqa-tests/functional/MBCS_Tests/new_jp_era/bin
07:13:20  make[1]: *** [compile.mk:45: compile] Terminated
07:13:20  make: *** [makefile:67: compile] Terminated
07:13:20  Terminated
07:13:20  143

@smlambert
Contributor Author

https://ci.adoptopenjdk.net/job/Test_openjdk19_hs_extended.system_aarch64_linux/163/console

07:09:11      [javac] 			assertTrue("Expected \"" + b.get(new Integer(x)).getString() + "success\" but found \"" + a.get(new Integer(x)).getString(),a.get(new Integer(x)).getString().equals(b.get(new Integer(x)).getString() + "success"));
07:09:11      [javac] 			                                                                                                                                  ^
07:09:11  make[1]: *** [compile.mk:45: compile] Terminated
07:09:11  make: *** [makefile:67: compile] Terminated
07:09:11  Terminated
07:09:11  143

@smlambert
Contributor Author

https://ci.adoptopenjdk.net/job/Test_openjdk19_hs_sanity.system_aarch64_linux/163/

11:29:21  LT  16:29:20.093 - Completed 6.7%. Number of tests started=193174
11:29:35  make[1]: *** [settings.mk:356: sanity.system-..] Terminated
11:29:35  make: *** [makefile:50: _sanity.system] Terminated
11:29:35  Terminated

@smlambert
Contributor Author

https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_aarch64_linux/110/

09:28:29  make[1]: *** [extended.openjdk-..] Terminated
09:28:29  makefile:49: recipe for target '_extended.openjdk' failed
09:28:29  make: *** [_extended.openjdk] Terminated
09:28:29  Terminated
09:28:29  /home/jenkins/workspace/Test_openjdk11_hs_extended.openjdk_aarch64_linux/aqa-tests/TKG/../TKG/settings.mk:356: recipe for target 'extended.openjdk-openjdk' failed
09:28:29  make[2]: *** [extended.openjdk-openjdk] Terminated
09:28:29  autoGen.mk:1694: recipe for target 'jdk_management_1' failed
09:28:29  make[3]: *** [jdk_management_1] Terminated

@smlambert
Contributor Author

https://ci.adoptopenjdk.net/job/Test_openjdk8_hs_extended.system_aarch64_linux/773/console

11:32:32  LT  16:32:31.212 - Starting thread. Suite=0 thread=9
11:32:54  LT  16:32:51.240 - Completed 6.7%. Number of tests started=3759
11:33:01  settings.mk:356: recipe for target 'extended.system-..' failed
11:33:01  make[1]: *** [extended.system-..] Terminated
11:33:01  makefile:49: recipe for target '_extended.system' failed
11:33:01  make: *** [_extended.system] Terminated
11:33:01  Terminated
11:33:01  /home/jenkins/workspace/Test_openjdk8_hs_extended.system_aarch64_linux/aqa-tests/TKG/../TKG/settings.mk:356: recipe for target 'extended.system-system' failed
11:33:01  make[2]: *** [extended.system-system] Terminated
11:33:01  /home/jenkins/workspace/Test_openjdk8_hs_extended.system_aarch64_linux/aqa-tests/TKG/../TKG/settings.mk:356: recipe for target 'extended.system-otherLoadTest' failed
11:33:01  make[3]: *** [extended.system-otherLoadTest] Terminated
11:33:01  autoGen.mk:54: recipe for target 'MiniMix_5m_0' failed
11:33:01  make[4]: *** [MiniMix_5m_0] Terminated
11:33:01  STF 16:33:01.532 - **FAILED** Process LT  ended with exit code (143) and not the expected exit code/s (0)
11:33:01  STF 16:33:01.532 - Monitoring Report Summary:
11:33:01  STF 16:33:01.533 -   o Process LT  ended with exit code (143) and not the expected exit code/s (0)
11:33:01  STF 16:33:01.533 - Killing processes: LT 
11:33:01  STF 16:33:01.533 -   o Process LT  pid 3508694 is not running
11:33:01  **FAILED** at step 1 (Run mixed unit tests). Expected return value=0 Actual=1 at /home/jenkins/workspace/Test_openjdk8_hs_extended.system_aarch64_linux/aqa-tests/TKG/../TKG/output_16740595478139/MiniMix_5m_0/20230118-163229-MixedLoadTest/execute.pl line 94.
11:33:01  STF 16:33:01.732 - **FAILED** execute script failed. Expected return value=0 Actual=1
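
The 143 in these logs is consistent with the process having received SIGTERM: by shell convention, a process killed by a signal is reported with 128 plus the signal number, and SIGTERM is 15. That fits something external terminating the tests rather than the tests failing on their own. A small Python sketch of that decoding follows; the helper name is illustrative only.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status (values above 128 mean 'killed by signal')."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "signal %d" % signum
        return "terminated by %s" % name
    return "exited with code %d" % status


if __name__ == "__main__":
    print(describe_exit_status(143))
```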

@andrew-m-leonard
Contributor

@smlambert I am fairly sure this is the process cleanup killing visible Jenkins processes from other containers. There must be something special about the container environment here that I need to investigate.
Would you like me to disable the kill logic with a PR for the release branch?

@smlambert
Contributor Author

Thanks @andrew-m-leonard, yes please. I will keep adding examples to this issue as I find them, in case it helps us find a revised solution.

@sxa
Member

sxa commented Jan 25, 2023

https://ci.adoptopenjdk.net/computer/test%2Ddocker%2Dubuntu1804%2Darmv8l%2D2/
and https://ci.adoptopenjdk.net/computer/test%2Ddocker%2Dubuntu1804%2Darmv8l%2D4
were both on the same container so were interfering with each other. I've shut down the first of those so it shouldn't recur there.

https://ci.adoptopenjdk.net/job/Test_openjdk19_hs_sanity.system_aarch64_linux/163/consoleFull which was run on https://ci.adoptopenjdk.net/computer/test%2Ddocker%2Dfedora35%2Darmv8l%2D1 does not have the same problem, so that issue is not the same. Apart from this one, has this occurred anywhere other than on the two "duplicate" agent definitions which I've resolved?

@andrew-m-leonard
Contributor

@andrew-m-leonard Have you implemented any additional process killing in the build stuff anywhere?
No

I've just looked on test-docker-fedora35-armv8l-1, and it looks as though it has multiple Jenkins Agents as well?

12:hugetlb:/docker/a0717563      21 S ?        00:22:47 sshd: jenkins@notty
12:hugetlb:/docker/a0717563      31 S ?        00:24:14 sshd: jenkins@notty
12:hugetlb:/docker/a0717563      40 S ?        00:00:00 sh -c cd "/home/jenkins" && /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/a0717563      41 S ?        02:09:28 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/a0717563      53 S ?        00:00:00 sh -c cd "/home/jenkins" && /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
12:hugetlb:/docker/a0717563      54 S ?        02:13:57 /usr/lib/jvm/jdk8/bin/java -Xmx512m -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache

I can't check ps on test-docker-ubi8-armv8-1

The aqa-tests teminateProcess.sh logic assumes that if:

  • node is running inside a docker container: any remnant jenkins test processes should be cleaned up
  • node is a linux host: filter out any processes that are running in a docker cgroup

So my suspicion is that docker containers serving 2 Jenkins agents will kill off each other's processes.
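
To make the failure mode concrete, here is a rough Python model of the two rules above. It is not the actual aqa-tests script: the Proc type, the cleanup_targets function, and the assumption that agent JVMs are spared by matching remoting.jar are all hypothetical, chosen only to show how the "inside a container" rule behaves when two agents share one container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    pid: int
    cgroup: str
    command: str


def cleanup_targets(procs, inside_container):
    """Return the PIDs a cleanup pass would terminate under the two rules.

    - inside a docker container: every remnant jenkins test process is a target
    - on a linux host: processes running in a docker cgroup are filtered out
    """
    targets = []
    for p in procs:
        if "remoting.jar" in p.command:
            continue  # assumption: the agent JVM itself is never a target
        if not inside_container and "/docker/" in p.cgroup:
            continue  # host rule: leave container processes alone
        targets.append(p.pid)
    return targets


# Two agents sharing container a0717563 (PIDs 41 and 54 as in the listing
# above); PID 500 is a hypothetical test JVM started by the first agent's job.
PROCS = [
    Proc(41, "12:hugetlb:/docker/a0717563", "java -jar remoting.jar"),
    Proc(54, "12:hugetlb:/docker/a0717563", "java -jar remoting.jar"),
    Proc(500, "12:hugetlb:/docker/a0717563", "java MixedLoadTest"),
]

if __name__ == "__main__":
    # Cleanup run by the second agent's job inside the shared container:
    print(cleanup_targets(PROCS, inside_container=True))
    # The same process table viewed from the docker host:
    print(cleanup_targets(PROCS, inside_container=False))
```

In this model the second agent's cleanup selects PID 500, a test that the first agent's job is still running, while the host rule selects nothing. With one agent per container the container rule is harmless, because any test process found at cleanup time really is a remnant.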

@andrew-m-leonard
Contributor

andrew-m-leonard commented Jan 27, 2023

@sxa There are two Jenkins agents launched in the container on port 2235 of host 147.75.35.203, because there are two Node definitions targeting that container.
Correction: these two are the same container, 147.75.35.203 port 2235:
test-docker-fedora35-armv8l-1
test-docker-ubuntu2110-armv8l-1

@andrew-m-leonard
Contributor

andrew-m-leonard commented Jan 27, 2023

These two are also running on the same docker container:
container 139.178.86.243 port 2236:
[image]

@andrew-m-leonard
Contributor

@sxa The premise for the process cleaning on a "host" is the assumption that Jenkins-owned Test processes should be terminated if found running. I am thinking that assumption is incorrect: a host could have multiple "Executors" (Agents), all running independent Test job processes under the jenkins user, and under this assumption they could incorrectly terminate each other's processes?

The docker containers with multiple Agents illustrate the same problem, although I suspect that is not intentional. @steelhead31 The above Node definitions using the same containers don't seem right?
147.75.35.203 port 2235:
test-docker-fedora35-armv8l-1
test-docker-ubuntu2110-armv8l-1
139.178.86.243 port 2236:
test-docker-fedora36-aarch64-1
test-docker-fedora36-armv8-1

I suspect test-docker-ubi8-armv8-1 on 139.178.86.243 port 2247 has a duplicate, although I have not found one. It's not easy to search all node configurations by host and port.

@steelhead31
Contributor

@andrew-m-leonard I'll have a look at these duplicates, I suspect something has gone awry. I can probably also find the duplicates via the Jenkins API :)
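
Once the node definitions have been exported (how they are fetched from Jenkins is left out here), spotting duplicates comes down to grouping node names by host and port. A minimal Python sketch, using only the host/port pairs quoted in this thread; the NODES list layout and the function name are assumptions for illustration.

```python
from collections import defaultdict

# (node name, host, ssh port) as quoted earlier in this thread.
NODES = [
    ("test-docker-fedora35-armv8l-1", "147.75.35.203", 2235),
    ("test-docker-ubuntu2110-armv8l-1", "147.75.35.203", 2235),
    ("test-docker-fedora36-aarch64-1", "139.178.86.243", 2236),
    ("test-docker-fedora36-armv8-1", "139.178.86.243", 2236),
    ("test-docker-ubi8-armv8-1", "139.178.86.243", 2247),
]


def find_duplicate_endpoints(nodes):
    """Return {(host, port): [node names]} for endpoints used by more than one node."""
    by_endpoint = defaultdict(list)
    for name, host, port in nodes:
        by_endpoint[(host, port)].append(name)
    return {endpoint: sorted(names)
            for endpoint, names in by_endpoint.items()
            if len(names) > 1}


if __name__ == "__main__":
    for (host, port), names in find_duplicate_endpoints(NODES).items():
        print("%s port %d: %s" % (host, port, ", ".join(names)))
```

With only the definitions listed so far, test-docker-ubi8-armv8-1 is not reported, which matches no duplicate having been found for it yet.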

@steelhead31 steelhead31 self-assigned this Jan 27, 2023
@steelhead31
Contributor

I'm currently performing an audit of all the docker nodes in Jenkins. Once I have this, we can remove any defunct ones, identify any duplicates and sort those out too. Once this is done we can retry some tests and determine any further actions.

@steelhead31
Contributor

I've produced an audit of the docker related hosts and machines:

https://drive.google.com/file/d/1hNtQ_BOrAfV4FWj961dgT8hH9zw4EFcn/view?usp=sharing

@steelhead31
Contributor

steelhead31 commented Jan 30, 2023

We have 6 machines / 3 duplicates, which looks to be caused by labelling. I'll remove the duplicates from Jenkins.

test-docker-ubuntu2004-aarch64-1
test-docker-ubuntu2004-armv8-1
test-docker-alma8-aarch64-1
test-docker-ubuntu2204-armv8-1
test-docker-fedora36-aarch64-1
test-docker-fedora36-armv8-1

@steelhead31
Contributor

Now removed (as these 3 are duplicates):
test-docker-ubuntu2004-aarch64-1

test-docker-ubuntu2204-armv8-1

test-docker-fedora36-aarch64-1

@sxa
Member

sxa commented Jan 30, 2023

since a host could have multiple "Executors"(Agents) hence all running independent Test job processes under the jenkins user

Just to be clear on this, no systems labelled for test should have more than one executor. If they do, it's definitely a bug that needs to be resolved, so thanks Scott for dealing with these :-)

@steelhead31 steelhead31 added this to the 2023-01 (January) milestone Jan 31, 2023
@steelhead31
Contributor

I've resolved the docker agent/multiple executors problem, and successfully run several test suites on these problematic machines without issue. I'll close this issue for now. @smlambert if you find any more occurrences of this after today, please let me know.
