changing the way core files are stored in SmartOS VMs so that we can actually use them #492

Closed
misterdjules opened this issue Sep 15, 2016 · 21 comments

Comments

@misterdjules
Contributor

This is related to nodejs/node#7649, where a test running on SmartOS made a node process abort, and thus made the system generate a core file.

That core file could have helped us root-cause the problem, but unfortunately it was no longer available: by default, core files are stored in the global zone of the server on which the VM runs, and they're deleted after one week.

Another problem with the default setup for core file storage with SmartOS test VMs is that only Triton cloud's operators can access core files stored in the server's global zone.

What we could do is set the configuration of every SmartOS test VM so that:

  1. core files are stored somewhere that build WG members can access
  2. core files are cleaned up after some time (e.g. 1 week) by a cron job to avoid filling up space

This way, when a test failure happens due to a node process aborting on a SmartOS test VM, the person who ran the CI test job can ask a member of the build WG to get the core file and, e.g., upload it to Manta so that it can be inspected with mdb_v8.

Regardless of whether the file needs to be inspected, it would be deleted after a week, and not fill up space on test VMs.

Does that sound like a useful thing to do? If so I can help set it up, just let me know.

@jbergstroem
Member

How about configuring core dump directory to /home/iojs/dump or smth together with a cron job then? (find -mtime -delete)

@jbergstroem
Member

I'm not very well versed in coreadm, but if you have a suggested route wrt dumping to another directory I can update playbooks.

@misterdjules
Contributor Author

How about configuring core dump directory to /home/iojs/dump or smth together with a cron job then? (find -mtime -delete)

That's exactly what I'm suggesting, with the nit that I would choose a different name for this directory: /home/iojs/cores instead of /home/iojs/dump, but I don't have a strong opinion about it.
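A minimal sketch of the cleanup job discussed above. The function name `prune_cores`, the `core.*` file name filter, and the script path in the crontab comment are assumptions for illustration, not something taken from the playbooks. It uses `-exec rm -f {} +` rather than `-delete` so that it does not depend on a particular find implementation being installed on the host.

```shell
#!/bin/sh
# Hypothetical helper: delete core files older than a given number of days.
# Usage: prune_cores DIRECTORY [DAYS]   (DAYS defaults to 7)
prune_cores() {
    dir="$1"
    days="${2:-7}"
    # -mtime +N matches files last modified more than N days ago.
    # Only regular files named core.* are removed; the directory is kept.
    find "$dir" -type f -name 'core.*' -mtime +"$days" -exec rm -f {} +
}

# Example crontab entry (assumed script location), running daily at 03:00:
#   0 3 * * * /home/iojs/bin/prune-cores.sh
# where that script sources this function and calls:
#   prune_cores /home/iojs/cores 7
```

The one-week retention matches the period proposed in the issue; the second argument can be changed if cores need to be kept longer.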

I'm not very well versed in coreadm, but if you have a suggested route wrt dumping to another directory I can update playbooks.

Running coreadm -g /home/iojs/cores/core.%f.%p -e global will enable global core files and will store them in /home/iojs/cores. It will also persist across reboots.

We should also make sure that ulimit -c outputs unlimited for the user under which the tests run.
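The two steps above could be scripted roughly as follows. This is a sketch, not the actual playbook content: the function names and the `RUN` dry-run switch are assumptions added for illustration. By default it only prints the commands, so it can be reviewed before being run as root on a SmartOS VM. In the coreadm pattern, `%f` expands to the executable file name and `%p` to the process ID.

```shell
#!/bin/sh
# Directory suggested in the thread; override via the environment if needed.
CORE_DIR="${CORE_DIR:-/home/iojs/cores}"
# Dry-run by default: commands are printed, not executed.
# Set RUN="" in the environment to actually execute them (requires root).
RUN="${RUN-echo}"

# Hypothetical helper: create the directory and enable global core files.
setup_cores() {
    $RUN mkdir -p "$CORE_DIR"
    $RUN coreadm -g "$CORE_DIR/core.%f.%p" -e global
}

# Hypothetical helper: succeeds only if the current user's core file size
# limit is unlimited. Run it as the user under which the tests run.
core_limit_ok() {
    [ "$(ulimit -c)" = "unlimited" ]
}
```

Calling `setup_cores` with the defaults prints the `mkdir` and `coreadm` invocations; `core_limit_ok || echo "core size limit is not unlimited"` can then be run as the test user to verify the ulimit setting.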

@jbergstroem
Member

Cool. I'm actually in the process of setting up smartos15 and 16 hosts. I'll look at incorporating this in the playbook.

@misterdjules changed the title from "changing the way core files are stored in SmartOS VMs so that we can use actually use them" to "changing the way core files are stored in SmartOS VMs so that we can actually use them" on Sep 16, 2016
@jbergstroem
Member

Sorry for the delay here -- I'm a bit swamped at the moment. If anyone wants to chip in with improving playbooks for smartos14..16 that would be appreciated. I can spin up test machines if required.

Also, protip from @chorrell -- we can disable SmartLogin (solving shared key access boundaries) by doing something along the lines of:

  • remove/comment PubKeyPlugin libsmartsshd.so in /etc/ssh/sshd_config
  • restart sshd (svcadm restart sshd)
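The two bullet points above can be sketched as a small script. The function name `disable_smartlogin` and the `.bak` backup file are assumptions for illustration; the path to the sshd configuration file is passed as an argument and should be verified on the host before use.

```shell
#!/bin/sh
# Hypothetical helper: comment out the PubKeyPlugin line in an sshd config.
# Usage: disable_smartlogin /path/to/sshd_config
disable_smartlogin() {
    conf="$1"
    # Keep a backup, then rewrite the original file in place so that its
    # ownership and permissions are preserved.
    cp "$conf" "$conf.bak" || return 1
    # Prefix any uncommented PubKeyPlugin line with '#'.
    # Lines that are already commented out do not match and stay unchanged.
    sed 's/^\([[:space:]]*\)\(PubKeyPlugin[[:space:]]\)/\1#\2/' "$conf.bak" > "$conf"
}

# After editing the real configuration file, restart the service:
#   svcadm restart sshd
```

Keeping an existing SSH session open while restarting sshd is a sensible precaution, since a mistake in the configuration could otherwise lock out access to the VM.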

@jbergstroem
Member

So, I spent some time on smartos today. Looks like we're running into issues with your openjdk8:

# /opt/local/java/openjdk8/bin/java -Xmx128m -jar slave.jar -jnlpUrl https://ci.nodejs.org/computer/test-joyent-smartos15-x64-1/slave-agent.jnlp -secret foo
Exception in thread "main" java.lang.Error: Error during hash calculation
        at sun.security.ssl.HandshakeHash.getFinishedHash(HandshakeHash.java:249)
        at sun.security.ssl.HandshakeMessage$Finished.getFinished(HandshakeMessage.java:1952)
        at sun.security.ssl.HandshakeMessage$Finished.<init>(HandshakeMessage.java:1899)
        at sun.security.ssl.ClientHandshaker.sendChangeCipherAndFinish(ClientHandshaker.java:1214)
        at sun.security.ssl.ClientHandshaker.serverHelloDone(ClientHandshaker.java:1134)
        at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:348)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
        at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
        at hudson.remoting.Launcher.parseJnlpArguments(Launcher.java:269)
        at hudson.remoting.Launcher.run(Launcher.java:219)
        at hudson.remoting.Launcher.main(Launcher.java:192)
Caused by: java.lang.RuntimeException: Could not clone digest
        at sun.security.ssl.HandshakeHash.cloneDigest(HandshakeHash.java:194)
        at sun.security.ssl.HandshakeHash.getFinishedHash(HandshakeHash.java:247)
        ... 17 more
Caused by: java.lang.CloneNotSupportedException: SHA-384
        at sun.security.pkcs11.P11Digest.clone(P11Digest.java:316)
        at java.security.MessageDigest$Delegate.clone(MessageDigest.java:560)
        at sun.security.ssl.HandshakeHash.cloneDigest(HandshakeHash.java:191)
        ... 18 more
Caused by: sun.security.pkcs11.wrapper.PKCS11Exception: CKR_STATE_UNSAVEABLE
        at sun.security.pkcs11.wrapper.PKCS11.C_GetOperationState(Native Method)
        at sun.security.pkcs11.P11Digest.clone(P11Digest.java:311)
        ... 20 more

I don't have more time to look into that, but an up-to-date playbook is available in my repo.

@jbergstroem
Member

..same happens for smartos16.

@chorrell

chorrell commented Oct 3, 2016

That's odd. I use a base-64-lts 15.4.0 with jenkins for Joyent image builds and openjdk8 works fine on that node. Is there more than one version of Java installed, like a sun-jre package?

@jbergstroem
Member

@chorrell no, just openjdk8

@chorrell

chorrell commented Oct 3, 2016

This might be relevant: https://www.illumos.org/issues/7227

@chorrell

chorrell commented Oct 3, 2016

So maybe:

sun.security.pkcs11.enable-solaris=false

@jbergstroem
Member

@chorrell: "-Dsun.security.pkcs11.enable-solaris=false" is not a valid option

@jbergstroem
Member

@chorrell sorry, it does work -- I just messed up ordering.
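The thread does not say exactly what the ordering mistake was. One plausible cause, stated here as an assumption, is that the `-D` option was placed after `-jar slave.jar`: anything after the jar file is passed to the application as an argument rather than to the JVM. A sketch of the agent command with the property placed before `-jar` (the function name `agent_command` is hypothetical; the java path, heap size, and jar name are taken from the command shown earlier in the thread):

```shell
#!/bin/sh
# Hypothetical helper: print the Jenkins agent command line with the
# PKCS11 workaround placed among the JVM options, before -jar.
# Usage: agent_command JNLP_URL SECRET
agent_command() {
    jnlp_url="$1"
    secret="$2"
    # JVM options (-D..., -Xmx...) must come before -jar;
    # options after the jar file are application arguments.
    printf '%s\n' "/opt/local/java/openjdk8/bin/java -Dsun.security.pkcs11.enable-solaris=false -Xmx128m -jar slave.jar -jnlpUrl $jnlp_url -secret $secret"
}
```

The function only prints the command so that it can be inspected or copied into a service definition; it does not start the agent.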

@jbergstroem
Member

First run with 15,16 here: https://ci.nodejs.org/job/node-test-commit-smartos/4584/

@jbergstroem
Member

This has been implemented on all hosts and is available in the playbooks in my refactor.

@misterdjules
Contributor Author

@jbergstroem Thank you very much for your work! Should this issue be closed?

@jbergstroem
Member

I guess we could, but seeing how we still need to land my PR it would be slightly misleading?

@misterdjules
Contributor Author

seeing how we still need to land my PR it would be slightly misleading

What PR are you referring to?

@gibfahn
Member

gibfahn commented Feb 2, 2017

I assume this one: #606

@misterdjules
Contributor Author

@gibfahn Thanks for the context!

Let's not close this issue until that PR is merged then.

@misterdjules
Contributor Author

#606 was merged a while ago, so closing. Thank you very much @jbergstroem!
