
test-worker-cleanexit-with-moduleload crashing on multiple platforms in CI #39036

Closed
Trott opened this issue Jun 15, 2021 · 26 comments
Comments

@Trott
Member

Trott commented Jun 15, 2021

  • Test: test-worker-cleanexit-with-moduleload
  • Platform: debian10-x64 and pi2-docker, likely others
  • Console Output:
00:26:11 not ok 714 parallel/test-worker-cleanexit-with-moduleload
00:26:11   ---
00:26:11   duration_ms: 6.955
00:26:11   severity: crashed
00:26:11   exitcode: -11
00:26:12   stack: |-
00:26:12   ...
11:27:12 not ok 2626 parallel/test-worker-cleanexit-with-moduleload
11:27:12   ---
11:27:12   duration_ms: 1.19
11:27:12   severity: crashed
11:27:12   exitcode: -11
11:27:12   stack: |-
11:27:12   ...
@Trott
Member Author

Trott commented Jun 15, 2021

@gireeshpunathil

@richardlau
Member

@Trott
Member Author

Trott commented Jul 9, 2021

@nodejs/workers

Although I guess this isn't exactly a worker threads test. From Gireesh's commit message introducing the test:

Execute many module loads in worker in a loop
while exiting from the main thread at arbitrary
execution points, and make sure that the workers
quiesce without crashing.

worker_threads are not necessarily the subject of
testing, those are used for easy simulation of
multi-thread scenarios.

@Trott
Member Author

Trott commented Jul 9, 2021

I can replicate locally (on macOS) by just repeating the test many times:

tools/test.py --repeat=1024 test/parallel/test-worker-cleanexit-with-moduleload.js
=== release test-worker-cleanexit-with-moduleload ===                   
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
=== release test-worker-cleanexit-with-moduleload ===                   
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
=== release test-worker-cleanexit-with-moduleload ===                   
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
=== release test-worker-cleanexit-with-moduleload ===                   
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
=== release test-worker-cleanexit-with-moduleload ===                   
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
[08:10|% 100|+ 1019|-   5]: Done 
$ 

@Trott
Member Author

Trott commented Jul 9, 2021

Here's the information about the crashed thread from the diagnostic log:

Thread 13 Crashed:
0   node                          	0x000000010d859610 node::EnabledDebugList::enabled(node::DebugCategory) const + 0 (debug_utils.h:66) [inlined]
1   node                          	0x000000010d859610 void node::Debug<>(node::AsyncWrap*, char const*) + 8 (debug_utils-inl.h:194) [inlined]
2   node                          	0x000000010d859610 node::worker::MessagePort::OnMessage(node::worker::MessagePort::MessageProcessingMode) + 48 (node_messaging.cc:728)
3   node                          	0x000000010e20015b uv__async_io + 379 (async.c:163)
4   node                          	0x000000010e2139ac uv__io_poll + 2028 (kqueue.c:375)
5   node                          	0x000000010e200651 uv_run + 417 (core.c:385)
6   node                          	0x000000010d7b8ac9 node::Environment::CleanupHandles() + 377 (env.cc:645)
7   node                          	0x000000010d7b8eca node::Environment::RunCleanup() + 314 (env.cc:677)
8   node                          	0x000000010d755f01 node::FreeEnvironment(node::Environment*) + 113 (environment.cc:372)
9   node                          	0x000000010d8b744e node::FunctionDeleter<node::Environment, &(node::FreeEnvironment(node::Environment*))>::operator()(node::Environment*) const + 8 (util.h:636) [inlined]
10  node                          	0x000000010d8b744e std::__1::unique_ptr<node::Environment, node::FunctionDeleter<node::Environment, &(node::FreeEnvironment(node::Environment*))> >::reset(node::Environment*) + 8 (memory:2623) [inlined]
11  node                          	0x000000010d8b744e node::worker::Worker::Run()::$_0::operator()() const + 68 (node_worker.cc:284) [inlined]
12  node                          	0x000000010d8b744e node::OnScopeLeaveImpl<node::worker::Worker::Run()::$_0>::~OnScopeLeaveImpl() + 68 (util.h:527) [inlined]
13  node                          	0x000000010d8b744e node::OnScopeLeaveImpl<node::worker::Worker::Run()::$_0>::~OnScopeLeaveImpl() + 68 (util.h:527) [inlined]
14  node                          	0x000000010d8b744e node::worker::Worker::Run() + 990 (node_worker.cc:350)
15  node                          	0x000000010d8baff2 node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::$_3::operator()(void*) const + 25 (node_worker.cc:596) [inlined]
16  node                          	0x000000010d8baff2 node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::$_3::__invoke(void*) + 50 (node_worker.cc:585)
17  libsystem_pthread.dylib       	0x00007fff205d18fc _pthread_start + 224
18  libsystem_pthread.dylib       	0x00007fff205cd443 thread_start + 15

Thread 13 crashed with X86 Thread State (64-bit):
  rax: 0x00007fb6058af480  rbx: 0x0000000000000000  rcx: 0x00000000058af4c0  rdx: 0x000070000fc091b0
  rdi: 0x00007fb603e658e0  rsi: 0x0000000000000000  rbp: 0x000070000fc091a0  rsp: 0x000070000fc090c0
   r8: 0x00000000000000c7   r9: 0x000070000fc09600  r10: 0x00007fb604008800  r11: 0x0000000000000246
  r12: 0x00007fb603e659b0  r13: 0x00007fb603e65948  r14: 0x000070000fc11af8  r15: 0x00007fb603e658e0
  rip: 0x000000010d859610  rfl: 0x0000000000000206  cr2: 0x000024d14b64c000
  
Logical CPU:     0
Error Code:      0x00000000
Trap Number:     221

Thread 13 instruction stream not available.

Thread 13 last branch register state not available.

@gireeshpunathil
Member

Reproduced on Linux bare metal with the latest build.

Here is my crash scenario:

(gdb) where
#0  0x00000000010ae5c3 in v8::internal::JSReceiver::GetCreationContext() ()
#1  0x0000000000d155c8 in v8::Object::GetCreationContext() ()
#2  0x0000000000b6cab3 in node::worker::MessagePort::OnMessage(node::worker::MessagePort::MessageProcessingMode) ()
#3  0x00000000015b37f6 in uv__async_io (loop=0x7f75d3ffea98, 
    w=<optimized out>, events=<optimized out>)
    at ../deps/uv/src/unix/async.c:163
#4  0x00000000015c6576 in uv__io_poll (loop=loop@entry=0x7f75d3ffea98, 
    timeout=<optimized out>) at ../deps/uv/src/unix/linux-core.c:462
#5  0x00000000015b4128 in uv_run (loop=0x7f75d3ffea98, mode=UV_RUN_ONCE)
    at ../deps/uv/src/unix/core.c:385
#6  0x0000000000aa9550 in node::Environment::CleanupHandles() ()
#7  0x0000000000ab67cb in node::Environment::RunCleanup() ()
#8  0x0000000000a668e9 in node::FreeEnvironment(node::Environment*) ()
#9  0x0000000000be8e91 in node::worker::Worker::Run() ()
#10 0x0000000000be93d8 in node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::{lambda(void*)#1}::_FUN(void*) ()
#11 0x00007f75f438814a in start_thread () from /lib64/libpthread.so.0
#12 0x00007f75f40b7dc3 in clone () from /lib64/libc.so.6
(gdb) 
(gdb) x/5i _ZN2v88internal10JSReceiver18GetCreationContextEv
   0x10ae5c0 <_ZN2v88internal10JSReceiver18GetCreationContextEv>:	
    mov    rdx,QWORD PTR [rdi]
=> 0x10ae5c3 <_ZN2v88internal10JSReceiver18GetCreationContextEv+3>:	
    mov    rcx,QWORD PTR [rdx-0x1]
   0x10ae5c7 <_ZN2v88internal10JSReceiver18GetCreationContextEv+7>:	
    mov    rax,QWORD PTR [rcx+0x1f]
   0x10ae5cb <_ZN2v88internal10JSReceiver18GetCreationContextEv+11>:	
    test   al,0x1
   0x10ae5cd <_ZN2v88internal10JSReceiver18GetCreationContextEv+13>:	
    jne    0x10ae620 <_ZN2v88internal10JSReceiver18GetCreationContextEv+96>
(gdb) i r rdi
rdi            0x7f75d3ffaf48      140144044650312

From this and the one @Trott captured earlier, it is evident that we are shutting down the worker (node::Environment::RunCleanup()), but libuv is seen actively executing callbacks (node::worker::MessagePort::OnMessage). Is this expected? Has anything changed in this area recently?

@addaleax
Member

@gireeshpunathil Can you figure out why it is crashing in GetCreationContext()? I think that’s the interesting bit here.

but libuv is seen actively executing callbacks (node::worker::MessagePort::OnMessage).

That would be surprising, yes, but I’m not sure if we should really consider it illegal (unless the uv_close() callback has already been invoked by libuv)

@gireeshpunathil
Member

JSReceiver receiver = *this;

@addaleax - the dereferencing is what is crashing, because the this pointer is not valid.

@gireeshpunathil
Member

sorry, I may be wrong here too; let me dig a little more.

@gireeshpunathil
Member

(gdb) x/2i 0x10ae5c0
// JSReceiver receiver = *this;
   0x10ae5c0 <v8::internal::JSReceiver::GetCreationContext>:	mov    rdx,QWORD PTR [rdi]
// Get the vft for the `receiver` object, crash
=> 0x10ae5c3 <v8::internal::JSReceiver::GetCreationContext+3>:	mov    rcx,QWORD PTR [rdx-0x1]

So it is not `this`, but the receiver object, that is corrupt:

(gdb) i r rdi
rdi            0x7fcae6bf5f48      140509431422792
(gdb) x/2x 0x7fcae6bf5f48
0x7fcae6bf5f48:	0x00070001	0x00050004
(gdb) x/w 0x0005000400070000
0x5000400070000:	Cannot access memory at address 0x5000400070000

@addaleax - does this give any hint?

@addaleax
Member

@gireeshpunathil Well – it probably means that the JS object associated with the MessagePort has already been garbage collected at that point, or the handle we hold to it has been destroyed (which would happen before the former anyway). But that only happens in the uv_close() callback, so I guess the important question is: Is libuv sending messages to an uv_async_t if uv_close() has already been called on it?
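For readers less familiar with the libuv side of this, the lifecycle in question looks roughly like the following standalone sketch (plain libuv, not Node.js code; on_async and on_close are made-up names). Between the uv_close() call and the close callback the handle is merely "closing", and the open question above is whether a pending wakeup may still invoke the async callback inside that window.

#include <uv.h>
#include <cstdio>

static void on_async(uv_async_t* handle) {
  // The question above: can this still run for a handle on which uv_close()
  // has already been called, i.e. while it is closing but not yet closed?
  std::printf("async callback; closing=%d\n",
              uv_is_closing(reinterpret_cast<uv_handle_t*>(handle)));
}

static void on_close(uv_handle_t* /* handle */) {
  // Only from this point on is the handle fully closed; after this, libuv
  // must not deliver on_async() for it anymore, and freeing it is safe.
  std::printf("close callback\n");
}

int main() {
  uv_loop_t loop;
  uv_loop_init(&loop);

  uv_async_t async;
  uv_async_init(&loop, &async, on_async);

  uv_async_send(&async);  // a wakeup is now pending
  uv_close(reinterpret_cast<uv_handle_t*>(&async), on_close);  // handle is now "closing"

  uv_run(&loop, UV_RUN_DEFAULT);
  uv_loop_close(&loop);
  return 0;
}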

@gireeshpunathil
Member

If you are referring to uv_async_send or uv__async_send, those do not check the handle status and assume that it is good. But is it safe to rely on these primitives to manage such a race? Or should we do such a check in the first async callback (in this case node::worker::MessagePort::OnMessage) and abandon the propagation if we are in the cleanup phase?

Or are you thinking that this is not a worker-specific issue, but a wider issue with libuv itself?

@addaleax
Member

addaleax commented Jul 16, 2021

If you are referring to uv_async_send or uv__async_send, those do not check the handle status and assume that it is good.

I think this is expected.

But is it safe to rely on these primitives to manage such a race?

I don’t understand this question, tbh. Which race? Which primitives?

Or should we do such a check in the first async callback (in this case node::worker::MessagePort::OnMessage) and abandon the propagation if we are in the cleanup phase?

Again, not sure what “such a check” or “first async callback” refer to, but for the second part, no, we should rely on the fact that we should not be receiving callbacks from libuv after having fully closed the handle, because if that wasn’t true that would be unsafe behavior on libuv’s end anyway.

Or are you thinking that this is not a worker-specific issue, but a wider issue with libuv itself?

Maybe? That’s what my question above was focused on finding out.

@gireeshpunathil
Member

I don’t understand this question, tbh. Which race? Which primitives?

Let me explain my understanding of the issue. I am not claiming complete comprehension of it, only trying to make sense of the call stack at the failing site.

In Environment::CleanupHandles() we initiate the close of the handles. I say initiate because uv_close() is asynchronous.

node/src/env.cc

Lines 635 to 636 in 1bbe66f

for (HandleWrap* handle : handle_wrap_queue_)
  handle->Close();

This makes sure any subsequent message handling will not progress, as we don't send anything if the handles are closed or closing:

node/src/node_messaging.cc

Lines 625 to 628 in bcf73d6

void MessagePort::TriggerAsync() {
  if (IsHandleClosing()) return;
  CHECK_EQ(uv_async_send(&async_), 0);
}

Then we run uv_run(). There, we handle events, potentially on closed/closing handles (due to the asynchrony of uv_close()), proceed into callbacks on stale state, and crash.

Does that sound reasonable to you?
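To illustrate the sequence just described, and the kind of early-return guard being proposed for the first async callback, here is a self-contained sketch against plain libuv (the names, e.g. on_port_wakeup, are made up; this is the proposal under discussion, not the fix that was eventually adopted):

#include <uv.h>
#include <cstdio>

static void on_port_wakeup(uv_async_t* handle) {
  // Proposed check: if cleanup has already initiated uv_close() on this
  // handle, abandon message processing instead of touching state that is
  // going away.
  if (uv_is_closing(reinterpret_cast<uv_handle_t*>(handle))) {
    std::printf("handle is closing; dropping wakeup\n");
    return;
  }
  std::printf("delivering messages\n");
}

int main() {
  uv_loop_t loop;
  uv_loop_init(&loop);

  uv_async_t port_async;
  uv_async_init(&loop, &port_async, on_port_wakeup);

  // A wakeup arrives (e.g. the other thread posts a message)...
  uv_async_send(&port_async);

  // ...and then cleanup starts: like Environment::CleanupHandles(), we only
  // *initiate* the close here; uv_close() completes asynchronously.
  uv_close(reinterpret_cast<uv_handle_t*>(&port_async), nullptr);

  // Drain the loop until the close has completed, as CleanupHandles() does
  // with uv_run(UV_RUN_ONCE); the pending wakeup may be dispatched in here.
  while (uv_loop_alive(&loop))
    uv_run(&loop, UV_RUN_ONCE);

  uv_loop_close(&loop);
  return 0;
}

Whether such a check belongs in the callback at all, or whether libuv should simply never deliver wakeups for fully closed handles, is what the next comments try to settle.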

@addaleax
Member

@gireeshpunathil Yes – but it should actually be fine for a message to be received on a closing handle, as long as it’s not fully closed yet.

@gireeshpunathil
Member

the important question is: Is libuv sending messages to an uv_async_t if uv_close() has already been called on it?

diff --git a/deps/uv/src/unix/async.c b/deps/uv/src/unix/async.c
index e1805c3237..24965a3c84 100644
--- a/deps/uv/src/unix/async.c
+++ b/deps/uv/src/unix/async.c
@@ -160,6 +160,7 @@ static void uv__async_io(uv_loop_t* loop, uv__io_t* w, unsigned int events) {
     if (h->async_cb == NULL)
       continue;
 
+    assert(!(h->flags & UV_HANDLE_CLOSED));
     h->async_cb(h);
   }
 }

@addaleax - does the assertion above answer your question? If so, the answer is no: I got core dumps again with the same crash context, with the assertion passing. If not, let me know if you have another proposal that I can check.

@gireeshpunathil
Member

ok, I even tried this:

diff --git a/deps/uv/src/unix/async.c b/deps/uv/src/unix/async.c
index e1805c3237..3f16b37666 100644
--- a/deps/uv/src/unix/async.c
+++ b/deps/uv/src/unix/async.c
@@ -160,6 +160,8 @@ static void uv__async_io(uv_loop_t* loop, uv__io_t* w, unsigned int events) {
     if (h->async_cb == NULL)
       continue;
 
+    assert(!uv_is_closing(h));
     h->async_cb(h);
   }
 }

and still it does not hit the assertion, yet it crashes in the same way as before.

@addaleax
Member

@gireeshpunathil Yeah, that does answer it – but it’s worrying. How hard is it to reproduce this for you? Would you be able to make the test crash while running under valgrind?

@gireeshpunathil
Member

@addaleax - sure. Usually it crashes about once in ten runs, so not hard at all. I will try with valgrind.

@gireeshpunathil
Member

@addaleax - we got valgrind output for the bad sequence; here it is.

@gireeshpunathil
Member

@addaleax - I tried doing a bisect, but got inconclusive and contradictory results (mostly because the failure is not deterministic), so I am deeming it untrustworthy and not posting it here.

I am taking another approach of unwinding the history of src/node_messaging.cc and doing manual trial and error. Right now I am suspecting bcf73d6, but I should admit that I did not follow the valgrind report, other than that we destroyed transferable objects after deserializing messages, and those objects were accessed in the callbacks.

addaleax added a commit to addaleax/node that referenced this issue Jul 19, 2021
When all strong `BaseObjectPtr`s to a `HandleWrap` are gone, we
should not delete the `HandleWrap` outright, but instead close
it and then delete it only once the libuv close callback has been
called.

Based on the valgrind output from the issue below, this has a
good chance of fixing it.

Fixes: nodejs#39036
@addaleax
Member

other than that we destroyed transferable objects after deserializing messages, and those objects were accessed in the callbacks.

@gireeshpunathil Thanks, the valgrind report looks very helpful. In particular, we can tell from the valgrind report that the problem is that a MessagePort object was deleted while it was being transferred to another MessagePort that was unable to deserialize the message as a whole (because the worker was shutting down, I presume). That should not be happening – the transferred MessagePort should be closed, not outright deleted.

I’ve opened a potential fix based on this in #39441 (but haven’t confirmed that it actually fixes this issue here). Thank you for the helpful debugging!

@gireeshpunathil
Member

@addaleax - that is very nice, thanks!

So what you are saying is that a MessagePort should not be deleted outright if the transfer fails; what if the transfer succeeds? Does the new MessagePort that is created (the deserialized copy) make the old one safe to delete? What causes the stale objects to escape into libuv? I still have a gap in my understanding here; it would be great if you could fill it in.

@addaleax
Member

So what you are saying is that a MessagePort should not be deleted outright if the transfer fails

👍

what if the transfer succeeds? Does the new MessagePort that is created (the deserialized copy) make the old one safe to delete?

There’s no copy; that’s the point of transferring (as opposed to cloning). :) If the transfer succeeds, there’s a new MessagePort object on the receiving side – but that’s fine, because then the receiving side can manage its lifetime like any other MessagePort.

What causes the stale objects to escape into libuv?

I’m not sure what you mean by ‘escape’ – the receiving side creates a MessagePort object, which has an embedded uv_async_t, which is initialized and registered with libuv. The problem is that there is no uv_close() call that would tell libuv to un-register it again when it is clear that receiving the MessagePort will fail.

I still have a gap in my understanding here; it would be great if you could fill it in.

I guess it might help to draw out the timeline of events in the crashing case:

  1. Thread A and B are connected by a messaging channel, and thread A also owns another messaging channel consisting of MessagePorts X and Y
  2. Thread A sends X instance to B over their common channel
  3. Thread B starts receiving the message, and creates a MessagePort instance because the deserialized message contains one, taking over the state from X
  4. Thread B is terminated by thread A
  5. Thread A sends a message from Y to X
  6. Thread B sees that deserialization is failing because it’s terminating, and starts deleting all objects in the received message that would otherwise leak memory
  7. Thread B deletes the MessagePort it created in step 3 (but does not close it)
  8. Thread B sees the message sent in step 5 on the event loop, and tries to deliver it to X
  9. Boom
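
To make the fix direction concrete, here is a small self-contained sketch of the "close first, delete only from the libuv close callback" pattern that the commit message above describes, contrasted with the outright delete of step 7. The class and its names (PortLike, CloseAndDelete) are invented for illustration; this is not the actual HandleWrap code from #39441.

#include <uv.h>
#include <cstdio>

class PortLike {
 public:
  explicit PortLike(uv_loop_t* loop) {
    uv_async_init(loop, &async_, [](uv_async_t* a) {
      // Safe to touch the owning object here, because it is only freed from
      // the close callback below, never while the handle is still registered.
      std::printf("wakeup for port %p\n", a->data);
    });
    async_.data = this;
  }

  // The buggy path effectively did the equivalent of `delete port;` while the
  // embedded uv_async_t was still registered with the loop, leaving libuv
  // with a dangling handle that a pending wakeup could still reach.

  // The fixed path: initiate the close and finish destruction only once libuv
  // confirms (via the close callback) that it is done with the handle.
  void CloseAndDelete() {
    uv_close(reinterpret_cast<uv_handle_t*>(&async_), [](uv_handle_t* h) {
      delete static_cast<PortLike*>(h->data);  // only now is deletion safe
    });
  }

 private:
  uv_async_t async_;
};

int main() {
  uv_loop_t loop;
  uv_loop_init(&loop);

  PortLike* port = new PortLike(&loop);
  port->CloseAndDelete();         // instead of `delete port;`

  uv_run(&loop, UV_RUN_DEFAULT);  // runs the close callback, which frees port
  uv_loop_close(&loop);
  return 0;
}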

@gireeshpunathil
Member

ok, thanks @addaleax for the detailed explanation, this is really useful!

targos pushed a commit that referenced this issue Jul 25, 2021
When all strong `BaseObjectPtr`s to a `HandleWrap` are gone, we
should not delete the `HandleWrap` outright, but instead close
it and then delete it only once the libuv close callback has been
called.

Based on the valgrind output from the issue below, this has a
good chance of fixing it.

Fixes: #39036

PR-URL: #39441
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
richardlau pushed a commit that referenced this issue Jul 29, 2021
When all strong `BaseObjectPtr`s to a `HandleWrap` are gone, we
should not delete the `HandleWrap` outright, but instead close
it and then delete it only once the libuv close callback has been
called.

Based on the valgrind output from the issue below, this has a
good chance of fixing it.

Fixes: #39036

PR-URL: #39441
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
BethGriggs pushed a commit that referenced this issue Jul 29, 2021
When all strong `BaseObjectPtr`s to a `HandleWrap` are gone, we
should not delete the `HandleWrap` outright, but instead close
it and then delete it only once the libuv close callback has been
called.

Based on the valgrind output from the issue below, this has a
good chance of fixing it.

Fixes: #39036

PR-URL: #39441
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
targos pushed a commit that referenced this issue Sep 4, 2021
When all strong `BaseObjectPtr`s to a `HandleWrap` are gone, we
should not delete the `HandleWrap` outright, but instead close
it and then delete it only once the libuv close callback has been
called.

Based on the valgrind output from the issue below, this has a
good chance of fixing it.

Fixes: #39036

PR-URL: #39441
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>