
async_tcp_client: remove callbacks if connection was not closed #35410

Merged: 8 commits into envoyproxy:main on Aug 1, 2024

Conversation

@ohadvano (Contributor) commented Jul 24, 2024

Commit Message: when the AsyncTcpClient is destroyed while it still has an active client connection, Envoy crashes: during destruction of the AsyncTcpClient instance, the ClientConnection object is also destroyed, which causes raiseEvent to call back into the AsyncTcpClient mid-destruction.
Additional Description: Caught with the following stack trace:

```
Caught Segmentation fault, suspect faulting address 0x0
Backtrace (use tools/stack_decode.py to get line numbers):
Envoy version: ee8c765a07037033766ea556c032120b497152b3/1.27.0/Clean/RELEASE/BoringSSL
#0: __restore_rt [0x7d80ab903420]
#1: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLoggerImpl::onEvent() [0x58313528746b]
#2: Envoy::Tcp::AsyncTcpClientImpl::onEvent() [0x5831359da00a]
#3: Envoy::Network::ConnectionImplBase::raiseConnectionEvent() [0x583135f0521d]
#4: Envoy::Network::ConnectionImpl::raiseEvent() [0x583135e9fed9]
#5: Envoy::Network::ConnectionImpl::closeSocket() [0x583135e9f90c]
#6: Envoy::Network::ConnectionImpl::close() [0x583135e9e54c]
#7: Envoy::Network::ConnectionImpl::~ConnectionImpl() [0x583135e9de5c]
#8: Envoy::Network::ClientConnectionImpl::~ClientConnectionImpl() [0x5831355fd25e]
#9: Envoy::Tcp::AsyncTcpClientImpl::~AsyncTcpClientImpl() [0x5831359da247]
#10: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLoggerImpl::~FluentdAccessLoggerImpl() [0x583135289350]
#11: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLog::ThreadLocalLogger::~ThreadLocalLogger() [0x58313528adbf]
#12: Envoy::ThreadLocal::InstanceImpl::shutdownThread() [0x58313560373a]
#13: Envoy::Server::WorkerImpl::threadRoutine() [0x583135630c0a]
#14: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x5831364e88d5]
#15: start_thread [0x7d80ab8f7609]
```

Risk Level: low
Testing: unit tests
Docs Changes: none
Release Notes: none
Platform Specific Features: none

Signed-off-by: Ohad Vano <ohadvano@gmail.com>
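
For readers unfamiliar with the failure mode, here is a self-contained sketch of the destruction-order bug and the remove-callbacks fix. The classes are simplified stand-ins, not Envoy's actual AsyncTcpClientImpl and ClientConnection; only the ordering matters.

```cpp
#include <algorithm>
#include <iostream>
#include <memory>
#include <vector>

// Simplified stand-ins for Envoy's Network::ConnectionCallbacks and
// ClientConnection (real signatures differ).
struct ConnectionCallbacks {
  virtual ~ConnectionCallbacks() = default;
  virtual void onEvent() = 0;
};

struct Connection {
  std::vector<ConnectionCallbacks*> callbacks_;
  // Mirrors ConnectionImpl::~ConnectionImpl() -> closeSocket() ->
  // raiseEvent(): tearing down a live connection raises a close event.
  ~Connection() {
    for (ConnectionCallbacks* cb : callbacks_) {
      cb->onEvent();
    }
  }
  void removeCallbacks(ConnectionCallbacks& cb) {
    callbacks_.erase(std::remove(callbacks_.begin(), callbacks_.end(), &cb),
                     callbacks_.end());
  }
};

struct AsyncClient : public ConnectionCallbacks {
  std::unique_ptr<Connection> connection_{std::make_unique<Connection>()};
  AsyncClient() { connection_->callbacks_.push_back(this); }
  // The fix: deregister before the connection_ member is destroyed, so its
  // destructor cannot call back into this partially destroyed object.
  ~AsyncClient() override {
    if (connection_ != nullptr) {
      connection_->removeCallbacks(*this);
    }
  }
  void onEvent() override { std::cout << "close event\n"; }
};

int main() {
  // Without removeCallbacks() in ~AsyncClient(), ~Connection() would invoke
  // onEvent() on an object whose derived part is already gone: the crash in
  // the backtrace above.
  AsyncClient client;
  return 0;
}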

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #35410 was opened by ohadvano.


@ohadvano ohadvano marked this pull request as ready for review July 24, 2024 10:51
@ohadvano (Contributor Author)

/assign @botengyao

@botengyao (Member) left a comment

Overall LGTM, and thanks for the fix!

Do you mind also adding an integration test to tcp_async_client_integration_test.cc?

@ohadvano (Contributor Author)

I don't think this is testable in integration tests, but if you find a way, please let me know.
The problem is that in integration tests, FakeUpstream is always destroyed before listeners and filters. So when I try this in an integration test, FakeUpstream is destroyed first, then onEvent is called with RemoteClose while the AsyncTcpClient is still active. Only after that is the AsyncTcpClient destroyed, and by then connection_ is already null, as expected.

@botengyao (Member)

The existing tcp async client integration test uses a fake filter, and the lifetime of the async client in that test filter matches the downstream connection, which has the longest lifetime. To simulate this scenario, we could create a customized usage that owns this async client and destroys it first; one example is open_telemetry/access_log_integration_test.cc. It needs some work, and I am fine without this.

I think this can be simulated in unit tests with control of the lifetime.

@ohadvano (Contributor Author) commented Jul 24, 2024

one example is open_telemetry/access_log_integration_test.cc, it needs some work and I am fine without this.

Are you referring to the close() and cleanup() sections? If so, that is not enough. It would just do a regular close of the upstream connection, and there won't be a crash.
To really simulate the issue, the AsyncTcpClient itself needs to be deleted before FakeUpstream is destroyed, while it still has an active upstream connection.

I think this can be simulated by unit tests with the control of the life time.

The changes I made to the unit tests will test this. Without the fix, many of the unit tests in the file fail (those that don't explicitly call close() on the client).

@botengyao (Member) left a comment

Are you referring to the close() and cleanup() sections
To really simulate the issue, the AsyncTcpClient itself needs to be deleted before FakeUpstream is being destroyed, while still having an active upstream connection

Nope; like you described, we can create an object that uses this async client and manually destroy it first.

LGTM, and deferred to @ggreenway for another pass.

@botengyao (Member)

/assign @ggreenway

```cpp
    connection_->removeConnectionCallbacks(*this);
  }

  close(Network::ConnectionCloseType::NoFlush);
```
@botengyao (Member)

Could you elaborate on the reason the close() is added here? I don't see it in your initial fix, and a user could close the connection with another close type.

@ohadvano (Contributor Author)

The reason is that if we don't close it, tests fail because of the ASSERT in:

```cpp
ConnectionImpl::~ConnectionImpl() {
  ASSERT(!socket_->isOpen() && delayed_close_timer_ == nullptr,
         "ConnectionImpl was unexpectedly torn down without being closed.");
  // In general we assume that owning code has called close() previously to the destructor being
  // run. This generally must be done so that callbacks run in the correct context (vs. deferred
  // deletion). Hence the assert above. However, call close() here just to be completely sure that
  // the fd is closed and make it more likely that we crash from a bad close callback.
  close(ConnectionCloseType::NoFlush);
}
```

The comment in that code section explains the expectation of the owner of the ConnectionImpl: "In general we assume that owning code has called close() previously to the destructor being run."
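
Put together with the diff context quoted above, the destructor then looks roughly like this (a sketch assembled from the quoted fragments; the exact surrounding lines are assumptions, not the merged diff):

```cpp
AsyncTcpClientImpl::~AsyncTcpClientImpl() {
  if (connection_) {
    // Deregister first so connection teardown cannot raise events back
    // into this object while it is being destroyed.
    connection_->removeConnectionCallbacks(*this);
  }
  // Satisfy ConnectionImpl's expectation that owners call close() before
  // the connection's destructor runs (see the ASSERT quoted above).
  close(Network::ConnectionCloseType::NoFlush);
}
```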

@botengyao (Member) commented Jul 29, 2024

I feel this is to mitigate wrong usage of this client in the test filter to simulate the callback issue, and ideally this should not happen; it needs to be fixed on the usage side. FWIW, the original version of the fix LGTM, and we can change how we test it or remove the newly added tests; either works for me.

@ohadvano (Contributor Author)

The changes differ in the close call. Is there a reason why we shouldn't include it, regardless of the testing?
Removing the newly added test would mean no integration test coverage for this case.

@botengyao (Member) commented Jul 30, 2024

I think the reason you see the non-close error is that we force-release the async client before calling any close, which is the wrong usage, and that is also what the assertion there catches. If we really want to keep the close here to cover this should-not-happen case, adding a boolean guard for close may be less risky. I am more worried that if a caller calls close(FlushWriteAndDelay) and then follows with a close(NoFlush) while the connection is still in the deferred deletion list, the previous close will be overridden.

@botengyao (Member) left a comment

Thanks for fixing this!

@botengyao (Member)

/wait

@ohadvano (Contributor Author)

/retest

@botengyao (Member) left a comment

Thanks, this looks neat; overall LGTM modulo one concern about the recent change.

@ohadvano (Contributor Author)

@botengyao thanks for the review. I think what you're saying makes sense. The close is essentially a wrapper around if (connection_); however, if I understand correctly, there's a scenario where the owner calls close but the connection_ object is still valid, because onEvent has not yet fired the close event, so the connection_ object is yet to be deferred-deleted.
In that case, if we get to the dtor, close would be called twice.

My suggestion is modifying close here:

```cpp
void AsyncTcpClientImpl::close(Network::ConnectionCloseType type) {
```

So the condition would be if (connection_ && !closing_), or some similar concept, wdyt?
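
A minimal sketch of that guard (closing_ is the boolean suggested in this thread; whether the merged change looks exactly like this is an assumption):

```cpp
// Hypothetical shape of the guarded close. closing_ would be a bool member
// of AsyncTcpClientImpl, defaulting to false. It prevents a later
// close(NoFlush), e.g. from the destructor, from overriding an in-flight
// close(FlushWriteAndDelay) while the connection sits in the deferred
// deletion list.
void AsyncTcpClientImpl::close(Network::ConnectionCloseType type) {
  if (connection_ && !closing_) {
    closing_ = true;
    connection_->close(type);
  }
}
```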

@botengyao (Member)

So the condition would be if (connection_ && !closing_), or some similar concept, wdyt?

@ohadvano, sgtm, a close boolean works, thanks!

@ohadvano (Contributor Author) commented Aug 1, 2024

ping @ggreenway @botengyao

@botengyao (Member) left a comment

LGTM, thanks!

@ggreenway (Contributor) left a comment

LGTM

@ggreenway ggreenway merged commit 0e03184 into envoyproxy:main Aug 1, 2024
51 checks passed
@ohadvano ohadvano deleted the fix_async_tcp_client branch August 2, 2024 10:56
martinduke pushed a commit to martinduke/envoy that referenced this pull request Aug 8, 2024
asingh-g pushed a commit to asingh-g/envoy that referenced this pull request Aug 20, 2024