
async_tcp_client: remove callbacks if connection was not closed #35410

Merged: 8 commits into envoyproxy:main on Aug 1, 2024

Conversation

@ohadvano (Contributor) commented Jul 24, 2024

Commit Message: when the AsyncTcpClient is destroyed while it still has an active client connection, Envoy crashes: during destruction of the AsyncTcpClient instance, the ClientConnection object is also destroyed, which causes raiseEvent to call back into the AsyncTcpClient mid-destruction.
Additional Description: Caught with the following stack trace:

```
Caught Segmentation fault, suspect faulting address 0x0
Backtrace (use tools/stack_decode.py to get line numbers):
Envoy version: ee8c765a07037033766ea556c032120b497152b3/1.27.0/Clean/RELEASE/BoringSSL
#0: __restore_rt [0x7d80ab903420]
#1: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLoggerImpl::onEvent() [0x58313528746b]
#2: Envoy::Tcp::AsyncTcpClientImpl::onEvent() [0x5831359da00a]
#3: Envoy::Network::ConnectionImplBase::raiseConnectionEvent() [0x583135f0521d]
#4: Envoy::Network::ConnectionImpl::raiseEvent() [0x583135e9fed9]
#5: Envoy::Network::ConnectionImpl::closeSocket() [0x583135e9f90c]
#6: Envoy::Network::ConnectionImpl::close() [0x583135e9e54c]
#7: Envoy::Network::ConnectionImpl::~ConnectionImpl() [0x583135e9de5c]
#8: Envoy::Network::ClientConnectionImpl::~ClientConnectionImpl() [0x5831355fd25e]
#9: Envoy::Tcp::AsyncTcpClientImpl::~AsyncTcpClientImpl() [0x5831359da247]
#10: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLoggerImpl::~FluentdAccessLoggerImpl() [0x583135289350]
#11: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLog::ThreadLocalLogger::~ThreadLocalLogger() [0x58313528adbf]
#12: Envoy::ThreadLocal::InstanceImpl::shutdownThread() [0x58313560373a]
#13: Envoy::Server::WorkerImpl::threadRoutine() [0x583135630c0a]
#14: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x5831364e88d5]
#15: start_thread [0x7d80ab8f7609]
```

Risk Level: low
Testing: unit tests
Docs Changes: none
Release Notes: none
Platform Specific Features: none

Signed-off-by: Ohad Vano <ohadvano@gmail.com>
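
For readers unfamiliar with the failure mode, here is a self-contained sketch of the destruction-order bug and the remove-callbacks fix. The classes are simplified stand-ins, not Envoy's actual AsyncTcpClientImpl and ClientConnection; only the ordering matters.

```cpp
#include <algorithm>
#include <iostream>
#include <memory>
#include <vector>

// Simplified stand-ins for Envoy's Network::ConnectionCallbacks and
// ClientConnection (real signatures differ).
struct ConnectionCallbacks {
  virtual ~ConnectionCallbacks() = default;
  virtual void onEvent() = 0;
};

struct Connection {
  std::vector<ConnectionCallbacks*> callbacks_;
  // Mirrors ConnectionImpl::~ConnectionImpl() -> closeSocket() ->
  // raiseEvent(): tearing down a live connection raises a close event.
  ~Connection() {
    for (ConnectionCallbacks* cb : callbacks_) {
      cb->onEvent();
    }
  }
  void removeCallbacks(ConnectionCallbacks& cb) {
    callbacks_.erase(std::remove(callbacks_.begin(), callbacks_.end(), &cb),
                     callbacks_.end());
  }
};

struct AsyncClient : public ConnectionCallbacks {
  std::unique_ptr<Connection> connection_{std::make_unique<Connection>()};
  AsyncClient() { connection_->callbacks_.push_back(this); }
  // The fix: deregister before the connection_ member is destroyed, so its
  // destructor cannot call back into this partially destroyed object.
  ~AsyncClient() override {
    if (connection_ != nullptr) {
      connection_->removeCallbacks(*this);
    }
  }
  void onEvent() override { std::cout << "close event\n"; }
};

int main() {
  // Without removeCallbacks() in ~AsyncClient(), ~Connection() would invoke
  // onEvent() on an object whose derived part is already gone: the crash in
  // the backtrace above.
  AsyncClient client;
  return 0;
}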

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #35410 was opened by ohadvano.


@ohadvano ohadvano marked this pull request as ready for review July 24, 2024 10:51
@ohadvano (Contributor Author)

/assign @botengyao

@botengyao (Member) left a comment

Overall LGTM, and thanks for the fix!

Do you mind also adding an integration test to tcp_async_client_integration_test.cc?

@ohadvano (Contributor Author)

I don't think this is testable in integration tests, but if you find a way, please let me know.
The problem is that in integration tests, FakeUpstream is always destroyed before listeners and filters. So when I try this in an integration test, FakeUpstream is destroyed first, then onEvent is called with RemoteClose while the AsyncTcpClient is still active. Only after that is the AsyncTcpClient destroyed, and by then connection_ is already null, as expected.

@botengyao (Member)

The existing tcp async client integration test uses a fake filter, and the lifetime of the async client in that test filter matches the downstream connection, which has the longest lifetime. To simulate this scenario, we could create a customized usage that owns this async client and destroys it first; one example is open_telemetry/access_log_integration_test.cc. It needs some work, and I am fine without this.

I think this can be simulated in unit tests with control of the lifetime.

@ohadvano (Contributor Author) commented Jul 24, 2024

one example is open_telemetry/access_log_integration_test.cc, it needs some work and I am fine without this.

Are you referring to the close() and cleanup() sections? If so, that is not enough. It would just do a regular close of the upstream connection, and there won't be a crash.
To really simulate the issue, the AsyncTcpClient itself needs to be deleted before FakeUpstream is destroyed, while it still has an active upstream connection.

I think this can be simulated by unit tests with the control of the life time.

The changes I made to the unit tests will test this. Without the fix, many of the unit tests in the file fail (those that don't explicitly call close() on the client).

@botengyao (Member) left a comment

Are you referring to the close() and cleanup() sections
To really simulate the issue, the AsyncTcpClient itself needs to be deleted before FakeUpstream is being destroyed, while still having an active upstream connection

Nope; like you described, we can create an object that uses this async client and manually destroy it first.

LGTM, and deferred to @ggreenway for another pass.

@botengyao (Member)

/assign @ggreenway

```cpp
    connection_->removeConnectionCallbacks(*this);
  }

  close(Network::ConnectionCloseType::NoFlush);
```
@botengyao (Member)

Could you elaborate on the reason the close() is added here? I don't see it in your initial fix, and a user could close the connection with another close type.

@ohadvano (Contributor Author)

The reason is that if we don't close it, tests fail because of the ASSERT in:

```cpp
ConnectionImpl::~ConnectionImpl() {
  ASSERT(!socket_->isOpen() && delayed_close_timer_ == nullptr,
         "ConnectionImpl was unexpectedly torn down without being closed.");
  // In general we assume that owning code has called close() previously to the destructor being
  // run. This generally must be done so that callbacks run in the correct context (vs. deferred
  // deletion). Hence the assert above. However, call close() here just to be completely sure that
  // the fd is closed and make it more likely that we crash from a bad close callback.
  close(ConnectionCloseType::NoFlush);
}
```

The comment in that code section explains the expectation of the owner of the ConnectionImpl: "In general we assume that owning code has called close() previously to the destructor being run."
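
Put together with the diff context quoted above, the destructor then looks roughly like this (a sketch assembled from the quoted fragments; the exact surrounding lines are assumptions, not the merged diff):

```cpp
AsyncTcpClientImpl::~AsyncTcpClientImpl() {
  if (connection_) {
    // Deregister first so connection teardown cannot raise events back
    // into this object while it is being destroyed.
    connection_->removeConnectionCallbacks(*this);
  }
  // Satisfy ConnectionImpl's expectation that owners call close() before
  // the connection's destructor runs (see the ASSERT quoted above).
  close(Network::ConnectionCloseType::NoFlush);
}
```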

@botengyao (Member) commented Jul 29, 2024

I feel this is to mitigate wrong usage of this client in the test filter to simulate the callback issue, and ideally this should not happen; it needs to be fixed on the usage side. FWIW, the original version of the fix LGTM, and we can change how we test it or remove the newly added tests; either works for me.

@ohadvano (Contributor Author)

The changes differ in the close call. Is there a reason why we shouldn't include it, regardless of the testing?
Removing the newly added test would mean no integration test coverage for this case.

@botengyao (Member) commented Jul 30, 2024

I think the reason you see the non-close error is that we force-release the async client before calling any close, which is the wrong usage, and that is also what the assertion there catches. If we really want to keep the close here to cover this should-not-happen case, adding a boolean guard for close may be less risky. I am more worried that if a caller calls close(FlushWriteAndDelay) and then follows with a close(NoFlush) while the connection is still in the deferred deletion list, the previous close will be overridden.

@botengyao (Member) left a comment

Thanks for fixing this!

@botengyao (Member)

/wait

@ohadvano (Contributor Author)

/retest

@botengyao (Member) left a comment

Thanks, this looks neat; overall LGTM modulo one concern about the recent change.

@ohadvano (Contributor Author)

@botengyao thanks for the review. I think what you're saying makes sense. The close is essentially a wrapper around if (connection_); however, if I understand correctly, there's a scenario where the owner calls close but the connection_ object is still valid, because onEvent has not yet fired the close event, so the connection_ object is yet to be deferred-deleted.
In that case, if we get to the dtor, close would be called twice.

My suggestion is modifying close here:

```cpp
void AsyncTcpClientImpl::close(Network::ConnectionCloseType type) {
```

So the condition would be if (connection_ && !closing_), or some similar concept, wdyt?
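
A minimal sketch of that guard (closing_ is the boolean suggested in this thread; whether the merged change looks exactly like this is an assumption):

```cpp
// Hypothetical shape of the guarded close. closing_ would be a bool member
// of AsyncTcpClientImpl, defaulting to false. It prevents a later
// close(NoFlush), e.g. from the destructor, from overriding an in-flight
// close(FlushWriteAndDelay) while the connection sits in the deferred
// deletion list.
void AsyncTcpClientImpl::close(Network::ConnectionCloseType type) {
  if (connection_ && !closing_) {
    closing_ = true;
    connection_->close(type);
  }
}
```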

@botengyao (Member)

So the condition would be if (connection_ && !closing_), or some similar concept, wdyt?

@ohadvano, sgtm, a close boolean works, thanks!

@ohadvano (Contributor Author) commented Aug 1, 2024

ping @ggreenway @botengyao

@botengyao (Member) left a comment

LGTM, thanks!

@ggreenway (Contributor) left a comment

LGTM

@ggreenway ggreenway merged commit 0e03184 into envoyproxy:main Aug 1, 2024
51 checks passed
@ohadvano ohadvano deleted the fix_async_tcp_client branch August 2, 2024 10:56
martinduke pushed a commit to martinduke/envoy that referenced this pull request Aug 8, 2024
asingh-g pushed a commit to asingh-g/envoy that referenced this pull request Aug 20, 2024