[close #596] fix tls reloading encounter slow thread npe #597

Merged
27 commits merged on May 18, 2022

iosmanthus
Member

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>

What problem does this PR solve?

Issue Number: close #596

Problem Description:

getChannel can throw an NPE, because a slow thread might encounter a null if the cert is unmodified.

What is changed and how does it work?

Add an initial state for SslContextBuilder

Check List for Tests

This PR has been tested by at least one of the following methods:

  • Unit test

Related changes

  • Need to cherry-pick the release branch
  • Need to update the documentation

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>

codecov bot commented Apr 22, 2022

Codecov Report

Merging #597 (1e077c2) into master (3163793) will increase coverage by 0.31%.
The diff coverage is 67.11%.

@@             Coverage Diff              @@
##             master     #597      +/-   ##
============================================
+ Coverage     33.82%   34.13%   +0.31%     
- Complexity     1354     1371      +17     
============================================
  Files           270      270              
  Lines         17174    17244      +70     
  Branches       1956     1958       +2     
============================================
+ Hits           5809     5887      +78     
+ Misses        10751    10746       -5     
+ Partials        614      611       -3     
Impacted Files Coverage Δ
src/main/java/org/tikv/common/ConfigUtils.java 0.00% <ø> (ø)
src/main/java/org/tikv/common/TiSession.java 70.28% <0.00%> (-0.67%) ⬇️
...va/org/tikv/common/region/StoreHealthyChecker.java 72.15% <0.00%> (+1.63%) ⬆️
...n/java/org/tikv/common/util/ConcreteBackOffer.java 83.78% <0.00%> (-0.62%) ⬇️
src/main/java/org/tikv/common/PDClient.java 56.63% <50.00%> (-2.84%) ⬇️
...java/org/tikv/common/operation/PDErrorHandler.java 38.70% <50.00%> (ø)
.../main/java/org/tikv/common/AbstractGRPCClient.java 42.30% <61.53%> (-3.64%) ⬇️
...main/java/org/tikv/common/util/ChannelFactory.java 62.80% <70.40%> (+6.60%) ⬆️
src/main/java/org/tikv/common/TiConfiguration.java 66.86% <100.00%> (+0.87%) ⬆️
...ain/java/org/tikv/common/util/BackOffFunction.java 82.35% <100.00%> (+0.53%) ⬆️
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3163793...1e077c2. Read the comment docs.

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@@ -213,14 +214,15 @@ public ChannelFactory(
    this.keepaliveTimeout = keepaliveTimeout;
    this.idleTimeout = idleTimeout;
    this.certContext = new JksContext(jksKeyPath, jksKeyPassword, jksTrustPath, jksTrustPassword);
    reloadSslContext();
  }

  @VisibleForTesting
  public boolean reloadSslContext() {
    if (certContext != null) {
      SslContextBuilder newBuilder = certContext.reload();
Member

There is still a race between the isModified check and the rest of the initialization logic.

Member

Reloading in the constructor helps mitigate the NPE, but the race can still cause later certificate reloads to not work as expected.

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@iosmanthus
Member Author

@sunxiaoguang PTAL

Contributor

@zhangyangyu zhangyangyu left a comment

Currently, all concurrency control applies to "getting the latest SSLContext builder" and then "using it to initialize the connection pool correctly". But what if two consecutive cert changes happen in quick succession? Then we might use two different SSLContext builders to initialize the pool, because when the second reload calls connpool.clear, the first call to getChannel might not yet have started initializing the pool.

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
  sslContext = sslContextBuilder.build();
} catch (SSLException e) {
  logger.error("create ssl context failed!", e);
  return null;
Contributor

Return null or raise an exception? What should the user expect, and how can they recover?

Member Author

Changed to raise an exception.
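
A minimal sketch of the "raise instead of return null" behavior agreed on above. It does not use the real Netty types: `ContextSource`, `SslContextLoadDemo`, and `buildOrThrow` are hypothetical names standing in for `SslContextBuilder.build()` and its caller, and the choice of `IllegalStateException` is an assumption, not what the PR necessarily throws.

```java
import javax.net.ssl.SSLException;

// Hypothetical stand-in for a builder whose build() may fail with SSLException.
interface ContextSource<T> {
  T build() throws SSLException;
}

final class SslContextLoadDemo {
  // Fail loudly instead of returning null, so a later caller cannot hit an NPE.
  static <T> T buildOrThrow(ContextSource<T> source) {
    try {
      return source.build();
    } catch (SSLException e) {
      // Assumption: an unchecked exception carrying the original cause.
      throw new IllegalStateException("create ssl context failed", e);
    }
  }

  public static void main(String[] args) {
    String ok = buildOrThrow(() -> "context");
    System.out.println("built: " + ok);
    try {
      buildOrThrow(() -> {
        throw new SSLException("bad cert");
      });
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getCause().getMessage());
    }
  }
}
```

With this shape, the caller sees the failure at the point where the context is built rather than as a null dereference somewhere later.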

if (reloadSslContext()) {
logger.info("invalidate connection pool");
@VisibleForTesting
public synchronized ManagedChannel reload(String address, HostMapping mapping) {

I am worried about the performance of getChannel(). Renewing a cert is a very infrequent operation, so synchronizing and performing disk IO on every call is a big cost.

Also, from my point of view, client-java needs high throughput when working as a cache client.
This is quite similar to service discovery in an RPC client, where we wouldn't refresh on every getChannel call.

Member Author

@iosmanthus iosmanthus Apr 26, 2022

One optimization to avoid synchronous disk IO is to use a WatchService to notify the ChannelFactory when the cert changes. A TODO has been added.

Will you implement this enhancement soon?

Member Author

Done. PTAL
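
The idea discussed in this thread can be sketched roughly as follows: a background watcher sets a flag when a cert file changes, and the hot path only reads that flag instead of locking and touching the disk. `CertChangeFlag` and its method names are hypothetical and are not taken from the PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: the watcher thread calls markChanged(),
// and the getChannel() path calls consumeChange().
final class CertChangeFlag {
  private final AtomicBoolean changed = new AtomicBoolean(false);

  // Called by the file-watching thread when a cert file is modified.
  void markChanged() {
    changed.set(true);
  }

  // Called on the hot path: one atomic compare-and-set, no lock, no disk IO.
  // Returns true for exactly one caller per change, who then performs the reload.
  boolean consumeChange() {
    return changed.compareAndSet(true, false);
  }

  public static void main(String[] args) {
    CertChangeFlag flag = new CertChangeFlag();
    flag.markChanged();
    if (flag.consumeChange()) {
      System.out.println("cert changed: rebuild SSL context and invalidate pool");
    }
  }
}
```

Because renewing a cert is rare, almost every call takes the cheap path where `consumeChange()` returns false.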

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
p.register(watchService, StandardWatchEventKinds.ENTRY_MODIFY);

WatchKey key;
while ((key = watchService.take()) != null) {

Under which condition would this equal null?

Member Author

According to https://docs.oracle.com/javase/7/docs/api/java/nio/file/WatchService.html#take(), take() should always return a non-null key, so this null check seems redundant.
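
For comparison, a small sketch of the two retrieval methods: `WatchService.poll()` returns null when no key is queued, whereas `take()` blocks until one is available, which is why a null check on its result adds nothing. The temp directory and the `WatchPollDemo` class are illustrative only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

final class WatchPollDemo {
  // Registers a fresh, untouched temp directory and polls once without waiting.
  static WatchKey pollOnce() {
    try (WatchService ws = FileSystems.getDefault().newWatchService()) {
      Path dir = Files.createTempDirectory("certs");
      dir.toFile().deleteOnExit();
      dir.register(ws, StandardWatchEventKinds.ENTRY_MODIFY);
      // Non-blocking: a null result only means that nothing has been queued yet.
      // take() would instead park the calling thread until a key is queued.
      return ws.poll();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("poll() returned: " + pollOnce());
  }
}
```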

logger.info("detected file change: {}", event.context());
if (event.context().toString().equals(target)) {
  changed = true;
  break OUTER;

Why break here?

Member Author

If all the certs in the base path change at once, multiple events will be triggered. If we consumed these events one by one, multiple reloads would be triggered. This break ensures that a batch of events triggers only a single reload.
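
The coalescing described above can be illustrated with a simplified sketch. Plain file names stand in for `WatchEvent.context()`, and `CoalesceDemo`, `reloadsFor`, and the file names are hypothetical; the real code works on the events returned by `key.pollEvents()`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CoalesceDemo {
  // Counts the reloads triggered by one batch of change events.
  // The scan stops at the first name found in targets, so the result is 0 or 1.
  static int reloadsFor(List<String> changedFiles, Set<String> targets) {
    int reloads = 0;
    for (String name : changedFiles) {
      if (targets.contains(name)) {
        reloads++;
        break; // one matching event is enough; skip the rest of the batch
      }
    }
    return reloads;
  }

  public static void main(String[] args) {
    Set<String> targets = new HashSet<>(Arrays.asList("ca.pem", "client.pem", "client.key"));
    List<String> batch = Arrays.asList("ca.pem", "client.pem", "client.key");
    System.out.println("reloads: " + reloadsFor(batch, targets)); // prints "reloads: 1"
  }
}
```

Without the `break`, the same batch would count three reloads, one per changed file.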

private final CertContext certContext;

@VisibleForTesting
public final ConcurrentHashMap<Pair<SslContextBuilder, String>, ManagedChannel> connPool =

The map's key type doesn't implement hashCode() and equals(). Could this introduce bugs?

Member Author

I will wrap SslContextBuilder in a type with an epoch.
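
One possible shape for such a key, as a sketch only: an epoch number identifies the SSL context generation, so key equality no longer depends on the builder object. `ChannelKey`, its fields, and the `String` values standing in for `ManagedChannel` are all assumptions, not the type the PR ends up with.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical pool key: (epoch of the SSL context, target address).
final class ChannelKey {
  final long epoch;
  final String address;

  ChannelKey(long epoch, String address) {
    this.epoch = epoch;
    this.address = Objects.requireNonNull(address);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof ChannelKey)) return false;
    ChannelKey other = (ChannelKey) o;
    return epoch == other.epoch && address.equals(other.address);
  }

  @Override
  public int hashCode() {
    return Objects.hash(epoch, address);
  }
}

final class EpochKeyDemo {
  public static void main(String[] args) {
    ConcurrentHashMap<ChannelKey, String> pool = new ConcurrentHashMap<>();
    pool.computeIfAbsent(new ChannelKey(1, "127.0.0.1:20160"), k -> "channel-a");
    // Same epoch and address: an equal key, so the existing entry is reused.
    pool.computeIfAbsent(new ChannelKey(1, "127.0.0.1:20160"), k -> "channel-b");
    // A reload bumps the epoch, so a fresh entry is created.
    pool.computeIfAbsent(new ChannelKey(2, "127.0.0.1:20160"), k -> "channel-c");
    System.out.println("entries: " + pool.size()); // prints "entries: 2"
  }
}
```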

@xuanyu66

@shiyuhang0 PTAL

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@iosmanthus
Member Author

@xuanyu66 PTAL

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@xuanyu66

@iosmanthus Please check the failed CI tests

WatchKey key = watchService.take();
boolean changed = false;
OUTER:
for (WatchEvent<?> event : key.pollEvents()) {

@xuanyu66 xuanyu66 Apr 28, 2022

key.pollEvents() is non-blocking, so the method keeps running, which may waste CPU resources.

Member Author

This method will not keep running: once the events have been taken out of the queue, this block will exit and wait at watchService.take().

Got it

private void cleanExpireConn(Collection<ManagedChannel> pending) {
  for (ManagedChannel channel : pending) {
    logger.info("cleaning expire channels");
    channel.shutdownNow();

Why use shutdownNow()? Some channels may be in use right now.

Member Author

If the channel is in use right now, the user of the channel will receive an exception whether we call shutdown or shutdownNow, and a retry is expected.

got it

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
// Check the modification time of each file in `targets`.
// If any of them has changed, a reload is needed.
for (int i = 0; i < targets.size(); i++) {
  long lastModified = targets.get(i).lastModified();
Contributor

lastModified might raise an error due to permissions. Maybe we should wrap the whole needReload body in a try/catch to log a meaningful error?

Member Author

Done.
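
A rough sketch of what such a guard might look like. This is a standalone variant with hypothetical names and signature (`ReloadCheckDemo`, a `lastSeen` array of previously observed timestamps), not the PR's actual `needReload`. Per the `java.io.File` documentation, `lastModified()` throws `SecurityException` when a security manager denies read access, and returns `0L` if the file does not exist or an I/O error occurs.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

final class ReloadCheckDemo {
  // Returns true if any target's modification time differs from the recorded one.
  // lastSeen[i] holds the timestamp observed for targets.get(i) on the previous check.
  static boolean needReload(List<File> targets, long[] lastSeen) {
    try {
      boolean changed = false;
      for (int i = 0; i < targets.size(); i++) {
        long lastModified = targets.get(i).lastModified();
        if (lastModified != lastSeen[i]) {
          lastSeen[i] = lastModified;
          changed = true;
        }
      }
      return changed;
    } catch (SecurityException e) {
      // Log something meaningful and keep the current SSL context.
      System.err.println("failed to check cert modification time: " + e);
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    File cert = File.createTempFile("cert", ".pem");
    cert.deleteOnExit();
    long[] lastSeen = {cert.lastModified()};
    System.out.println("before touch: " + needReload(Arrays.asList(cert), lastSeen));
    cert.setLastModified(lastSeen[0] + 10_000);
    System.out.println("after touch: " + needReload(Arrays.asList(cert), lastSeen));
  }
}
```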

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
xuanyu66
xuanyu66 previously approved these changes May 17, 2022
@ti-srebot
Collaborator

@xuanyu66, Thanks for your review. The bot only counts LGTMs from Reviewers and higher roles, but you're still welcome to leave your comments. You are not a reviewer or committer or co-leader or leader.

shiyuhang0
shiyuhang0 previously approved these changes May 17, 2022
@ti-srebot
Collaborator

@shiyuhang0, Thanks for your review. The bot only counts LGTMs from Reviewers and higher roles, but you're still welcome to leave your comments. You are not a reviewer or committer or co-leader or leader.

zhangyangyu
zhangyangyu previously approved these changes May 17, 2022
@ti-srebot
Collaborator

@zhangyangyu, Thanks for your review. The bot only counts LGTMs from Reviewers and higher roles, but you're still welcome to leave your comments. You are not a reviewer or committer or co-leader or leader.

sunxiaoguang
sunxiaoguang previously approved these changes May 17, 2022

private int connRecycleTime;
Member

This should be final as well.

Member Author

Done.

@sunxiaoguang
Member

/merge

@ti-srebot
Collaborator

/run-all-tests

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
sunxiaoguang
sunxiaoguang previously approved these changes May 17, 2022
@iosmanthus
Member Author

/run-all-tests

@sunxiaoguang
Member

/run-all-tests

Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@iosmanthus iosmanthus enabled auto-merge (squash) May 17, 2022 15:54
@iosmanthus iosmanthus merged commit 774ee50 into tikv:master May 18, 2022

Successfully merging this pull request may close these issues.

TLS reloading encounter NPE while testing with multithread
6 participants