-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[close #596] fix tls reloading encouter slow thread npe #597
[close #596] fix tls reloading encouter slow thread npe #597
Conversation
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Codecov Report
@@ Coverage Diff @@
## master #597 +/- ##
============================================
+ Coverage 33.82% 34.13% +0.31%
- Complexity 1354 1371 +17
============================================
Files 270 270
Lines 17174 17244 +70
Branches 1956 1958 +2
============================================
+ Hits 5809 5887 +78
+ Misses 10751 10746 -5
+ Partials 614 611 -3
Continue to review full report at Codecov.
|
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@@ -213,14 +214,15 @@ public ChannelFactory( | |||
this.keepaliveTimeout = keepaliveTimeout; | |||
this.idleTimeout = idleTimeout; | |||
this.certContext = new JksContext(jksKeyPath, jksKeyPassword, jksTrustPath, jksTrustPassword); | |||
reloadSslContext(); | |||
} | |||
|
|||
@VisibleForTesting | |||
public boolean reloadSslContext() { | |||
if (certContext != null) { | |||
SslContextBuilder newBuilder = certContext.reload(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still a race between isModified
check and the rest of initialization logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reload in constructor can help mitigating NPE. But the race can make later certificate reload not working as expected.
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@sunxiaoguang PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, all concurrency control applies to "getting the latest SSLContext builder" and then "use it to initialize the connection pool correctly". But what if there are two consecutive cert change happening very quickly? Then we might use two different SSLContext builder to initialize the pool since when the second reload
calls connpool.clear
, the first call to getChannel
might still not start to initialize the pool.
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
sslContext = sslContextBuilder.build(); | ||
} catch (SSLException e) { | ||
logger.error("create ssl context failed!", e); | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return null
or raise exception
? What should the user expect and how they could restore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to raise exception
.
if (reloadSslContext()) { | ||
logger.info("invalidate connection pool"); | ||
@VisibleForTesting | ||
public synchronized ManagedChannel reload(String address, HostMapping mapping) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worried about the performance of getChannel(). Since renewing cert is a very low frequent operation, it's a big cost to sync and invoke a disk IO every time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And from my point of view, client-java needs to have a high throughput when working as a cache client.
It's a situation quite like service discovery in RPC client, in which we wouldn't refresh every time getChannel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an optimization to avoid sync disk IO, using a WatchService
to notify the ChannelFactory
if the cert is changed. A TODO is added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you enhance it recently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. PTAL
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
…oad-slow-thread-npe
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
p.register(watchService, StandardWatchEventKinds.ENTRY_MODIFY); | ||
|
||
WatchKey key; | ||
while ((key = watchService.take()) != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under which condition would equal null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.oracle.com/javase/7/docs/api/java/nio/file/WatchService.html#take() should always return notnull seem this null check is redundant.
logger.info("detected file change: {}", event.context()); | ||
if (event.context().toString().equals(target)) { | ||
changed = true; | ||
break OUTER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why break here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all the certs in the base path change in one shot, multiple events will be triggered. If we consume all of these events one by one, multiple reload will be triggered. This break
controls a list of events that will trigger only once reload.
private final CertContext certContext; | ||
|
||
@VisibleForTesting | ||
public final ConcurrentHashMap<Pair<SslContextBuilder, String>, ManagedChannel> connPool = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key of Map doesn't implement hashCode() and equals(). Will it bring some potential bugs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will wrap SslContextBuilder into a type with epoch.
@shiyuhang0 PTAL |
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@xuanyu66 PTAL |
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@iosmanthus Plz check the failed CI tests |
WatchKey key = watchService.take(); | ||
boolean changed = false; | ||
OUTER: | ||
for (WatchEvent<?> event : key.pollEvents()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key.pollEvents() is non-blocking, so the method is keeping running, which may waste CPU resource
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method will not keep running, once the events were taken out of the queue, this block will exist and wait at watchService.take()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
private void cleanExpireConn(Collection<ManagedChannel> pending) { | ||
for (ManagedChannel channel : pending) { | ||
logger.info("cleaning expire channels"); | ||
channel.shutdownNow(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use shutdownNow()? Maybe some channel is being used right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the channel is used right now, no matter calling shutdown or shutdownNow
, the user of the channel will receive an exception, a retry is expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
// Check all the modification of the `targets`. | ||
// If one of them changed, means to need reload. | ||
for (int i = 0; i < targets.size(); i++) { | ||
long lastModified = targets.get(i).lastModified(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lastModified
might raise error due to permission. Maybe we need to catch the whole needReload
body to give a meaningful error log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
@xuanyu66, Thanks for your review. The bot only counts LGTMs from Reviewers and higher roles, but you're still welcome to leave your comments. You are not a reviewer or committer or co-leader or leader. |
@shiyuhang0, Thanks for your review. The bot only counts LGTMs from Reviewers and higher roles, but you're still welcome to leave your comments. You are not a reviewer or committer or co-leader or leader. |
@zhangyangyu, Thanks for your review. The bot only counts LGTMs from Reviewers and higher roles, but you're still welcome to leave your comments. You are not a reviewer or committer or co-leader or leader. |
|
||
private int connRecycleTime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be final as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/merge |
/run-all-tests |
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
548f243
/run-all-tests |
1 similar comment
/run-all-tests |
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus <myosmanthustree@gmail.com>
Signed-off-by: iosmanthus myosmanthustree@gmail.com
What problem does this PR solve?
Issue Number: close #596
Problem Description:
getChannel
will encounter NPE for the slow thread might meet a null if the cert is unmodified.What is changed and how does it work?
Add an initial state for
SslContextBuilder
Check List for Tests
This PR has been tested by at least one of the following methods:
Related changes