Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream Raft Messages and Fix Check Quorum #3138

Merged
merged 12 commits into from
Mar 20, 2019
Merged

Conversation

manishrjain
Copy link
Contributor

@manishrjain manishrjain commented Mar 14, 2019

Instead of sending the Raft messages, one message per gRPC call, this PR creates a one-way stream between the sender and the receiver. Each messages gets pushed to a channel. We use smart batching to pick up as many messages as we can and send them over the stream in order. If we see connection issues etc., there are mechanisms in place to recreate the stream.

Another issue I saw was related to Zero being unable to maintain quorum. It was because of an unbuffered channel in checkQuorum asking for read index, which didn't allow multiple requests to be pushed into one batch causing check quorum to fail even with one second timeout. After allocating a buffered channel, all the check quorum requests finish within a millisecond, rarely going above 7ms in my tests.

This change is Reviewable

@manishrjain manishrjain marked this pull request as ready for review March 15, 2019 00:32
Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @manishrjain)


conn/node.go, line 379 at r2 (raw file):

/ Exit after a thousand tries

"Exit after at least a thousand tries or ten seconds" to match what the code is actually doing.


conn/node.go, line 386 at r2 (raw file):

So that we print error only a few times

nit: "Update lastLog so that we print the error ... "

Copy link
Contributor Author

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 5 files reviewed, 2 unresolved discussions (waiting on @martinmr)


conn/node.go, line 379 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
/ Exit after a thousand tries

"Exit after at least a thousand tries or ten seconds" to match what the code is actually doing.

Done.


conn/node.go, line 386 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
So that we print error only a few times

nit: "Update lastLog so that we print the error ... "

Done.

@manishrjain manishrjain changed the title Stream Raft Messages Stream Raft Messages and Fix Check Quorum Mar 20, 2019
@manishrjain manishrjain merged commit a57dfc0 into master Mar 20, 2019
@manishrjain manishrjain deleted the mrjn/stream-raft-messages branch March 20, 2019 01:16
manishrjain added a commit that referenced this pull request Mar 20, 2019
Instead of sending the Raft messages, one message per gRPC call, this PR creates a one-way stream between the sender and the receiver. Each messages gets pushed to a channel. We use smart batching to pick up as many messages as we can and send them over the stream in order. If we see connection issues etc., there are mechanisms in place to recreate the stream.

Another issue I saw was related to Zero being unable to maintain quorum. It was because of an unbuffered channel in checkQuorum asking for read index, which didn't allow multiple requests to be pushed into one batch causing check quorum to fail even with one second timeout. After allocating a buffered channel, all the check quorum requests finish within a millisecond, rarely going above 7ms in my tests.

Changes:

* Stream raft messages instead of sending them one by one.
* Set duration to 10s
* Zero checkQuorum works well now
* Martin's comments
* Adjust timeouts in contexts, so the deeper one has shorter timeout and the outer one has longer.
* Batch up multiple Raft messages from channel and send them in one request.
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
Instead of sending the Raft messages, one message per gRPC call, this PR creates a one-way stream between the sender and the receiver. Each messages gets pushed to a channel. We use smart batching to pick up as many messages as we can and send them over the stream in order. If we see connection issues etc., there are mechanisms in place to recreate the stream.

Another issue I saw was related to Zero being unable to maintain quorum. It was because of an unbuffered channel in checkQuorum asking for read index, which didn't allow multiple requests to be pushed into one batch causing check quorum to fail even with one second timeout. After allocating a buffered channel, all the check quorum requests finish within a millisecond, rarely going above 7ms in my tests. 

Changes:

* Stream raft messages instead of sending them one by one.
* Set duration to 10s
* Zero checkQuorum works well now
* Martin's comments
* Adjust timeouts in contexts, so the deeper one has shorter timeout and the outer one has longer.
* Batch up multiple Raft messages from channel and send them in one request.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants