Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Opencensus and fix Jepsen bank test #2764

Merged
merged 12 commits into from
Nov 20, 2018
Merged

Conversation

manishrjain
Copy link
Contributor

@manishrjain manishrjain commented Nov 20, 2018

Got it. Fixed it. Squashed it. Done with it. Tamed it. Time to bask in the glory of victory!


This change is Reviewable

@manishrjain manishrjain merged commit 3b6d817 into master Nov 20, 2018
@manishrjain manishrjain deleted the mrjn/jepsen-zero branch November 20, 2018 23:02
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
Got it. Fixed it. Squashed it. Done with it. Tamed it. Time to bask in the glory of victory!

Fixed Jepsen bank test violation during a network partition. The violation was happening because:
1. Current Zero leader receives a commit request and assigns a commit timestamp.
2. Gets partitioned out, so the proposal gets blocked.
3. Another Zero becomes the leader and renews the txn timestamp lease, starting at a much higher number (previous lease + lease bandwidth of 10K).
4. The new leader services new txn requests, which advances all Alphas to a much higher MaxApplied timestamp. 
5. The previous leader, who is now the follower, retries the old commit proposal and succeeds. This causes 2 issues.
  a) A later txn read doesn't see this commit, and advances.
  b) Alpha rejects a write on a posting list key, which already has a new commit at higher ts.
Both of the scenarios caused bank txn test violation.

Open Census and Dgraph debug disect was instrumental in determining the cause of this violation. The tell-tale sign was noticing a /Commit timeout of one of the penultimate commits, to the violating commit.

Fixes:

0. Use OpenCensus as a torch to light every path that a txn took, to determine the cause.
1. Do not allow a context deadline when proposing a txn commit.
2. Do not allow Zero follower to propagate proposals to Zero leader.

Tested this PR to fix issue dgraph-io#2391 . Tested with partition-ring and clock-skew nemesis in a 10-node cluster. More testing is needed around predicate moves.

Changelog:

* Trial: Do not purge anything (needs to be refactored and reintroduced before a release).
* More debugging, more tracing for Jepsen.
* Opencensus in Zero.
* One fix for Jepsen test, which ensures that a context deadline cannot just cancel a txn proposal. Need some refactoring of how Zero gets membership updates, now that we need Zero to not forward proposals to the leader.
* Update Raft lib, so we have access to the feature to disallow Raft proposal forwarding.
* Update raftpb proto package as well
* Dirty changes to ensure that Zero followers don't forward proposals to Zero leader.
* Various Opencensus integration changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

1 participant