
Tower 72 fix #930

Closed

wants to merge 5 commits

Conversation


@coin1111 coin1111 commented Jan 1, 2022

Motivation

Problem #1 below was found by Barmalei, who also suggested the fix. While working on it, I came across problem #2.

There are a couple of problems with the tower backlog code:

  1. The tower can submit more than 72 proofs per epoch, but any proof beyond the first 72 is rejected with a MoveAbort error. Looking at the explorer, this is a fairly frequent error and may have contributed to the recent network problems: a number of miners accumulated proofs while unable to send them, then submitted them all at once when a public node became available, and kept retrying, causing MoveAbort errors and congesting the network.
  2. If a proof cannot be submitted, the backlog code moves on to the second proof in the backlog, violating the proof sequence. This makes no sense, since that proof can never be accepted while the previous one has not been sent. This problem was also reported during the network incident, when proofs were sent out of order, very likely due to this bug.

The fix is to:

  1. keep track of how many proofs were sent and not go over 72;
  2. break out of the backlog loop if a submission error is encountered.

One gotcha: count_proofs_in_epoch does not necessarily count the proofs submitted in the current epoch, so an additional check is required before this field can be used.
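The two fixes above can be sketched as a single submission loop (a minimal illustration only, not the actual tower code; `Proof`, `submit_proof`, and the function signature are hypothetical):

```rust
// Hypothetical sketch of the fixed backlog-submission loop.
// `Proof` and `submit_proof` stand in for the real tower types/RPC.

const MAX_PROOFS_PER_EPOCH: usize = 72;

struct Proof;

fn submit_proof(_p: &Proof) -> Result<(), String> {
    Ok(()) // placeholder for the real network submission
}

/// Submit the backlog in order, honoring the per-epoch cap (fix #1)
/// and stopping at the first submission error (fix #2).
/// Returns the number of proofs actually submitted.
fn submit_backlog(backlog: &[Proof], proofs_in_epoch: usize) -> usize {
    let mut submitted = 0;
    for proof in backlog {
        // Fix #1: never exceed the per-epoch proof limit.
        if proofs_in_epoch + submitted >= MAX_PROOFS_PER_EPOCH {
            break;
        }
        // Fix #2: break on error instead of skipping to the next
        // proof, which would send proofs out of order.
        match submit_proof(proof) {
            Ok(()) => submitted += 1,
            Err(_) => break,
        }
    }
    submitted
}

fn main() {
    let backlog = vec![Proof, Proof, Proof];
    // 70 proofs already in the epoch: only 2 more fit under the cap.
    assert_eq!(submit_backlog(&backlog, 70), 2);
    // Cap already reached: nothing is submitted.
    assert_eq!(submit_backlog(&backlog, 72), 0);
}
```

The key design point is that the cap check uses `proofs_in_epoch + submitted`, so the loop accounts for proofs already on chain, not just the ones sent in this invocation.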

Have you read the Contributing Guidelines on pull requests? Yes.

Test Plan

I tested with a single miner with a few pre-computed proofs. The code works as expected.

Related PRs

(If this PR adds or changes functionality, please take some time to update or suggest changes to the docs at https://developers.diem.com, and link to your PR here.)

If targeting a release branch, please fill the below out as well

  • Justification and breaking nature (who does it affect? validators, full nodes, tooling, operators, AOS, etc.)
  • Comprehensive test results that demonstrate the fix working and not breaking existing workflows.
  • Why we must have it for V1 launch.
  • What workarounds and alternatives we have if we do not push the PR.

/// Submit a backlog of blocks that may have been mined while network is offline.
/// Likely not more than 1.
// only 72 proofs are allowed per epoch
const MAX_PROOFS_PER_EPOCH: i64 = 72;
Contributor


Can this be fetched from the Move globals so that we don't have duplication?

Collaborator


My preference is to catch the error code which Move throws when the upper bound is reached: 130108.
https://github.com/OLSF/libra/blob/main/language/diem-framework/modules/0L/TowerState.move#L278-L281

More importantly we're missing some error maps of 0L specific errors which the client should identify. See this draft issue I opened. #933

Author


I will fetch it from the global store.

Author


Actually, after some thinking, I don't think this approach works. With my change, tower backlog submission already aborts on any error from submission, so regardless of the error code we should still abort submission. This solves problem #2 I found with the tower.
And your suggestion to watch for a specific error doesn't solve problem #1 in the tower: we should not call backlog submission at all once we have reached the max number of proofs in the epoch. Otherwise each tower will keep sending at least one proof every 30 minutes, causing a MoveAbort.

Collaborator


The thing to keep in mind is that MAX_PROOFS_PER_EPOCH will not always be 72. And users need the tower to continue making proofs if there's trouble connecting to the network or submitting proofs (the tower app should not exit).

Author


How do I read the number 72 from the Move config? Is there a code snippet to do that? When the backlog call returns (success or failure), the tower proceeds to mine the next proof, which will be submitted in the next invocation of the backlog, and so on.

Collaborator


We have a function in Move which fetches this hardcoded value, but we don't keep it stored in state. So we may not actually be able to get it dynamically through the usual APIs.

Contributor


I think it's fine to merge this PR, even if we will then have the allowed proofs per epoch configured in two places, given the high complexity of making it perfect.

@0o-de-lally
Collaborator

@coin1111 I've included these changes in #952 since there were other refactors in transaction sending. I'll close this PR for now.

4 participants