Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot make block after skipped x/upgrade #8538

Closed
4 tasks
ethanfrey opened this issue Feb 8, 2021 · 16 comments
Closed
4 tasks

Cannot make block after skipped x/upgrade #8538

ethanfrey opened this issue Feb 8, 2021 · 16 comments
Assignees
Labels
Milestone

Comments

@ethanfrey
Copy link
Contributor

ethanfrey commented Feb 8, 2021

Summary of Bug

We tried doing x/upgrade on musselnet-2 with --upgrade-time (previously only tested with --upgrade-height) and it failed. (@orkunkl will add another issue documenting this)

We then decided to just skip it, using --unsafe-skip-upgrade on our validators (to whom we had staked almost 75% of the voting power). The nodes then came up and starting participating in consensus, but never came to make a new block, even though we have > 67% voting power in pre-commits.

Version

wasmd 0.14.0 based on cosmos-sdk v0.40.1 (we were attempted to upgrade to wasmd 0.15.0 / cosmos-sdk v0.41.0)

Steps to Reproduce

Not sure. I am dumping some output of the rpc nodes in hope they clarify what is going on. Rounds 2 to 31 should have produced a block from my understanding. (Please rename from .json.txt to .json, github did not allow json uploads)

Consensus state:
consensus_state.json.txt

Validators:
validators.json.txt

Validators (previous block):
validators-previous.json.txt

Key example from consensus state (height_vote_set):

        {
          "round": 2,
          "prevotes": [
            "Vote{0:6AF22951B169 445131/02/SIGNED_MSG_TYPE_PREVOTE(Prevote) C7152A70B514 7AEC1669862B @ 2021-02-08T09:56:17.004097353Z}",
            "Vote{1:07D76D2F9D17 445131/02/SIGNED_MSG_TYPE_PREVOTE(Prevote) 000000000000 E0700879FC9E @ 2021-02-08T09:56:17.096206306Z}",
            "Vote{2:AEF865B54EAC 445131/02/SIGNED_MSG_TYPE_PREVOTE(Prevote) C7152A70B514 AE743F755C5B @ 2021-02-08T09:56:17.452091526Z}",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote"
          ],
          "prevotes_bit_array": "BA{11:xxx________} 4034984861364/5380640551050 = 0.75",
          "precommits": [
            "Vote{0:6AF22951B169 445131/02/SIGNED_MSG_TYPE_PRECOMMIT(Precommit) 000000000000 176D5326E96C @ 2021-02-08T09:56:19.206038981Z}",
            "Vote{1:07D76D2F9D17 445131/02/SIGNED_MSG_TYPE_PRECOMMIT(Precommit) 000000000000 EB4DD3D144F8 @ 2021-02-08T09:56:19.10513201Z}",
            "Vote{2:AEF865B54EAC 445131/02/SIGNED_MSG_TYPE_PRECOMMIT(Precommit) 000000000000 C6DBF37D6384 @ 2021-02-08T09:56:19.349238122Z}",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote"
          ],
          "precommits_bit_array": "BA{11:xxx________} 4034984861364/5380640551050 = 0.75"
        }

75% precommits, but it does not advance?


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@aaronc
Copy link
Member

aaronc commented Feb 8, 2021

Well that's alarming.

To begin with I think we should just remove time based upgrades. I'm not sure what's happening there but it seems unreliable and has not been as thoroughly tested.

@aaronc aaronc added this to the v0.42 milestone Feb 8, 2021
@kaustubhkapatral
Copy link
Contributor

We also encountered something similar on Akash network testnet. We did not use the upgrade module and went for manual export and migration to v0.41. We could not get the chain to start at all even though we had 73% consensus at one point.

$ curl -s localhost:51657/consensus_state | jq -r ".result.round_state.height_vote_set[].prevotes_bit_array"
BA{30:xxxx__x___x_x__xxx____x_xx____} 10919167/15735275 = 0.69
BA{30:xxxx__x___x_x__xxx____x_xx____} 10919167/15735275 = 0.69
BA{30:xxxx__x___x_x__xxx____x_xx____} 10919167/15735275 = 0.69
BA{30:xxxx__x___x_x__xxx____x_xx____} 10919167/15735275 = 0.69
BA{30:xxxx__x___x_x__xxx____x_xx____} 10919167/15735275 = 0.69
BA{30:xxxx__x___xxx__xxx____x_xx____} 11439240/15735275 = 0.73
BA{30:xxxx__x___xxx__xxx____x_xx____} 11439240/15735275 = 0.73
BA{30:xxxx__x___xxx__xxx____x_xx____} 11439240/15735275 = 0.73
BA{30:xxxx__x___xxx__xxx____x_xx____} 11439240/15735275 = 0.73
BA{30:xxxx__x___xxx__xxx____x_xx____} 11439240/15735275 = 0.73
BA{30:xxxx__x___xxx__xxx____x_xx____} 11439240/15735275 = 0.73
BA{30:xxxx__x___xxx__xxx____x__x____} 11388340/15735275 = 0.72
...
BA{30:xxxx__x___xx___x_x____x__x__x_} 10827963/15735275 = 0.69
BA{30:xxxx__x___xx___x_x____x__x__x_} 10827963/15735275 = 0.69
BA{30:xxxx__x___xx___x_x____x__x__x_} 10827963/15735275 = 0.69
BA{30:xx_x__x___xx___xxx____x_xx__xx} 9541992/15735275 = 0.61
BA{30:______________________________} 0/15735275 = 0.00

We reverted back to the original chain afterwards.

@ethanfrey
Copy link
Contributor Author

Nice to see this has been reproduced. But no idea on where it comes from.

Any core devs looking at this? Maybe it is a tendermint issue? @tessr could someone from your team take a look at these logs and maybe give us a hint what is off?

@tessr
Copy link
Contributor

tessr commented Feb 15, 2021

Thanks for tagging me on this - it's not obvious to me either if this is a Tendermint or SDK issue, but I'll ask someone on the team to take a look.

@melekes
Copy link
Contributor

melekes commented Feb 15, 2021

@ethanfrey is there anything in the logs? have you tried running it with debug log level?

@alexanderbez
Copy link
Contributor

alexanderbez commented Feb 15, 2021

@ethanfrey is there anything in the logs? have you tried running it with debug log level?

Preferably in pastebin

@ethanfrey
Copy link
Contributor Author

@melekes I am not sure the machines are still around, but I can see.

@kaustubhkapatral can you also try with the debug logs?

@orkunkl
Copy link

orkunkl commented Feb 15, 2021

Here are the debug logs from our main validator @alexanderbez https://pastebin.com/7Wgy9mh0

@clevinson clevinson mentioned this issue Feb 23, 2021
7 tasks
@clevinson
Copy link
Contributor

Speaking w/ @anilcse it seems this is something related to tendermint and not SDK. @tessr is there a corresponding tendermint issue we can link for tracking?

@tessr
Copy link
Contributor

tessr commented Feb 24, 2021

No - should we just transfer this issue over?

@ethanfrey
Copy link
Contributor Author

Feel free to move it over to tendermint. I was unclear what was going on here. It seem something similar has happened other places, so good to figure out what causes the liveness failure.

@tessr
Copy link
Contributor

tessr commented Mar 2, 2021

Turns out you can't move issues across different github orgs, so I've opened a new issue on tendermint/tendermint to point to this.

@alexanderbez
Copy link
Contributor

alexanderbez commented Mar 2, 2021

Here are the debug logs from our main validator @alexanderbez https://pastebin.com/7Wgy9mh0

This log corresponds to the statement below, correct?

We then decided to just skip it, using --unsafe-skip-upgrade on our validators (to whom we had staked almost 75% of the voting power). The nodes then came up and starting participating in consensus, but never came to make a new block, even though we have > 67% voting power in pre-commits.


And height 445130 is the last committed height of the chain before stalling out?

@alexanderbez
Copy link
Contributor

Ok so just by quickly grokking through the logs, I see that the validator does try to progress and commit a new height -- 445131. But then there is a round regression and then it eventually disconnects from all its peers.

Feb 15 16:23:31 validator cosmovisor[1841402]: 4:23PM ERR Error signing vote err="error signing vote: round regression at height 445131. Got 0, last round 33" height=445131 module=consensus round=0 vote={"block_id":{"hash":"","parts":{"hash":"","total":0}},"height":445131,"round":0,"signature":null,"timestamp":"2021-02-15T15:23:31.480915997Z","type":2,"validator_address":"6AF22951B16906D0027DE72885E0B47F29F957E5","validator_index":0}

This might be from the WAL replay and due to the way the upgrade-time logic was used. I agree with @aaronc and that height should only be used, because basing off of time could cause weird issues in terms of which steps which nodes run at which time, i.e. it's too subjective.

I'm not really sure there are any actionable items for the Tendermint team here to focus on @tessr.

@ethanfrey
Copy link
Contributor Author

I see that the validator does try to progress and commit a new height -- 445131. But then there is a round regression and then it eventually disconnects from all its peers.

Huh? After the precommits are there (from 71% of voting power), how can there be a "round regression"? What does that even mean?

@clevinson
Copy link
Contributor

Closing this in favor of #8801.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants