Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BTCD miss a block and subsequent blocks are all orphan #229

Closed
ejfhp opened this issue Jan 16, 2015 · 22 comments
Closed

BTCD miss a block and subsequent blocks are all orphan #229

ejfhp opened this issue Jan 16, 2015 · 22 comments
Labels

Comments

@ejfhp
Copy link

ejfhp commented Jan 16, 2015

New installation of btcd 0.9 on a linux debian. Today my node (depsi.zapto.org) has block count 339153 since last night. From the log it looks like it missed the 339154 block, and since the 339155 every block is added as orphan, even if the hash it logs correspond to the right current block (checked on blockchain.info).

Sometimes it seems to start requesting some old block but never the 339154. There's no other error.
The same thing happened yesterday, and after a restart everything went fine.
The node is at my home, and I have a dynamic ip... but other than this problem everything seems to work well.

Thanks.

04:54:26 2015-01-16 [INF] BMGR: Processed 1 block in the last 1m22.57s (375 transactions, height 339153, 2015-01-16 03:55:40 +0100 CET)
04:54:26 2015-01-16 [INF] BMGR: Lost peer 85.175.5.129:21291 (inbound)
04:55:18 2015-01-16 [INF] BMGR: New valid peer 148.251.238.178:60281 (inbound) (/getaddr.bitnodes.io:0.1/)
04:55:18 2015-01-16 [INF] BMGR: Lost peer 148.251.238.178:60281 (inbound)
04:55:39 2015-01-16 [INF] BMGR: New valid peer 112.124.22.80:53714 (inbound) (/bitcoin-seeder:0.01/)
04:55:41 2015-01-16 [INF] BMGR: Lost peer 112.124.22.80:53714 (inbound)
04:55:46 2015-01-16 [INF] BMGR: New valid peer 82.130.102.226:33830 (inbound) (/Snoopy:0.1/)
04:55:46 2015-01-16 [INF] BMGR: Lost peer 82.130.102.226:33830 (inbound)
04:55:48 2015-01-16 [INF] BMGR: New valid peer 27.204.8.63:49756 (inbound) (/Satoshi:0.9.3/)
04:56:30 2015-01-16 [INF] BMGR: New valid peer 88.198.131.156:39072 (inbound) (/bitcoin-seeder:0.01/)
04:56:31 2015-01-16 [INF] BMGR: Lost peer 88.198.131.156:39072 (inbound)
04:57:35 2015-01-16 [INF] BMGR: New valid peer 199.127.225.195:43501 (inbound) (/bitcoin-seeder:0.01/)
04:58:42 2015-01-16 [INF] BMGR: New valid peer 172.245.62.102:36818 (inbound) (/bitcoin-seeder:0.01/)
04:58:42 2015-01-16 [INF] BMGR: Lost peer 172.245.62.102:36818 (inbound)
05:00:18 2015-01-16 [INF] BMGR: New valid peer 148.251.238.178:49893 (inbound) (/getaddr.bitnodes.io:0.1/)
05:00:18 2015-01-16 [INF] BMGR: Lost peer 148.251.238.178:49893 (inbound)
05:01:03 2015-01-16 [INF] BMGR: New valid peer 82.130.102.226:37309 (inbound) (/Snoopy:0.1/)
05:01:03 2015-01-16 [INF] BMGR: Lost peer 82.130.102.226:37309 (inbound)
05:02:35 2015-01-16 [WRN] PEER: Peer 199.127.225.195:43501 (inbound) no answer for 5 minutes, disconnecting
05:02:35 2015-01-16 [INF] BMGR: Lost peer 199.127.225.195:43501 (inbound)
05:03:31 2015-01-16 [INF] BMGR: New valid peer 50.7.47.93:48617 (inbound) (/bitcoin-seeder:0.01/)
05:03:33 2015-01-16 [INF] BMGR: Lost peer 50.7.47.93:48617 (inbound)
05:05:39 2015-01-16 [INF] CHAN: Adding orphan block 000000000000000001e2aae6be846835bbebddbf01baadecebf6df532a37468b with parent 0000000000000000148c1ae2b62db079f8cc21501141e4bd582bebd1aa9b1c7f
05:06:30 2015-01-16 [INF] BMGR: New valid peer 82.130.102.226:47264 (inbound) (/Snoopy:0.1/)
05:06:30 2015-01-16 [INF] BMGR: Lost peer 82.130.102.226:47264 (inbound)
05:06:45 2015-01-16 [INF] BMGR: New valid peer 148.251.238.178:19837 (inbound) (/getaddr.bitnodes.io:0.1/)
05:06:45 2015-01-16 [INF] BMGR: Lost peer 148.251.238.178:19837 (inbound)
05:09:31 2015-01-16 [INF] BMGR: New valid peer 50.244.13.28:44584 (inbound) (/bitcoin-seeder:0.01/)
05:09:32 2015-01-16 [INF] BMGR: Lost peer 50.244.13.28:44584 (inbound)
05:10:34 2015-01-16 [INF] BMGR: New valid peer 178.62.20.190:46658 (inbound) (/bitcoin-seeder:0.01/)
05:10:41 2015-01-16 [INF] BMGR: Lost peer 178.62.20.190:46658 (inbound)
05:11:36 2015-01-16 [INF] BMGR: New valid peer 82.130.102.226:52825 (inbound) (/Snoopy:0.1/)
05:11:36 2015-01-16 [INF] BMGR: Lost peer 82.130.102.226:52825 (inbound)
05:13:29 2015-01-16 [INF] BMGR: New valid peer 106.187.90.222:52862 (inbound) (/bitcoin-seeder:0.01/)
05:13:30 2015-01-16 [INF] BMGR: Lost peer 106.187.90.222:52862 (inbound)
05:14:16 2015-01-16 [WRN] PEER: Peer 27.204.8.63:49756 (inbound) no answer for 5 minutes, disconnecting
05:14:16 2015-01-16 [INF] BMGR: Lost peer 27.204.8.63:49756 (inbound)
05:15:26 2015-01-16 [INF] BMGR: New valid peer 148.251.238.178:10378 (inbound) (/getaddr.bitnodes.io:0.1/)
05:15:27 2015-01-16 [INF] BMGR: Lost peer 148.251.238.178:10378 (inbound)
05:16:18 2015-01-16 [INF] BMGR: New valid peer 178.62.187.91:54127 (inbound) (/mining.bitcoinaffiliatenetwork.com amsterdam:0.10.0/)
05:16:58 2015-01-16 [INF] BMGR: New valid peer 82.130.102.226:35860 (inbound) (/Snoopy:0.1/)
05:16:58 2015-01-16 [INF] BMGR: Lost peer 82.130.102.226:35860 (inbound)
05:17:41 2015-01-16 [INF] CHAN: Adding orphan block 0000000000000000097b41645adfb23cd199e3ba0862867c13c65b44d9269bd5 with parent 000000000000000001e2aae6be846835bbebddbf01baadecebf6df532a37468b
05:18:18 2015-01-16 [INF] BMGR: New valid peer 112.124.22.80:60503 (inbound) (/bitcoin-seeder:0.01/)
05:18:19 2015-01-16 [INF] BMGR: Lost peer 112.124.22.80:60503 (inbound)
05:18:54 2015-01-16 [WRN] PEER: Peer 5.138.146.164:10987 (inbound) no answer for 5 minutes, disconnecting
05:18:54 2015-01-16 [INF] BMGR: Lost peer 5.138.146.164:10987 (inbound)
05:21:22 2015-01-16 [INF] CHAN: Adding orphan block 00000000000000000cd0c6701ebfa6f14fede7db8a8d037e458f44cd70c32ebf with parent 0000000000000000097b41645adfb23cd199e3ba0862867c13c65b44d9269bd5
05:21:39 2015-01-16 [INF] BMGR: New valid peer 148.251.238.178:41738 (inbound) (/getaddr.bitnodes.io:0.1/)
05:21:39 2015-01-16 [INF] BMGR: Lost peer 148.251.238.178:41738 (inbound)
05:21:59 2015-01-16 [INF] BMGR: New valid peer 82.130.102.226:39289 (inbound) (/Snoopy:0.1/)
05:22:01 2015-01-16 [INF] BMGR: Lost peer 82.130.102.226:39289 (inbound)
05:22:36 2015-01-16 [INF] BMGR: New valid peer 72.188.5.125:49571 (inbound) (/BitCoinJ:0.11.2/MultiBit:0.5.18/)
05:22:41 2015-01-16 [INF] BMGR: New valid peer 216.87.0.198:52453 (inbound) (/bitcoinj:0.12.2/Bitcoin Wallet:4.16/)
05:22:49 2015-01-16 [INF] CHAN: Adding orphan block 0000000000000000162155d22e03e2760e02c6748566adfda3671ed9b29a97ec with parent 00000000000000000cd0c6701ebfa6f14fede7db8a8d037e458f44cd70c32ebf

@jrick
Copy link
Member

jrick commented Jan 16, 2015

I've seen this issue a few times as well. FWIW, I'm also on a dynamic IP, behind NAT, and using --externalip. My node is running on a Windows 8.1 desktop.

The proper behavior would be for the node which sent the orphan to also send the parent. If btcd didn't receive the parent for some reason, it should getdata the parent blocks until the chain attaches, but this code doesn't seem to be working as intended.

When I have noticed it, the missing block can still be accepted over RPC using submitblock, so the block manager isn't totally locked up (just missing the needed blocks). However, one time when I was fumbling through btcctl and submitblock (I had to hardcode the block hex into btcctl because otherwise it would segfault, presumably due to the btcctl arguments being too long), it rejected the block because it was already downloaded and validated. So I don't think this is a permanent error.

After submitting blocks manually so that orphans can also be validated, while still downloading from the same syncpeer, the orphan problems continue to occur. If I change my syncpeer, the problems seem to go away or become less frequent. This indicates that the other node seems to be doing something we don't expect, and if we can narrow that down it should be easier to reproduce and fix our bug.

@ejfhp
Copy link
Author

ejfhp commented Jan 16, 2015

Thanks a lot, next time I'll try to use the submitblock, this time I restarted and now it is syncing well.
I'll set log to debug to see if there's something useful.

@jrick
Copy link
Member

jrick commented Jan 16, 2015

Great. Could you also run with profiling enabled (--profile=portnum)? If there is a synchronization bug between goroutines, it should be noticeable in the stack dumps.

@ejfhp
Copy link
Author

ejfhp commented Jan 16, 2015

Can you tell me the port num for profiling?

@jrick
Copy link
Member

jrick commented Jan 16, 2015

It can be anything (that you have permissions for). It runs a profiling server that you can then access from http://localhost:####/

I typically use 6060 for my godoc instance, 6061 for btcd, and 6062 for wallet, but you can be creative if you want. :)

@ejfhp
Copy link
Author

ejfhp commented Jan 16, 2015

Ahh.. ok! :)

I'll keep you updated.
Thanks

@ejfhp
Copy link
Author

ejfhp commented Jan 20, 2015

Again, this time stuck at block 339748.
I have some logs.

@jrick
Copy link
Member

jrick commented Jan 20, 2015

Can you post the logs of when it started and the contents of the "full goroutine stack dump" from the profiling server?

@ejfhp
Copy link
Author

ejfhp commented Jan 20, 2015

Dump: https://www.dropbox.com/s/e1d1o4prpbpadrc/dump.txt?dl=0
Log: https://www.dropbox.com/s/8hgydvz065nto8e/btcd.log?dl=0

I set debug just after, I hope this could help. If you need more just ask.
I'll wait some more time to see if it will sync by itself, blockcount is still 339748.

@jrick
Copy link
Member

jrick commented Jan 20, 2015

I hit this as well earlier today and have been poking at it. Something interesting that I noticed is that the reported height of my sync peer (the height when it connected) is after my current node's last processed block, so it must have not have correctly synced to the sync peer right after it connected.

Can you check your node to see if you're in the same situation? Look at the output of btcctl getpeerinfo and find the one with "syncnode": true, find the startingheight field, and then compare that to the block height from btcctl getblockcount?

@jrick
Copy link
Member

jrick commented Jan 20, 2015

Well, according to your logs, you got a new sync peer that connected with a height of 339752, processed one block and made it to 339748, and then started receiving orphans. Likely the same issue I'm seeing.

@ejfhp
Copy link
Author

ejfhp commented Jan 20, 2015

Now this is my situation:

btcctl getpeerinfo
"addr": "98.214.78.184:8333",
"services": "00000001",
"lastsend": 1421787620,
"lastrecv": 1421787620,
"bytessent": 2143822,
"bytesrecv": 160368885,
"pingtime": 0,
"conntime": 1421754701,
"version": 70002,
"subver": "/Satoshi:0.9.1/",
"inbound": false,
"startingheight": 339767,
"syncnode": true

btcctl getblockcount
339748

@ejfhp
Copy link
Author

ejfhp commented Jan 20, 2015

Is it right to have two entries for the same peer ip address? One with syncnode=true and the other with syncnode=false.

{
"addr": "98.214.78.184:8333",
"services": "00000001",
"lastsend": 1421788395,
"lastrecv": 1421788396,
"bytessent": 2200016,
"bytesrecv": 161719211,
"pingtime": 0,
"conntime": 1421754701,
"version": 70002,
"subver": "/Satoshi:0.9.1/",
"inbound": false,
"startingheight": 339767,
"syncnode": true
}
{
"addr": "98.214.78.184:8333",
"services": "00000001",
"lastsend": 1421788399,
"lastrecv": 1421788395,
"bytessent": 1650,
"bytesrecv": 4214,
"pingtime": 0,
"conntime": 1421788329,
"version": 70002,
"subver": "/Satoshi:0.9.1/",
"inbound": false,
"startingheight": 339815,
"syncnode": false
}

I also have a couple of peers with height less than mine but they are not syncnodes:

"startingheight": 283676,
"syncnode": false

@ejfhp
Copy link
Author

ejfhp commented Jan 20, 2015

I tried to add and remove nodes but nothing happens, it seems to be stick on this node even if it is not able to sync correctly.

@jrick
Copy link
Member

jrick commented Jan 20, 2015

That is a bitcoind compatibility "feature". You can't disconnect from nodes using addnode <ip> remove, unless they were also added using addnode.

@ejfhp
Copy link
Author

ejfhp commented Jan 20, 2015

Ah.. ok. :)

@davecgh
Copy link
Member

davecgh commented Jan 20, 2015

I don't much care for that functionality either, but out goal is to be compatible with BC as much as possible to more easily accommodate existing infrastructures.

That said, I suspect allowing it to be used to also disconnect a node that wasn't added via addnode would probably not break any callers since it's additional functionality and thus deviating here might make sense.

@jrick
Copy link
Member

jrick commented Feb 9, 2015

As an additional data point, after seeing this issue again and forcibly killing the connection to my syncnode with pfctl (so it picks a new one), the problem remains.

goroutines

@jrick
Copy link
Member

jrick commented Feb 12, 2015

We've narrowed this down to btcd filtering out getblocks messages that have already been sent, where no reply has yet been received. If the reply is never received, we will never re-request the blocks, and because we remain connected to the same sync node until they disconnect, this can cause us to never process the missing parent blocks.

@ejfhp
Copy link
Author

ejfhp commented Feb 13, 2015

Great, I'm happy you've found the issue.
We need a timeout.. :)

@toddfries
Copy link
Contributor

Any new developments or ways we can help track this down in the wild? For whatever reason, my node seems prone to doing this lately.

@davecgh davecgh added the bug label Sep 22, 2015
@toddfries
Copy link
Contributor

So I've had a request to test this next time it occurs (I added the awk bits to make it simple):

btcctl node disconnect $(btcctl getpeerinfo|awk '/"id"/{sub(",","",$2);id=$2}/syncnode/{if($2 == "true") {print id}}')

Roasbeef added a commit to Roasbeef/btcd that referenced this issue Oct 2, 2015
This commit fixes a long standing bug which previously, would cause a
syncing node to get "stuck", halting all forward progress with respect
to keeping up with the main chain.

There are many nodes on the network which claim to be a fully
functioning full-node, but don't actually properly implement the entire
p2p protocol. For this particular case, after receiving a "getdata"
requesting blocks, these nodes fail to respond with a "block" message. This would
cause btcd to stall block download indefinitely, waiting for a response
which never arrives. Since btcd currently syncs new blocks via a single
syncPeer, it would never attempt to seek to new, active peers to
download the next block from. To fix this issue, we now disconnect peers
that we deem are stalling new block download after a specified timeout.

This fixes btcsuite#486 and btcsuite#229.
Roasbeef added a commit to Roasbeef/btcd that referenced this issue Oct 9, 2015
This commit fixes a long standing bug which previously, would cause a
syncing node to get "stuck", halting all forward progress with respect
to keeping up with the main chain.

There are many nodes on the network which claim to be a fully
functioning full-node, but don't actually properly implement the entire
p2p protocol. For this particular case, after receiving a "getdata"
requesting blocks, these nodes fail to respond with a "block" message. This would
cause btcd to stall block download indefinitely, waiting for a response
which never arrives. Since btcd currently syncs new blocks via a single
syncPeer, it would never attempt to seek to new, active peers to
download the next block from. To fix this issue, we now disconnect peers
that we deem are stalling new block download after a specified timeout.

This fixes btcsuite#486 and btcsuite#229.
Roasbeef referenced this issue in davecgh/btcd Oct 21, 2015
This commit implements stall detection logic at the peer level to detect
and disconnect peers that are either not following the protocol in
regards to expected response messages or have otherwise stalled.  This
is accomplished by setting deadlines for each message type which expects
a response and periodically checking them while properly taking into
account processing time.

There are an increasing number of nodes on the network which claim to be
full nodes, but don't actually properly implement the entire p2p
protocol even though they implement it enough to cause properly
implemented nodes to make data requests to which they never respond.

Since btcd currently only syncs new blocks via single sync peer and,
prior to this commit only had very basic stall detection, this could
lead to a situation where the block download became stalled indefinitely
due to one of these misbehaving peers.  This commit fixes that issue
since the stalled peer will now be detected and disconnected which leads
to a new sync peer being selected.

This logic will also fit nicely with the future multi-peer sync model
which is on the roadmap and for which infrastructure work is underway.

Closes btcsuite#486.
Roasbeef referenced this issue in davecgh/btcd Oct 21, 2015
This commit implements stall detection logic at the peer level to detect
and disconnect peers that are either not following the protocol in
regards to expected response messages or have otherwise stalled.  This
is accomplished by setting deadlines for each message type which expects
a response and periodically checking them while properly taking into
account processing time.

There are an increasing number of nodes on the network which claim to be
full nodes, but don't actually properly implement the entire p2p
protocol even though they implement it enough to cause properly
implemented nodes to make data requests to which they never respond.

Since btcd currently only syncs new blocks via single sync peer and,
prior to this commit only had very basic stall detection, this could
lead to a situation where the block download became stalled indefinitely
due to one of these misbehaving peers.  This commit fixes that issue
since the stalled peer will now be detected and disconnected which leads
to a new sync peer being selected.

This logic will also fit nicely with the future multi-peer sync model
which is on the roadmap and for which infrastructure work is underway.

Closes btcsuite#486.
davecgh added a commit to davecgh/btcd that referenced this issue Oct 22, 2015
This commit implements stall detection logic at the peer level to detect
and disconnect peers that are either not following the protocol in
regards to expected response messages or have otherwise stalled.  This
is accomplished by setting deadlines for each message type which expects
a response and periodically checking them while properly taking into
account processing time.

There are an increasing number of nodes on the network which claim to be
full nodes, but don't actually properly implement the entire p2p
protocol even though they implement it enough to cause properly
implemented nodes to make data requests to which they never respond.

Since btcd currently only syncs new blocks via single sync peer and,
prior to this commit only had very basic stall detection, this could
lead to a situation where the block download became stalled indefinitely
due to one of these misbehaving peers.  This commit fixes that issue
since the stalled peer will now be detected and disconnected which leads
to a new sync peer being selected.

This logic will also fit nicely with the future multi-peer sync model
which is on the roadmap and for which infrastructure work is underway.

Fixes btcsuite#486 and fixes btcsuite#229.
davecgh added a commit to davecgh/btcd that referenced this issue Oct 23, 2015
This commit implements stall detection logic at the peer level to detect
and disconnect peers that are either not following the protocol in
regards to expected response messages or have otherwise stalled.  This
is accomplished by setting deadlines for each message type which expects
a response and periodically checking them while properly taking into
account processing time.

There are an increasing number of nodes on the network which claim to be
full nodes, but don't actually properly implement the entire p2p
protocol even though they implement it enough to cause properly
implemented nodes to make data requests to which they never respond.

Since btcd currently only syncs new blocks via single sync peer and,
prior to this commit only had very basic stall detection, this could
lead to a situation where the block download became stalled indefinitely
due to one of these misbehaving peers.  This commit fixes that issue
since the stalled peer will now be detected and disconnected which leads
to a new sync peer being selected.

This logic will also fit nicely with the future multi-peer sync model
which is on the roadmap and for which infrastructure work is underway.

Fixes btcsuite#486 and fixes btcsuite#229.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants