Pinning large CID is "stuck" at same node #8103

Open · kallisti5 opened this issue May 2, 2021 · 12 comments
Labels: kind/bug (A bug in existing code, including security flaws) · need/triage (Needs initial labeling and prioritization)

kallisti5 commented May 2, 2021

Version information:

go-ipfs version: 0.8.0
Repo version: 11
System version: arm64/linux
Golang version: go1.15.8

I have 22 GiB+ of MFS data on a node at Vultr. The latest MFS CID is published via /ipns/hpkg.haiku-os.org.
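
(For reference, the current MFS root and what the IPNS name resolves to can be cross-checked with something like the following; just a sketch, and the resolved CID changes every time the MFS data changes.)

# on the Vultr node: print the current MFS root CID
ipfs files stat --hash /
# from anywhere: what the IPNS name currently points at
ipfs name resolve /ipns/hpkg.haiku-os.org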

Description:

  • I pin /ipns/hpkg.haiku-os.org on my home IPFS node. (either of the below, same result)
    • ipfs pin add --progress /ipns/hpkg.haiku-os.org
    • ipfs pin add --progress /ipfs/QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
  • It downloads ~20 GiB of data.
  • The pin gets stuck at "Fetched/Processed 95036 nodes" for about 2 days.
  • I restart ipfs; the ~20 GiB is deleted (the daemon runs with --enable-gc, ugh).
  • I pin /ipns/hpkg.haiku-os.org again.
  • 24 hours later there are 22 GiB of data in the repo.
  • The pin is once again stuck at "Fetched/Processed 95036 nodes", for ~12 hours now.

sysstat from the home node (while stuck pinning)

# sar 5 1
Linux 5.9.0-0.bpo.2-arm64 (ipfsgw.discord.local) 	05/02/21 	_aarch64_	(4 CPU)

13:06:38        CPU     %user     %nice   %system   %iowait    %steal     %idle
13:06:43        all      0.10     12.60      4.63      0.15      0.00     82.51
Average:        all      0.10     12.60      4.63      0.15      0.00     82.51
root@ipfsgw:~# sar -d 5 1
Linux 5.9.0-0.bpo.2-arm64 (ipfsgw.discord.local) 	05/02/21 	_aarch64_	(4 CPU)

13:05:56          DEV       tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     svctm     %util
13:06:01     dev179-0      0.20      0.80      0.00      4.00      0.00      1.00      4.00      0.08
13:06:01       dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
13:06:01      dev8-16      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Debug from home node (while stuck pinning)

  • curl localhost:5001/debug/pprof/goroutine?debug=2 > ipfs.stacks
  • curl localhost:5001/debug/pprof/profile > ipfs.cpuprof
  • curl localhost:5001/debug/pprof/heap > ipfs.heap
  • curl localhost:5001/debug/vars > ipfs.vars
  • ipfs diag sys > ipfs.sysinfo

homenode-debug.tar.gz

aschmahmann (Contributor) commented May 2, 2021

@kallisti5 could you grab a little more debugging info:

  • Vultr node
    • the debug info
    • ipfs config show
    • run ipfs dag stat --offline QmSSxDK8br4Z5UHQBXZWuXew8uP2uJ98oYT1HQAw7QrwKT to make sure it completes ok
      • If not then it means the data isn't on the node you think it's on
    • reboot the node and see if the problem resolves on the client end
      • To avoid any debugging confusion around data discovery from the client, try doing ipfs swarm connect /full/address/of/vultr/node on the client node after the reboot
  • home node
    • ipfs config show
    • ipfs bitswap stat should show some of the blocks you are waiting on; if those blocks are part of the QmSS... graph, it should be easier to determine what's going on (see the sketch after this list)
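
One rough way to cross-check that (a sketch, assuming the full DAG really is present on the Vultr node so that ipfs refs can walk it):

# on the Vultr node: list every block under the root and look for a CID from the home node's wantlist
ipfs refs -r QmSSxDK8br4Z5UHQBXZWuXew8uP2uJ98oYT1HQAw7QrwKT | grep <wanted-CID-from-the-home-node>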

This might be a duplicate of #7972, which appears to be fixed in master of go-bitswap (the dependency update isn't yet included in go-ipfs master).

@kallisti5 (Author)

Home node (wants)

ipfs@ipfsgw:~$ ipfs bitswap stat
bitswap status
        provides buffer: 0 / 256
        blocks received: 117310
        blocks sent: 12037
        data received: 23934805330
        data sent: 2561571207
        dup blocks received: 4393
        dup data received: 691026392
        wantlist [1 keys]
                Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
        partners [763]

Vultr node (has):

ipfs bitswap stat
bitswap status
	provides buffer: 0 / 256
	blocks received: 0
	blocks sent: 99009
	data received: 0
	data sent: 22941823079
	dup blocks received: 0
	dup data received: 0
	wantlist [0 keys]
	partners [785]

Vultr node (has)
(putting the wantlist CID from my home node into dag stat on the Vultr node:)

# ipfs dag stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
Size: 0, NumBlocks: 1

Vultr Node (has):

root@ipfs:~# ipfs dag stat --offline QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
Size: 5111615603, NumBlocks: 22289
(still running. I'll update here if it fails)


@aschmahmann (Contributor)

I pinged your Vultr node and it responds as not having Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD. It would be interesting to run ipfs block stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD on the Vultr node.

I noticed you updated the CID you were trying to pin in your comment. Since you have garbage collection enabled, it's possible that Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD has since been cleaned up on the Vultr node, and if the home node is still looking for it, it's not going to find it.

kallisti5 (Author) commented May 2, 2021

I pinged your Vultr node and it responds as not having Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD. It would be interesting to run ipfs block stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD on the Vultr node.

Ooh... I think this might be the root cause. I posted ipfs dag stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD from the Vultr node above, showing it did indeed have that block, though oddly empty (Size: 0, NumBlocks: 1).

Check this out (Vultr node):

root@ipfs:~# ipfs dag get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
{"data":null,"links":[]}

Here's the completed dag stat for the whole thing on the Vultr node:

ipfs dag stat --offline QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
Size: 22827570312, NumBlocks: 95529

So... locally it thinks it has that block, but remotely it reports not having it?

@aschmahmann (Contributor)

root@ipfs:~# ipfs dag get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
{"data":null,"links":[]}

This seems a bit weird to me since I'd expect an empty DagPb node to have the hash QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n (per the DagPb spec). I wonder if there is some datastore corruption (which would cause your node to send back data, but my node not to recognize it as the CID I was expecting).
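
(For comparison, and if I remember the behavior right, the expected CID can be reproduced locally, since ipfs object new creates exactly this empty DagPb node:)

# should print QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n
ipfs object new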

Could you run ipfs dag export Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > empty.car, then ipfs add the CAR file and post the CID here so I can take a look? Note: what I'm asking you to do is basically export the DAG underneath that CID (which is just a single node in this case) to a file, so we can investigate it even if it has some corruption that is preventing it from being sent over the network.

It may take a while, but it'd be helpful to run ipfs repo verify to check for repo corruption issues.

@kallisti5 (Author)

# ipfs dag export Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > empty.car
 0s  92 B / ? [...] 3.22 KiB/s 0s
# ipfs add empty.car
added QmdWr79vNxtjzVdQBvK4BVH4K94jD9g54t2BzRAuMzUjVc empty.car

Running an ipfs repo verify now.

aschmahmann (Contributor) commented May 2, 2021

❯ ipfs dag import .\QmdWr79vNxtjzVdQBvK4BVH4K94jD9g54t2BzRAuMzUjVc
Pinned root     Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD  FAILED: blockstore: block not found
Error: unable to pin all roots: 1 out of 1 failed

Ya, something weird is going on. I'm guessing a corrupted block. The CAR file contains a single block which is the empty block with the CID QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n.
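
(For reference, after a dag import the CAR's single block should be addressable under its actual hash on the importing node; something like this, just a sketch:)

# the block from the CAR, addressed by the hash of its contents
ipfs block stat --offline QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n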

@kallisti5 (Author)

Yup.

# ipfs repo verify
block QmTRQhKWeLJUDvLqxDZmBkLwJSSy7B3Er9ytAhQcDiG1Lf was corrupt (block in storage has different hash than requested)
block QmZ7rUYoMm4cpnxCHS9KBJeuUPmcyZsg4Qgi3EAjUq3iTM was corrupt (block in storage has different hash than requested)
block Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD was corrupt (block in storage has different hash than requested)

How do you even fix that when it's MFS? Is there a way to identify which files are affected and re-add them? I don't really understand how they got corrupted... the node is brand new and I just added the files without any real issues.
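
(Best guess at narrowing it down, in case it's useful to anyone else: walk the MFS tree and grep the refs of each subtree for the corrupt CIDs. Just a sketch — /some/dir stands in for whatever the MFS layout actually contains, and it assumes refs can still traverse past the corrupt blocks.)

# get the CID of an MFS subtree
ipfs files stat --hash /some/dir
# check whether that subtree references one of the corrupt blocks
ipfs refs -r <subtree-CID> | grep Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD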

@kallisti5 (Author)

To wrap this one up: I ended up just deleting the repo and re-adding the data under flatfs instead of badger. Things are smooth now.

Since this is the "seed" node for the MFS data, repo stability is important. It sounds like badger is better for user nodes, while flatfs is better for "critical data sources".
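
(For anyone landing here later: the datastore type is picked at init time, roughly as below; converting an existing repo needs the separate ipfs-ds-convert tool, if I remember right.)

ipfs init                      # default profile, flatfs datastore
ipfs init --profile=badgerds   # badger datastore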

However... this case might be useful for some enhancement? I had CIDs that nothing saw as invalid (and the MFS filesystem was fully available) until I did a full repo verify. This seems like a potentially common issue.

godcong (Contributor) commented May 11, 2021

I have had random stuck events, but after interrupting and retrying, they went away. I am using go-http-api-client. I was stuck for a whole day while pinning a CID.

Is this the same problem?
@kallisti5

kallisti5 (Author) commented Sep 7, 2021

@godcong

I have had random stuck events, but after interrupting and retrying, they went away.

There's a mix of issues here:

  • Repo corruption isn't obvious, and IPFS doesn't "notice" it until you run ipfs repo verify (which is slow)
  • When pinning large data sets, if IPFS can't find a 'valid' copy of the data it just hangs 'waiting' for it
  • Pinning progress isn't really clear (ui: Improve progress reporting on pins #8112)
  • Garbage collection enabled with a low GC threshold means that if you 'interrupt' a pin of data larger than the threshold, the node will immediately garbage collect your progress (relevant config knobs sketched below)
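
The GC knobs live in the daemon config, roughly like this (a sketch; auto-GC only runs when the daemon is started with --enable-gc, and the values shown are just examples):

ipfs config Datastore.StorageMax           # target repo size, e.g. "10GB"
ipfs config Datastore.StorageGCWatermark   # percent of StorageMax at which GC kicks in, e.g. 90
ipfs config Datastore.GCPeriod             # how often the GC check runs, e.g. "1h"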

These are the "biggest" issues we encountered trying to use IPFS at scale. They represent some really frustrating user experiences.
