Pinning large CID is "stuck" at same node #8103

Open · kallisti5 opened this issue May 2, 2021 · 12 comments
Labels: kind/bug (A bug in existing code, including security flaws) · need/triage (Needs initial labeling and prioritization)

kallisti5 commented May 2, 2021

Version information:

go-ipfs version: 0.8.0
Repo version: 11
System version: arm64/linux
Golang version: go1.15.8

I have 22 GiB+ of MFS data on a node at Vultr. The latest MFS CID is published via /ipns/hpkg.haiku-os.org.
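
(For reference, the current MFS root and what the IPNS name resolves to can be cross-checked with something like the following; just a sketch, and the resolved CID changes every time the MFS data changes.)

# on the Vultr node: print the current MFS root CID
ipfs files stat --hash /
# from anywhere: what the IPNS name currently points at
ipfs name resolve /ipns/hpkg.haiku-os.org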

Description:

  • I pin /ipns/hpkg.haiku-os.org on my home IPFS node. (either of the below, same result)
    • ipfs pin add --progress /ipns/hpkg.haiku-os.org
    • ipfs pin add --progress /ipfs/QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
  • It downloads ~20 GiB of data.
  • The pin gets stuck at "Fetched/Processed 95036 nodes" for about 2 days.
  • I restart ipfs; the ~20 GiB is deleted (the daemon runs with --enable-gc, ugh).
  • I pin /ipns/hpkg.haiku-os.org again.
  • 24 hours later there are 22 GiB of data in the repo.
  • The pin is once again stuck at "Fetched/Processed 95036 nodes", for ~12 hours now.

sysstat from the home node (while stuck pinning)

# sar 5 1
Linux 5.9.0-0.bpo.2-arm64 (ipfsgw.discord.local) 	05/02/21 	_aarch64_	(4 CPU)

13:06:38        CPU     %user     %nice   %system   %iowait    %steal     %idle
13:06:43        all      0.10     12.60      4.63      0.15      0.00     82.51
Average:        all      0.10     12.60      4.63      0.15      0.00     82.51
root@ipfsgw:~# sar -d 5 1
Linux 5.9.0-0.bpo.2-arm64 (ipfsgw.discord.local) 	05/02/21 	_aarch64_	(4 CPU)

13:05:56          DEV       tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     svctm     %util
13:06:01     dev179-0      0.20      0.80      0.00      4.00      0.00      1.00      4.00      0.08
13:06:01       dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
13:06:01      dev8-16      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Debug from home node (while stuck pinning)

  • curl localhost:5001/debug/pprof/goroutine?debug=2 > ipfs.stacks
  • curl localhost:5001/debug/pprof/profile > ipfs.cpuprof
  • curl localhost:5001/debug/pprof/heap > ipfs.heap
  • curl localhost:5001/debug/vars > ipfs.vars
  • ipfs diag sys > ipfs.sysinfo

homenode-debug.tar.gz

aschmahmann (Contributor) commented May 2, 2021

@kallisti5 could you grab a little more debugging info:

  • Vultr node
    • the debug info
    • ipfs config show
    • run ipfs dag stat --offline QmSSxDK8br4Z5UHQBXZWuXew8uP2uJ98oYT1HQAw7QrwKT to make sure it completes ok
      • If not then it means the data isn't on the node you think it's on
    • reboot the node and see if the problem resolves on the client end
      • To avoid any debugging confusion around data discovery from the client, try doing ipfs swarm connect /full/address/of/vultr/node on the client node after the reboot
  • home node
    • ipfs config show
    • ipfs bitswap stat should show some of the blocks you are waiting on; if those blocks are part of the QmSS... graph, it should be easier to determine what's going on (see the sketch after this list)
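
One rough way to cross-check that (a sketch, assuming the full DAG really is present on the Vultr node so that ipfs refs can walk it):

# on the Vultr node: list every block under the root and look for a CID from the home node's wantlist
ipfs refs -r QmSSxDK8br4Z5UHQBXZWuXew8uP2uJ98oYT1HQAw7QrwKT | grep <wanted-CID-from-the-home-node>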

This might be a duplicate of #7972, which appears to be fixed in master of go-bitswap (the dependency update isn't yet included in go-ipfs master).

@kallisti5 (Author)

Home node (wants)

ipfs@ipfsgw:~$ ipfs bitswap stat
bitswap status
        provides buffer: 0 / 256
        blocks received: 117310
        blocks sent: 12037
        data received: 23934805330
        data sent: 2561571207
        dup blocks received: 4393
        dup data received: 691026392
        wantlist [1 keys]
                Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
        partners [763]

Vultr node (has):

ipfs bitswap stat
bitswap status
	provides buffer: 0 / 256
	blocks received: 0
	blocks sent: 99009
	data received: 0
	data sent: 22941823079
	dup blocks received: 0
	dup data received: 0
	wantlist [0 keys]
	partners [785]

Vultr node (has)
(putting the wantlist CID from my home node into dag stat on the Vultr node:)

# ipfs dag stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
Size: 0, NumBlocks: 1

Vultr Node (has):

root@ipfs:~# ipfs dag stat --offline QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
Size: 5111615603, NumBlocks: 22289
(still running. I'll update here if it fails)


@aschmahmann (Contributor)

I pinged your Vultr node and it responds as not having Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD. It would be interesting to run ipfs block stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD on the Vultr node.

I noticed you updated the CID you were trying to pin in your comment. Since you have garbage collection enabled, it's possible that Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD has since been cleaned up on the Vultr node, and if the home node is still looking for it, it's not going to find it.

kallisti5 (Author) commented May 2, 2021

I pinged your Vultr node and it responds as not having Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD. It would be interesting to run ipfs block stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD on the Vultr node.

Ooh... I think this might be the root cause. I posted ipfs dag stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD from the Vultr node above, showing it did indeed have that block, though oddly empty (Size: 0, NumBlocks: 1).

Check this out (Vultr node):

root@ipfs:~# ipfs dag get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
{"data":null,"links":[]}

Here's the completed dag stat for the whole thing on the Vultr node:

ipfs dag stat --offline QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
Size: 22827570312, NumBlocks: 95529

So... locally it thinks it has that block, but remotely it reports not having it?

@aschmahmann (Contributor)

root@ipfs:~# ipfs dag get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
{"data":null,"links":[]}

This seems a bit weird to me since I'd expect an empty DagPb node to have the hash QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n (per the DagPb spec). I wonder if there is some datastore corruption (which would cause your node to send back data, but my node not to recognize it as the CID I was expecting).
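
(For comparison, and if I remember the behavior right, the expected CID can be reproduced locally, since ipfs object new creates exactly this empty DagPb node:)

# should print QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n
ipfs object new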

Could you run ipfs dag export Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > empty.car, then ipfs add the CAR file and post the CID here so I can take a look? Note: what I'm asking you to do is basically export the DAG underneath that CID (which is just a single node in this case) to a file, so we can investigate it even if it has some corruption that is preventing it from being sent over the network.

It may take a while, but it'd be helpful to run ipfs repo verify to check for repo corruption issues.

@kallisti5 (Author)

# ipfs dag export Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > empty.car
 0s  92 B / ? [...] 3.22 KiB/s 0s
# ipfs add empty.car
added QmdWr79vNxtjzVdQBvK4BVH4K94jD9g54t2BzRAuMzUjVc empty.car

Running an ipfs repo verify now.

aschmahmann (Contributor) commented May 2, 2021

❯ ipfs dag import .\QmdWr79vNxtjzVdQBvK4BVH4K94jD9g54t2BzRAuMzUjVc
Pinned root     Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD  FAILED: blockstore: block not found
Error: unable to pin all roots: 1 out of 1 failed

Ya, something weird is going on. I'm guessing a corrupted block. The CAR file contains a single block which is the empty block with the CID QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n.
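
(For reference, after a dag import the CAR's single block should be addressable under its actual hash on the importing node; something like this, just a sketch:)

# the block from the CAR, addressed by the hash of its contents
ipfs block stat --offline QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n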

@kallisti5 (Author)

Yup.

# ipfs repo verify
block QmTRQhKWeLJUDvLqxDZmBkLwJSSy7B3Er9ytAhQcDiG1Lf was corrupt (block in storage has different hash than requested)
block QmZ7rUYoMm4cpnxCHS9KBJeuUPmcyZsg4Qgi3EAjUq3iTM was corrupt (block in storage has different hash than requested)
block Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD was corrupt (block in storage has different hash than requested)

How do you even fix that when it's MFS? Is there a way to identify which files are affected and re-add them? I don't really understand how they got corrupted... the node is brand new and I just added the files without any real issues.
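
(Best guess at narrowing it down, in case it's useful to anyone else: walk the MFS tree and grep the refs of each subtree for the corrupt CIDs. Just a sketch — /some/dir stands in for whatever the MFS layout actually contains, and it assumes refs can still traverse past the corrupt blocks.)

# get the CID of an MFS subtree
ipfs files stat --hash /some/dir
# check whether that subtree references one of the corrupt blocks
ipfs refs -r <subtree-CID> | grep Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD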

@kallisti5 (Author)

To wrap this one up: I ended up just deleting the repo and re-adding the data under flatfs instead of badger. Things are smooth now.

Since this is the "seed" node for the MFS data, repo stability is important. It sounds like badger is better for user nodes, while flatfs is better for "critical data sources".
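
(For anyone landing here later: the datastore type is picked at init time, roughly as below; converting an existing repo needs the separate ipfs-ds-convert tool, if I remember right.)

ipfs init                      # default profile, flatfs datastore
ipfs init --profile=badgerds   # badger datastore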

However... this case might be useful for some enhancement? I had CIDs that nothing saw as invalid (and the MFS filesystem was fully available) until I did a full repo verify. This seems like a potentially common issue.

godcong (Contributor) commented May 11, 2021

I have had random stuck events, but after interrupting and retrying, they went away. I am using go-http-api-client. I was stuck for a whole day while pinning a CID.

Is this the same problem?
@kallisti5

kallisti5 (Author) commented Sep 7, 2021

@godcong

I have had random stuck events, but after interrupting and retrying, they went away.

There's a mix of issues here:

  • Repo corruption isn't obvious, and IPFS doesn't "notice" it until you run ipfs repo verify (which is slow)
  • When pinning large data sets, if IPFS can't find a 'valid' copy of the data it just hangs 'waiting' for it
  • Pinning progress isn't really clear (ui: Improve progress reporting on pins #8112)
  • Garbage collection enabled with a low GC threshold means that if you 'interrupt' a pin of data larger than the threshold, the node will immediately garbage collect your progress (relevant config knobs sketched below)
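
The GC knobs live in the daemon config, roughly like this (a sketch; auto-GC only runs when the daemon is started with --enable-gc, and the values shown are just examples):

ipfs config Datastore.StorageMax           # target repo size, e.g. "10GB"
ipfs config Datastore.StorageGCWatermark   # percent of StorageMax at which GC kicks in, e.g. 90
ipfs config Datastore.GCPeriod             # how often the GC check runs, e.g. "1h"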

These are the "biggest" issues we encountered trying to use IPFS at scale. They represent some really frustrating user experiences.
