avoid duplicating files added to ipfs #875

Closed
anarcat opened this issue Mar 7, 2015 · 56 comments
Labels: kind/enhancement, topic/repo

Comments

@anarcat
Contributor

anarcat commented Mar 7, 2015

It would be very useful to have files that are passed through ipfs add not copied into the datastore. For example, here I added a 3.2GB file, which meant the disk usage for that file now doubled!

Basically, it would be nice if the space usage for adding files were O(1) instead of O(n), where n is the total size of the files...

@jbenet
Member

jbenet commented Mar 7, 2015

Yep, this can be implemented as either (a) a different repo altogether, or (b) just a different datastore. It should certainly be an advanced feature, as moving or modifying the original file at all would render the objects useless, so users should definitely know what they're doing.

Note it is impossible for ipfs to monitor changes constantly, as it may be shut down when the user modifies the files, so this sort of thing requires an explicit intention to use it this way. An intermediate point might be to give ipfs a set of directories to watch/scan and make available locally; this may be CPU intensive (it may require lots of hashing on each startup, etc.).

@anarcat
Contributor Author

anarcat commented Mar 7, 2015

The way git-annex deals with this is by moving the file to a hidden directory (.git/annex/objects/[hashtree]/[hash]), making it read-only and symlinking the original file.

It's freaking annoying to have all those symlinks there, but at least there's only one copy of the file.
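
A minimal sketch of that move-and-symlink layout, for illustration only (the package, helper names, and SHA-256 backend are assumptions, and this is not something ipfs itself does):

package annexsketch

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
	"path/filepath"
)

// hashFile returns the hex SHA-256 digest of the file at path
// (git-annex supports several hash backends; SHA-256 is just for illustration).
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// annexAdd moves path into objDir under its hash, makes the object
// read-only, and leaves a symlink behind at the original location.
// objDir should be an absolute path so the symlink stays valid.
func annexAdd(path, objDir string) error {
	sum, err := hashFile(path)
	if err != nil {
		return err
	}
	obj := filepath.Join(objDir, sum[:2], sum) // shard objects by hash prefix
	if err := os.MkdirAll(filepath.Dir(obj), 0755); err != nil {
		return err
	}
	if err := os.Rename(path, obj); err != nil {
		return err
	}
	if err := os.Chmod(obj, 0444); err != nil {
		return err
	}
	return os.Symlink(obj, path)
}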

@whyrusleeping whyrusleeping added the kind/enhancement label Mar 9, 2015
@MichaelMure
Contributor

ipfs could track files the same way a media player tracks its media collection:

  • track files in the background, possibly with a low OS priority
  • do a complete check (hash the file) on demand, when the file is requested by the network or the user
  • quickly invalidate already-shared files by checking whether they still exist on disk and whether the file size changed (sketched below)
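
A minimal sketch of that cheap invalidation check (type and field names are assumptions, not an ipfs API): remember the size and mtime recorded at index time and consider the entry stale if either changed; only re-hash when the block is actually requested.

package trackersketch

import "os"

// Entry records what was known about a file when it was indexed.
type Entry struct {
	Path    string
	Size    int64
	ModTime int64 // Unix seconds at index time
}

// StillValid is the quick check: the file must still exist and its size
// and modification time must be unchanged. A full re-hash is only done
// on demand, when the block is requested by the network or the user.
func StillValid(e Entry) bool {
	fi, err := os.Stat(e.Path)
	if err != nil {
		return false // missing or unreadable: invalidate
	}
	return fi.Size() == e.Size && fi.ModTime().Unix() == e.ModTime
}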

@vitzli
Contributor

vitzli commented Mar 12, 2015

Hello, my name is vitzli and I'm a dataholic
I am just a user and I operate several private/LAN repositories for open source projects; this gives me the ability to update my operating systems and speed up VM deployment when the Internet connection doesn't work very well. Right now the Debian repository is approximately 173 GB and 126,000 files; the Debian images are about 120 GB and I share them over BitTorrent (I'm using jigdo-lite to build them from the repository and download the difference between the current repo and the required template from the mirror). While I prefer to use public and official torrent trackers, some projects, like FreeBSD, do not offer torrent files, so I get them over private trackers.
The same debian/centos images are there too, and I don't mind sharing them for some sweet ratio. There is no need for them to be writable, so they are kept owned by root with 644 permissions. Unfortunately, people combine several images into one torrent, which breaks the infohash (and the DHT swarms get separated too), so I have to keep two copies of the iso images (I symlink/hardlink them to cope with that). As far as I understand, this won't be an issue with ipfs, but I would really like to keep those files where they are (as files on a read-only root:644 partition, not symlinks/hardlinks; potentially they could be mounted over the local network). If ipfs could be used to clone/copy/provide a CDN for the Internet Archive and the Archive Team, the problems would be similar. Here is my list of demands, dealbreakers and thoughts for an ipfs addref command (or whatever it may be called):

  • I assume that:
    1. files in storage external to ipfs are read much more often than they are written;
    2. probably nobody would like to restructure their existing storage for ipfs;
    3. I use "file" to mean both files and directories;
    4. I use B for bytes and b for bits;
  • files should be referenced in ipfs, not the other way around; there are ipns/ipfs mount points, but I need to think/read/practice more about that.
  • files are stored as files in an ext4/ZFS/XFS filesystem with an arbitrary directory structure; it would more likely than not be read-only, or mounted on a read-only partition, or:
  • files are accessed over a mounted network directory (NAS/SMB, NFS, Ceph, something else); that box could (should, really) be accessed in read-only mode, because being hit by a cryptolocker capable of encrypting network-attached storage is a hell of a day;
  • personally, I'm okay with 1 kB of ipfs storage on average per 256 kB of referenced data: this gives 1.1 GB of ipfs storage per 300 GB of referenced files, 39 GB of ipfs storage per 10 TB of files, and 273 GB per 70 TB; I could live with that, but it could be less;
  • the ability to put files into a fast SSD-like cache (configurable per file, root node, or source directory? this seems like a related feature, but it could/should be offloaded to the underlying filesystem);
  • I am sorry for the harsh words, but rehashing referenced files on startup is unacceptable: for 200k files and a 400 GB repository it may take tens of minutes (and I don't want to think about rehashing 60 TB of data), and even trivially checking size and modification/creation date would be slow-ish (maybe a 'check file's hash on request' flag for files?). I would, however, agree with rehashing on demand and when a file is first referenced;
  • I have no opinion on tracking files in the background; it may be a feature, but I have no idea how it would look performance-wise; it could be a million text/small-ish binary files, so… providing a centos+fedora+epel+debian+ubuntu+freebsd mirror to ipfs would probably break the 1 million files barrier and 1 TB in size;
  • [very unlikely] the ability to pull a file/directory from ipfs storage and reference it back. It could be split into get and addref tasks; this seems excessive, but somebody may ask for it.

@rubiojr

rubiojr commented Apr 8, 2015

Here's a disk usage plot when adding a large (~3.4 GiB) file:

[1059][rubiojr@octox] ./df-monitor.sh ipfs add ~/mnt/octomac/Downloads/VMware-VIMSetup-all-5.5.0-1623099-20140201-update01.iso
1.51 GB / 3.34 GB [===============================>--------------------------------------] 45.13 % 5m6s
Killed

[storage_usage plot]

~12 GiB used while adding the file.

Halfway through I had to kill ipfs because I was running out of space.

Somewhat related: is there a way to clean up the partially added stuff after killing ipfs add?

UPDATE: it seems that ipfs repo gc helps a bit with the cleanup, but does not recover all the space.

@rubiojr

rubiojr commented Apr 8, 2015

A couple of extra notes about the disk usage:

  • The file was the first one added after ipfs init && ipfs daemon
  • If I re-add the file after killing the first ipfs add and running ipfs repo gc, the file is added correctly, using only the required disk space:
[1029][rubiojr@octox] du -sh ~/.go-ipfs/datastore/
3,4G    /home/rubiojr/.go-ipfs/datastore/
  • If I add another large file after the first one, the disk space used during the operation is roughly the same as the size of the file added (which is expected, I guess).

Anyway, I've heard you guys are working on a new repo backend, so I just added this for the sake of completeness.

@whyrusleeping
Member

@rubiojr the disk space is being consumed by the eventlogs, which are on my short list of things to remove from ipfs. Check ~/.go-ipfs/logs

@rubiojr

rubiojr commented Apr 8, 2015

@whyrusleeping not in this case apparently:

[~/.go-ipfs]
[1106][rubiojr@octox] du -h --max-depth 1
12G ./datastore
5,8M    ./logs
12G .
[~/.go-ipfs/datastore]
[1109][rubiojr@octox] ls -latrS *.ldb|wc -l
6280
[~/.go-ipfs/datastore]
[1112][rubiojr@octox] ls -latrSh *.ldb|tail -n5
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr  8 22:59 000650.ldb
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr  8 23:00 002678.ldb
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr  8 23:02 005705.ldb
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr  8 23:01 004332.ldb
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr  8 23:00 001662.ldb

6280 ldb files averaging 3.8 MB each. This is while adding a 1.7 GiB file and killing the process before ipfs add finishes; it was the first ipfs add after running ipfs daemon -init.

@rubiojr

rubiojr commented Apr 8, 2015

The leveldb files did not average 3.8 MiB each; some of them were smaller, in fact. My bad.

@whyrusleeping
Member

Wow. That sucks. But it should be fixed quite soon; I just finished the migration tool to move block storage out of leveldb.

@jbenet
Member

jbenet commented Aug 5, 2015

Since this is a highly requested feature, can we get some proposals for how it would work with the present fsrepo?

@cryptix
Contributor

cryptix commented Aug 5, 2015

My proposal would be a shallow repo that acts like the index of a torrent file: it assumes it can serve a block until it actually tries to open the file from the underlying file system.

I'm not sure how to manage chunking. Saving (hash)->(file path, offset) should be fine, I guess?

@loadletter

Saving (hash)->(file path, offset) should be fine

Something like (hash)->(file path, mtime, offset) would help checking if the file was changed.

@whyrusleeping
Member

Something like (hash)->(path, offset, length) is what we would need, and we'd rehash the data upon read to ensure the hash matches.
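
A minimal sketch of such an index entry and a verifying read (the struct, the sha256 check, and the names are assumptions for illustration, not the eventual filestore API):

package filestoresketch

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"os"
)

// BlockRef is the (hash) -> (path, offset, length) record discussed above,
// plus the mtime suggested earlier for cheap invalidation.
type BlockRef struct {
	Hash    []byte // hash of the block data (sha256 here for illustration)
	Path    string
	Offset  int64
	Length  int64
	ModTime int64
}

// Read pulls the block bytes back out of the underlying file and re-hashes
// them, so a modified or truncated file cannot serve a corrupt block.
func (r BlockRef) Read() ([]byte, error) {
	f, err := os.Open(r.Path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, r.Length)
	if _, err := f.ReadAt(buf, r.Offset); err != nil {
		return nil, err
	}
	sum := sha256.Sum256(buf)
	if !bytes.Equal(sum[:], r.Hash) {
		return nil, errors.New("block data changed on disk; reference is stale")
	}
	return buf, nil
}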

@jbenet
Member

jbenet commented Aug 5, 2015

Piecing it together with the repo is trickier. Maybe it can be a special datastore that stores this index info in the flatfs but delegates looking up the blocks to the files on disk. Something like:

// in shallowfs
// stores things only under dataRoot. dataRoot could be `/`.
// stores paths, offsets, and a hash in metadataDS.
func New(dataRoot string, metadataDS ds.Datastore) { ... }

// use
fds := flatfs.New(...) 
sfs := shallowfs.New("/", fds)

@whyrusleeping
Member

would be cool if linux supported symlinks to segments of a file...

@davidar
Member

davidar commented Aug 8, 2015

Perhaps separating out the indexing operation (updating the hash->file-segment map) from actually adding files to the repo might work? The indexing could be done mostly separately from ipfs, and you'd be able to manually control what needs to be (re-)indexed. The blockstore then checks if the block has been indexed already (or passes through to the regular datastore otherwise).
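
A rough sketch of that layering, continuing the BlockRef sketch above (the interfaces here are simplified stand-ins, not the real go-ipfs blockstore API): check the file index first and fall back to the regular datastore-backed store on a miss or a stale reference.

package filestoresketch

// Blockstore is a simplified stand-in for the real blockstore interface.
type Blockstore interface {
	Get(hash []byte) ([]byte, error)
}

// Index maps a block hash to its BlockRef (see the sketch above); it is
// maintained by the separate, manually controlled indexing step.
type Index interface {
	Lookup(hash []byte) (ref BlockRef, ok bool)
}

// IndexedBlockstore serves indexed blocks straight from the original
// files on disk and passes everything else through to the normal store.
type IndexedBlockstore struct {
	idx  Index
	base Blockstore
}

func (bs *IndexedBlockstore) Get(hash []byte) ([]byte, error) {
	if ref, ok := bs.idx.Lookup(hash); ok {
		if data, err := ref.Read(); err == nil {
			return data, nil
		}
		// The reference is stale (file moved or edited); fall through.
	}
	return bs.base.Get(hash)
}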

@striepan

striepan commented Oct 8, 2015

Copy-on-write filesystems with native deduplication can be relevant here. For example https://btrfs.wiki.kernel.org

Copying files just adds a little metadata; the data extents are shared. I can use this with big torrents: I can edit the files while still being a good citizen and seeding the originals. The additional disk space used is only the size of the edits.

symlinks to segments of a file

are just files sharing extents

When adding a file that is already in the datastore you could trigger deduplication and save some space!

I am sure there are a lot of other more or less obvious ideas, and some crazier ones, like using union mounts (unionfs/aufs) with ipfs as a read-only fs and a read-write fs mounted over it for network live-distro installation, or combining this with the other VM stuff floating around here.

@jbenet
Member

jbenet commented Oct 8, 2015

@striepan indeed! this all sounds good.

If anyone wants to look into making an fs-repo implementation patch, this could come sooner. (right now this is lower prio than other important protocol things.)

@ghost ghost added the topic/repo label Dec 3, 2015
@hmeine

hmeine commented Mar 1, 2016

I agree with @striepan; I even believe that copy-on-write filesystems are the solution to this problem. What needs to be done in ipfs, though, is to make sure that the right modern API (a kernel ioctl) is used so the copy is efficient. Probably go-ipfs just uses the native Go APIs for copying, so we should eventually benefit from Go supporting recent Linux kernels, right? Can anybody here give a definite status report on that?
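
For reference, the relevant Linux interface is the FICLONE ioctl (what cp --reflink uses). A hedged sketch of invoking it from Go via the golang.org/x/sys/unix wrapper, assuming a filesystem with reflink support such as btrfs or XFS (this is not what go-ipfs currently does):

package reflinksketch

import (
	"os"

	"golang.org/x/sys/unix"
)

// refCopy clones src into dst with the FICLONE ioctl, so the new file
// shares extents with the original and only metadata is written.
// It fails on filesystems without reflink support (e.g. ext4).
func refCopy(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	return unix.IoctlFileClone(int(out.Fd()), int(in.Fd()))
}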

@Mithgol

Mithgol commented Mar 10, 2016

What would happen on Windows? (Are there any copy-on-write filesystems on Windows?)

@Kubuxu
Member

Kubuxu commented Apr 14, 2016

@Mithgol not really; ipfs cat $HASH | ipfs add can give you a different resulting hash than the input. This is due to changes in encoding, sharding and so on.

This will be especially visible when IPLD is implemented as there will be two encodings active in the network.

@Mithgol

Mithgol commented Apr 15, 2016

@Kubuxu

Are you implying that I can't ipfs get a file, then clear the cache, then ipfs add it again and deliver it? That, because the newer IPFS hash might be different, my file might not be discovered by the people using an older IPFS hash and the previously created hyperlinks to it?

If that is so, then that should be seen (and treated) literally as a part of the issue currently titled “avoid duplicating files added to ipfs”.

If two files are the same (content-wise), then their distribution and storage should be united in IPFS.

Otherwise storage and distribution efforts are doubled and wasted. Also, elements of the (so-called) Permanent Web are suddenly not really permanent: once lost, they are effectively designed never to be found again, because even if someone somewhere discovers such a lost file in an offline archive and decides to upload it to the Permanent Web, it is likely to yield a different IPFS hash, and thus an old hyperlink (which references the original IPFS hash) is still doomed to remain broken forever.

If encodings and shardings and IPLD and maybe a dozen other inner principles make it inevitable for the same file to have different IPFS hashes, then maybe yet another DHT should be added to the system, mapping IPFS hashes to, for example, plain cryptographic hashes (and vice versa); then some subsystem would be able to deduplicate the distribution and storage of identical files, and lost files could reappear in the network after being re-uploaded.

@Mithgol

Mithgol commented Apr 15, 2016

However, while this problem should be seen (and treated) literally as a part of the issue currently titled “avoid duplicating files added to ipfs”, this issue is still about deduplicating on disk. It probably is not wise to broaden its discussion here.

I've decided to open yet another issue (ipfs/notes#126) to discuss the advantages (or maybe the necessity) of each file having only one address determined by its content.

@whyrusleeping
Member

@kevina you will need to perform the adds without the daemon running, because the daemon and the client aren't necessarily on the same machine. If I try to 'zero-copy add' a file client-side and tell the daemon about it, the daemon has no idea what file I'm talking about and has no reasonable way to reference that file.

@kevina
Contributor

kevina commented Apr 18, 2016

Just FYI: I am making good progress on this. The first implementation will basically implement the --no-copy option proposed by @jefft0.

You can find my code at https://github.com/kevina/go-ipfs/tree/issue-875-wip. Expect lots of forced updates on this branch.

@kevina
Contributor

kevina commented May 6, 2016

Sorry for all this noise. It seems GitHub keeps commits around forever, even after a forced update. I will avoid using issue mentions in most of the commits to avoid this problem. :)

@kevina
Contributor

kevina commented May 6, 2016

The code is now available at https://github.com/ipfs-filestore/go-ipfs/tree/kevina/filestore and is being discussed in pull request #2634.

@kevina
Contributor

kevina commented May 10, 2016

Because this is a major change that might be too big for a single pull request, I decided to maintain this as a separate fork while I work through the API issues with whyrusleeping.

I have created a README for the new filestore, available at https://github.com/ipfs-filestore/go-ipfs/blob/kevina/filestore/filestore/README.md, and some notes on my fork are available at https://github.com/ipfs-filestore/go-ipfs/wiki.

At this point I could use testers.

@iav
Contributor

iav commented Jun 22, 2016

The GreyLink DC++ client uses an extended NTFS file attribute to store the TigerTree (Merkle-dag) hash of a file: http://p2p.toom.su/gltth. This makes it possible to avoid rehashing a file when it is re-added, and to check and recover a broken file from the network if copies are available.
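
A sketch of an analogous trick on Linux, caching a precomputed hash in an extended attribute via golang.org/x/sys/unix (the attribute name is made up, and the cached value would still need to be invalidated if the file's size or mtime change):

package xattrsketch

import "golang.org/x/sys/unix"

const hashAttr = "user.ipfs.sha256" // hypothetical attribute name

// storeCachedHash writes a precomputed hash into an extended attribute
// so the file does not have to be re-hashed on re-addition.
func storeCachedHash(path string, sum []byte) error {
	return unix.Setxattr(path, hashAttr, sum, 0)
}

// loadCachedHash reads the cached hash back, returning false if it is
// absent or unreadable.
func loadCachedHash(path string) ([]byte, bool) {
	buf := make([]byte, 64)
	n, err := unix.Getxattr(path, hashAttr, buf)
	if err != nil || n <= 0 {
		return nil, false
	}
	return buf[:n], true
}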

@whyrusleeping
Member

The filestore code has been merged, and will be shipped in 0.4.7 (try out the release candidate here: https://dist.ipfs.io/go-ipfs/v0.4.7-rc1)

For some notes and usage instructions, see this comment: #3397 (comment)

This issue can finally be closed :)

@mycripto11116

Why do I get "ERROR: merkledag node was not a directory or shard" while trying to add a file to ipfs? Can anyone help please?

@whyrusleeping
Member

@mycripto11116 Could you open a new issue and describe what steps you take to reproduce the issue?

@mycripto11116

mycripto11116 commented Apr 21, 2017 via email

@ghost

ghost commented Apr 21, 2017

Blocks are not files; for blocks you'll have to use ipfs block get <hash>
