---
title: Data import in Kubo
description: Learn about the data import commands in Kubo, their options, and the tradeoffs between different chunking strategies and DAG layouts.
---

# Options and tradeoffs around data import in Kubo

IPFS Kubo provides multiple data import commands:

- `ipfs add`
- `ipfs dag put`
- `ipfs block put`

While many users are fine with the default chunker and hash functions for these commands, `ipfs add` supports a much wider range of options:


- Custom `--chunker` and `--hash` functions
- Using a `TrickleDAG` instead of a MerkleDAG with the `--trickle` option
- `--raw-leaves`

## Defaults

### Chunk size

The current default chunk size is `256` * `1024` bytes, or `262144` bytes.
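As a quick conceptual sketch (not Kubo's Go implementation), fixed-size chunking simply slices a file into consecutive blocks of the default size:

```python
# Illustrative sketch only: split bytes into fixed-size chunks,
# using Kubo's default chunk size of 256 * 1024 = 262144 bytes.
DEFAULT_CHUNK_SIZE = 256 * 1024

def fixed_size_chunks(data: bytes, size: int = DEFAULT_CHUNK_SIZE):
    """Yield successive chunks of at most `size` bytes."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]

# A 1 MiB payload splits into four 256 KiB chunks.
chunks = list(fixed_size_chunks(b"\x00" * (1024 * 1024)))
print(len(chunks), len(chunks[0]))  # 4 262144
```

Note that with a fixed-size chunker, inserting a single byte at the start of a file shifts every subsequent chunk boundary, so all following blocks hash differently.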

### Chunking algorithm

The default chunking algorithm is a fixed-size chunker, `size-262144`, which splits a file into consecutive blocks of 262144 bytes.

## Options

### --trickle

The `--trickle` (`-t`) option changes the default DAG type from MerkleDAG to TrickleDAG. Unlike a MerkleDAG, which uses a tree structure, a TrickleDAG is structured as a list of lists.

As such, TrickleDAGs are optimized for reading data in sequence, whereas MerkleDAGs are optimized for random access.
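The random-access advantage of a balanced tree can be sketched with a little arithmetic. This is a loose illustration only, not Kubo's DAG builder; the fanout of 174 links per block is an assumption based on the commonly cited Kubo default:

```python
import math

# Conceptual sketch: in a balanced (Merkle) tree of fanout f over
# n leaf chunks, reaching any byte offset costs about log_f(n) link
# hops, so random access stays cheap. A trickle layout instead chains
# sublists together, which favors reading front-to-back.
def balanced_depth(n_chunks: int, fanout: int = 174) -> int:
    """Levels of internal nodes above the leaves in a balanced tree."""
    if n_chunks <= 1:
        return 0
    return math.ceil(math.log(n_chunks, fanout))

# A 1 GiB file at the default 256 KiB chunk size has 4096 chunks:
print(balanced_depth(4096))  # 2 levels, since 4096 <= 174**2
```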

### --chunker

The `--chunker` (`-s`) option dictates the chunking strategy; in other words, how to break files into blocks. Blocks with the same content can be deduplicated. Different chunking strategies will produce different hashes for the same file. The default is a fixed block size of 256 * 1024 bytes, `size-262144`. Alternatively, you can use the Buzhash or Rabin fingerprint chunker for content-defined chunking by specifying `buzhash` or `rabin-[min]-[avg]-[max]` (where min/avg/max refer to the desired chunk sizes in bytes), e.g. `rabin-262144-524288-1048576`.
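To make the idea of content-defined chunking concrete, here is a toy chunker in Python. It is illustrative only; Kubo's Rabin and Buzhash chunkers use proper rolling fingerprints, and the hash below is a naive stand-in:

```python
# Toy content-defined chunker: cut wherever a rolling-style hash
# matches a mask, bounded by min/max chunk sizes. Because boundaries
# depend on content rather than position, inserting bytes near the
# start of a file shifts only nearby boundaries instead of
# re-chunking everything after the insertion point.
def cdc_chunks(data: bytes, min_size: int = 512, avg_size: int = 1024,
               max_size: int = 2048) -> list:
    mask = avg_size - 1                       # works when avg_size is a power of two
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) ^ byte) & 0xFFFFFFFF    # naive rolling hash, not Rabin
        length = i - start + 1
        if length >= max_size or (length >= min_size and h & mask == 0):
            chunks.append(data[start:i + 1])  # boundary: emit a chunk
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])           # trailing partial chunk
    return chunks

data = bytes(range(256)) * 40                 # 10240 bytes of sample content
chunks = cdc_chunks(data)
assert b"".join(chunks) == data               # chunks reassemble losslessly
```

Identical content regions tend to produce identical chunks under such a scheme, which is what makes content-defined chunking attractive for deduplication.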

The following examples use very small byte sizes to demonstrate the properties of the different chunkers on a small file. You'll likely want to use chunk sizes around 1024 times larger for most files.

```
> ipfs add --chunker=size-2048 ipfs-logo.svg
added QmafrLBfzRLV4XSH1XcaMMeaXEUhDJjmtDfsYU95TrWG87 ipfs-logo.svg
> ipfs add --chunker=rabin-512-1024-2048 ipfs-logo.svg
added Qmf1hDN65tR55Ubh2RN1FPxr69xq3giVBz1KApsresY8Gn ipfs-logo.svg
```


You can now check which blocks have been created:

```
> ipfs object links QmafrLBfzRLV4XSH1XcaMMeaXEUhDJjmtDfsYU95TrWG87
QmY6yj1GsermExDXoosVE3aSPxdMNYr6aKuw3nA8LoWPRS 2059
Qmf7ZQeSxq2fJVJbCmgTrLLVN9tDR9Wy5k75DxQKuz5Gyt 1195
> ipfs object links Qmf1hDN65tR55Ubh2RN1FPxr69xq3giVBz1KApsresY8Gn
QmY6yj1GsermExDXoosVE3aSPxdMNYr6aKuw3nA8LoWPRS 2059
QmerURi9k4XzKCaaPbsK6BL5pMEjF7PGphjDvkkjDtsVf3 868
QmQB28iwSriSUSMqG2nXDTLtdPHgWb4rebBrU7Q1j4vxPv 338
```

Finally, a note on hash determinism. While not guaranteed, adding the same
file/directory with the same flags will almost always result in the same output
hash. However, almost all of the flags provided by this command (other than pin,
only-hash, and progress/status related flags) will change the final hash.