This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

Get all chunk references for a given file #1185

Closed · wants to merge 5 commits
Conversation

holisticode (Contributor):

This PR adds an endpoint to FileStore that returns the list of chunk hashes for a given file.


// testRuns[i] holds a data size; expectedLens[i] holds the expected number of references
testRuns := []int{1024, 8192, 16000, 30000, 1000000}
expectedLens := []int{1, 3, 5, 9, 248}
@holisticode (Contributor, Author), Feb 4, 2019:

For the test run with a 1000000 data size, I sometimes get 247 and sometimes 248 references....

WHY IS THAT??? @zelig @nolash ?

@nolash (Contributor), Feb 4, 2019:

I am only aware of one bug in the pyramid chunker, which occurs - as far as I remember, although this was a while ago - when all batches in a tree are filled plus one extra chunk. Neither 247 nor 248 matches the chunk count of such a configuration.

Meanwhile, for 1000000 bytes 248 should be correct:

ceil(1000000/4096) = 245
ceil(245/128) = 2

245 data chunks + 2 pointer chunks + 1 root chunk = 248
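The arithmetic above can be checked against all of the test sizes in the PR with a small sketch; `totalChunks` is an illustrative helper, not a function from the codebase, assuming a 4096-byte chunk size and a branching factor of 128:

```go
package main

import "fmt"

// totalChunks returns the total number of chunks (data + intermediate +
// root) a tree chunker should produce for a file of dataSize bytes.
// Illustrative sketch; not taken from the swarm codebase.
func totalChunks(dataSize, chunkSize, branches int) int {
	if dataSize <= chunkSize {
		return 1 // a single chunk is its own root
	}
	// number of data chunks at the leaf level: ceil(dataSize/chunkSize)
	level := (dataSize + chunkSize - 1) / chunkSize
	total := level
	// walk up the tree until a single root chunk remains
	for level > 1 {
		level = (level + branches - 1) / branches
		total += level
	}
	return total
}

func main() {
	for _, size := range []int{1024, 8192, 16000, 30000, 1000000} {
		fmt.Println(size, totalChunks(size, 4096, 128))
	}
}
```

For 1000000 bytes this yields 245 + 2 + 1 = 248, matching the calculation above and the larger of the two observed values, which suggests the 247 runs are losing a chunk rather than the expectation being off by one.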

If you can reproduce this anomaly using the same data (randomized but from a fixed seed), then perhaps we could discover which chunk goes missing, and if it's the same one.

Contributor:

If this is indeed flaky, alternating between 247 and 248, we must find what is going on...

@nolash (Contributor) left a review comment:

I'm not sure if I understand this PR. What is the use-case for it?

return nil, err
}
// collect all references
for _, ref := range putter.References {
Contributor:

Is it intentional that the references are returned in arbitrary order?

Contributor:

@nolash is right - we should sort them before returning.

Contributor:

And what sort order should this be? By value? By hierarchy? If the latter, how do we achieve that when the putter receives the hashes in no specific order?

Contributor:

Alphabetical - so that when we have a tool to query N nodes whether they have a list of chunks, they are all sorted in the same order and we can quickly merge and check N such lists.

Contributor Author:

OK, I will add alphabetical ascending sort order

Member:

Yes, alphabetical order would be the best.
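The agreed-on ordering can be sketched as follows. `Reference` and `sortReferences` are simplified stand-ins for the real storage types, assuming a reference is the raw hash bytes; sorting lexicographically by bytes gives the same order as sorting the hex-encoded hashes alphabetically, so lists from N nodes can be merged and compared in a single pass:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Reference stands in for the chunk reference (hash) type.
type Reference []byte

// sortReferences orders references lexicographically by their raw bytes,
// which is equivalent to alphabetical order of their hex encodings.
func sortReferences(refs []Reference) {
	sort.Slice(refs, func(i, j int) bool {
		return bytes.Compare(refs[i], refs[j]) < 0
	})
}

func main() {
	refs := []Reference{
		{0xde, 0xad}, {0xab, 0xc1}, {0x00, 0x42},
	}
	sortReferences(refs)
	for _, r := range refs {
		fmt.Printf("%x\n", r) // 0042, abc1, dead
	}
}
```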

// HashExplorer's Put will add just the chunk hashes to its `References`
func (he *HashExplorer) Put(ctx context.Context, chunkData ChunkData) (Reference, error) {
// Need to do the actual Put, which returns the references
ref, err := he.hasherStore.Put(ctx, chunkData)
Contributor:

Oh my, it's pretty clumsy having to hash everything twice... I wonder why the pyramidsplitter only returns the data, not the reference.

Contributor Author:

@nolash not everything is hashed twice - and the two operations happen at different points in time.

We use Store when we actually store a data structure on swarm. That path uses the "conventional" PyramidSplit, where having all references for the data structure is (currently) not needed.

We use GetAllReferences for debugging, when we ask a node "do you actually (still?) have a chunk with hash abc123 in your store?". The data is then hashed again (but only by the checking node) to collect all references for the given data structure - so it happens at a different point in time, and only for user-selected data structures.

Is the use case clear now?

@holisticode (Contributor, Author):

Reopened upstream as ethereum/go-ethereum#19002
