
Compute the MD5 hash of the file for Azure Storage #1187

Open
penguoir opened this issue Sep 16, 2024 · 5 comments
@penguoir

Is your feature request related to a problem? Please describe.
I'm trying to use Ruby on Rails with tusd. Rails' Active Storage requires a checksum to verify file integrity. Right now, tusd doesn't compute one, so I have to disable checksum verification in Rails, which is cumbersome.

Describe the solution you'd like
When saving blobs to Azure, save their MD5 hash too.

Describe alternatives you've considered

  • Can generate the hash on Rails after uploading the file to Azure
  • Can disable integrity verification

Can you provide help with implementing this feature?
Yes, happy to help!

Additional context

This was mentioned when Azure storage was added:

#401 (comment)

Copied from that thread:

may I ask what benefit computing the MD5 hash of the file has? (I've never used it, so I am curious) Would you compute the MD5 hash of the block of the blob, or the entire file?

The hash is used to verify the integrity of the blob/file during transport.

Also, you can verify you don't have duplicates on your system.

@penguoir
Author

As a side note, I'd also like to save the metadata (currently saved under a separate blob) in the Azure-provided "metadata" field. But I can open a separate issue for that.

@penguoir
Author

Just had a look at implementing this:

  • I don't think it's right to add the hash to the info blob. We write the info blob before reading the actual file, so there's no way to get the file's hash into the info blob (unless we update the info blob after the file is uploaded).
  • I'm not sure whether it's possible to correctly compute the hash of the uploaded file. It depends on:
    • Does UploadChunk always run in start-to-finish order, and does it run in parallel or sequentially? For the hash computation to work, the chunks must be processed sequentially and in order.
  • How do we handle resumable uploads?

@Acconut
Member

Acconut commented Sep 17, 2024

Thanks for bringing this up. tusd currently doesn't have any feature for calculating or comparing checksums of the uploaded data, but I would like to change this in the future while adding support for the tus checksum extension and HTTP digest fields (for draft-ietf-httpbis-resumable-upload). I'd love to collaborate with you on this if you are interested.

Support for checksums / file integrity checks shouldn't be tied to Azure or any other storage. Instead, calculating the checksum while the data is uploaded is the responsibility of the central upload handling logic (in unrouted_handler.go). The calculated digests can then be used to verify the integrity of the entire upload or individual PATCH requests and be provided to the storage or hooks.

  • How do we handle resumable uploads?

We would have to store the state of the checksum calculation if the upload is interrupted/saved. If the upload is resumed, we can continue the calculation until the upload is finished.

@penguoir
Author

Sure, I can spend a couple of hours looking into this over the weekend and see how far I get.

@Acconut
Member

Acconut commented Sep 17, 2024

That sounds great! Before heading into an implementation, we can also brainstorm different implementation designs to make sure we cover all requirements.
