
Compute the MD5 hash of the file for Azure Storage #1187

Open
penguoir opened this issue Sep 16, 2024 · 5 comments
@penguoir

Is your feature request related to a problem? Please describe.
I'm trying to use Ruby on Rails with tusd. Rails' Active Storage requires a checksum to verify file integrity. Right now, tusd doesn't compute one, so I have to disable checksum verification in Rails, which is cumbersome.

Describe the solution you'd like
When saving blobs to Azure, save their MD5 hash too.

Describe alternatives you've considered

  • Can generate the hash on Rails after uploading the file to Azure
  • Can disable integrity verification

Can you provide help with implementing this feature?
Yes, happy to help!

Additional context

This was mentioned when Azure storage was added:

#401 (comment)

Copied from that thread:

may I ask what benefit computing the MD5 hash of the file has? (I've never used it, so I am curious) Would you compute the MD5 hash of the block of the blob, or the entire file?

The hash is used to verify the integrity of the blob/file during transport.

Also, you can verify you don't have duplicates on your system.

@penguoir
Author

As a side note, I'd also like to save the metadata (currently saved under a separate blob) in the Azure-provided "metadata" field. But I can open a separate issue for that.

@penguoir
Author

Just had a look at implementing this:

  • I don't think it's right to add the hash to the info blob. We write the info blob before reading the actual file, so there's no way to get the file's hash into the info blob (unless we update the info blob after the file is uploaded).
  • I'm not sure whether it's possible to correctly compute the hash of the uploaded file. It depends on:
    • Does UploadChunk always run in start-to-finish order, and does it run in parallel or sequentially? For the hash computation to work, the chunks must be processed sequentially and in order.
  • How do we handle resumable uploads?

@Acconut
Member

Acconut commented Sep 17, 2024

Thanks for bringing this up. tusd currently doesn't have any feature for calculating or comparing checksums of the uploaded data, but I would like to change this in the future while adding support for the tus checksum extension and HTTP digest fields (for draft-ietf-httpbis-resumable-upload). I'd love to collaborate with you on this if you are interested.

Support for checksums / file integrity checks shouldn't be tied to Azure or any other storage. Instead, calculating the checksum while the data is uploaded is the responsibility of the central upload handling logic (in unrouted_handler.go). The calculated digests can then be used to verify the integrity of the entire upload or individual PATCH requests and be provided to the storage or hooks.

  • How do we handle resumable uploads?

We would have to store the state of the checksum calculation if the upload is interrupted/saved. If the upload is resumed, we can continue the calculation until the upload is finished.

@penguoir
Author

Sure, I can spend a couple of hours looking into this over the weekend and see how far I get.

@Acconut
Member

Acconut commented Sep 17, 2024

That sounds great! Before heading into an implementation, we can also brainstorm different implementation designs to make sure we cover all requirements.
