
Add crate file size upload limits #794

Open
wants to merge 2 commits into master

Conversation

@Kixiron (Member) commented May 30, 2020

Limits the maximum file size for uploaded files, configurable via Limits. The current default is 5 GB, but I just set that because I didn't know what the preferred number would be.

Closes #588
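
A minimal sketch of what such a configurable per-file limit could look like; the field name `max_file_size` and the exact default below are illustrative assumptions, not necessarily the code in this PR:

```rust
// Hypothetical sketch of a per-file upload limit added to the existing
// `Limits` configuration; the field name and default are assumptions.
pub struct Limits {
    /// Maximum size in bytes for any single file uploaded to storage.
    max_file_size: u64,
    // ... other existing limits (memory, build timeout, ...)
}

impl Default for Limits {
    fn default() -> Self {
        Self {
            // 5 GB, matching the default described above.
            max_file_size: 5 * 1024 * 1024 * 1024,
        }
    }
}
```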

@Kixiron (Member, Author) commented May 30, 2020

I considered testing this via File::set_len on a dummy file, but I don't know if a ~5 GB file is something we want to impose on testing.

@Mark-Simulacrum (Member)

FWIW I wouldn't expect File::set_len on most filesystems/platforms to actually commit anything to disk, just adjust metadata.

@Kixiron (Member, Author) commented May 30, 2020

According to the docs it fills the extended range with zeros, but that doesn't matter because I can just lower the limit to something small for the test; no idea how I didn't realize that.
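
For what it's worth, a test along these lines (a sketch using the tempfile crate; the test name and the 1 KiB limit are hypothetical) could combine File::set_len with a deliberately tiny limit instead of a multi-gigabyte dummy file:

```rust
use std::fs::File;

#[test]
fn oversized_file_exceeds_limit() -> std::io::Result<()> {
    // Hypothetical sketch: use a 1 KiB limit for the test instead of the
    // real multi-gigabyte default, so the dummy file stays tiny.
    let limit: u64 = 1024;
    let dir = tempfile::tempdir()?;
    let path = dir.path().join("dummy.bin");

    let file = File::create(&path)?;
    // set_len extends the file and fills the new range with zeros.
    file.set_len(limit + 1)?;

    // This is the same metadata check the upload path would perform.
    assert!(std::fs::metadata(&path)?.len() > limit);
    Ok(())
}
```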

@Nemo157 (Member) commented May 30, 2020

Out of the docs I have, the largest file is

91513860        docs/web-sys/0.3.35/x86_64-pc-windows-msvc/src/web_sys/opt/rustwide/target/x86_64-pc-windows-msvc/debug/build/web-sys-caeee69dd050a8ab/out/bindings.rs.html

There are a few of those for the different platforms, then it drops down to a handful of ~10-15 MB files from other crates. The largest non-src file is

12291596        docs/stm32f0/0.10.0/x86_64-pc-windows-msvc/search-index-20200414-1.44.0-nightly-edc02580e.js

@Kixiron (Member, Author) commented May 30, 2020

Does 500MB sound like a reasonable default per-file limit?

@jyn514 (Member) commented May 31, 2020

I'm not sure this belongs in docs.rs.

a) For files that are part of the tarball, I think filtering based on file size should happen on crates.io, if at all. In particular, I don't like the idea of having crates that can be published but then will be automatically rejected by docs.rs, that seems like it should need an RFC at the minimum.

b) For files that are generated, I'm not sure I see the use case. The files that will be blocked are ... none, unless rustdoc starts generating enormously larger HTML files? In that case why do we need the limit? The only exception I see is files created by build.rs, but I think that should be addressed by limiting disk usage during builds, not disk usage during uploading (or otherwise the server itself will run out of disk space before finishing the build).

@Kixiron (Member, Author) commented May 31, 2020

The point of the limit isn't to save space or to ever actually be set off by rustdoc; it's purely for included files of an absurd size, as mentioned in the issue, e.g. a crate that has a massive video file in it. Filtering input files somewhat is part of the 101s of web security; even if Crates.io did this for us, I'd still advocate for implementing this because it's a very simple fix that helps with security. And I don't really think an RFC would be needed at all, as a) the limit is absurdly high and b) it's easily fixable (in the unlikely event that someone triggers it in a meaningful way), since I implemented it the same way as other limits on builds, such as memory and build timeouts. This is also a better solution than limiting disk usage, as it allows crates to generate whatever weird, large files they may need (and allows the build to work perfectly), but then just not uploading said large weird files.

@jyn514 (Member) commented May 31, 2020

> it's purely for included files of an absurd size

I think that should be handled by crates.io if at all.

> Filtering input files somewhat is part of the 101s of web security, even if Crates.io did this for us, I'd still advocate for implementing this because it's a very simple fix that helps with security.

That's only relevant if we accept arbitrary uploads. We only accept uploads that were previously accepted by crates.io.

> This is also a better solution than limiting disk usage, as it allows crates to generate whatever weird, large files they may need (and allowing the build to work perfectly), but then just not uploading said large weird files

I actually think silently not uploading the file is worse than failing the build because you don't notice it's not there until you try to visit the page.

Also, you haven't addressed the concern about running out of disk space during a build (which admittedly is somewhat of a separate issue).

@Kixiron (Member, Author) commented May 31, 2020

For the first two comments, I'll restate that even if Crates.io did this for us, I'd still advocate for the change. It's incredibly simple to implement, understand, and test (and I've already done it, so it's not a question of future work either), it won't affect performance whatsoever, and it's just general good practice, even though we're getting docs from a trusted source. The actual change itself is a single line of code, just an if file.metadata().unwrap().len() <= threshold check that could potentially save us a lot of pain in the future.
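
In other words, something along these lines (a sketch only; the real field and function names in the PR may differ):

```rust
use std::path::Path;

// Sketch of the per-file check described above: `max_file_size` is the
// configurable limit and `path` is a file that is about to be uploaded.
fn within_limit(path: &Path, max_file_size: u64) -> std::io::Result<bool> {
    Ok(std::fs::metadata(path)?.len() <= max_file_size)
}
```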

> I actually think silently not uploading the file is worse than failing the build because you don't notice it's not there until you try to visit the page.

It could be implemented to fail the build as well, but I want to reiterate that this should never affect a normal crate. Anyone who sets this off is either malicious (attempting to attack us/crates.io) or at minimum engaging in the terrible practice of producing very large build artifacts outside of the target/ directory. The former should be guarded against and the latter is easily fixed, as this only matters for very large files that we are attempting to send to S3.

> Also, you haven't addressed the concern about running out of disk space during a build (which admittedly is somewhat of a separate issue).

That sounds like a separate Docker-based PR, adding a limit to the container itself.

@pietroalbini (Member)

> For files that are part of the tarball, I think filtering based on file size should happen on crates.io, if at all. In particular, I don't like the idea of having crates that can be published but then will be automatically rejected by docs.rs, that seems like it should need an RFC at the minimum.

crates.io does indeed have a limit, but we override it for some crates if they legitimately have a need for more space. Here are the current exceptions:

crates-io::DATABASE=> select (max_upload_size / 1024 / 1024) AS upload_size_mb, count(*) from crates where max_upload_size is not null group by max_upload_size;
 upload_size_mb | count
----------------+-------
             15 |     2
             20 |     5
             50 |     1
             55 |     1
            300 |     1

> That's only relevant if we accept arbitrary uploads. We only accept uploads that were previously accepted by crates.io.

Unfortunately build scripts and proc macros are a thing :(

> It could be implemented to fail the build as well, [...]

We will definitely want to fail the build.

@jyn514 (Member) commented Jun 1, 2020

Ok, if we're going to implement this, it should be a restriction on the build, not on the uploads, and it should be per-crate, not per-file (or a malicious crate could upload a thousand files, each one byte less than the max).

@pietroalbini it doesn't look like Rustwide currently has an API for limiting the storage space on https://docs.rs/rustwide/0.7.1/rustwide/cmd/struct.SandboxBuilder.html, could that be added?
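
A per-crate check along the lines suggested above would accumulate sizes across all of a build's files rather than testing each file in isolation; a rough sketch (function and parameter names assumed):

```rust
use std::path::Path;

// Hypothetical sketch of a per-crate limit: fail once the combined size of
// all files produced by a build exceeds the crate-wide maximum.
fn crate_within_limit(files: &[&Path], max_crate_size: u64) -> std::io::Result<bool> {
    let mut total: u64 = 0;
    for file in files {
        total += std::fs::metadata(file)?.len();
        if total > max_crate_size {
            return Ok(false);
        }
    }
    Ok(true)
}
```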

@jyn514 (Member) commented Jun 1, 2020

The reason I'm so adamant about limiting this during the build is that otherwise we could still have DoS attacks that cause the site to go down when the hard drive fills up; having higher storage costs is much less of an issue.

@pietroalbini (Member)

I think we should have limits on both the disk space used and the uploads. There is no reason to block this PR only because a different limit is not enforced yet.

For the default limit, we should actually analyze the S3 inventory and see how much storage crates actually use.

@jyn514 (Member) commented Jun 11, 2020

> For the default limit, we should actually analyze the S3 inventory and see how much storage crates actually use.

Do you know a reasonably efficient way to do that? Just querying the size of each file on S3 will take a very long time because we have so many.

@pietroalbini (Member)

We have the inventory of all the files present in the bucket already.
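
Assuming the inventory can be dumped as comma-separated key,size rows and that keys are laid out as rustdoc/<crate>/<version>/... (both are assumptions about the bucket, not confirmed here), per-crate totals could be computed roughly like this:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

// Hypothetical sketch: aggregate per-crate storage use from an S3 inventory
// dump with one "key,size" pair per line. The key layout is an assumption.
fn per_crate_totals(inventory_path: &str) -> std::io::Result<HashMap<String, u64>> {
    let mut totals: HashMap<String, u64> = HashMap::new();
    for line in BufReader::new(File::open(inventory_path)?).lines() {
        let line = line?;
        let mut parts = line.splitn(2, ',');
        let key = parts.next().unwrap_or_default();
        let size: u64 = parts.next().unwrap_or("0").trim().parse().unwrap_or(0);
        // Assume keys look like "rustdoc/<crate>/<version>/...".
        if let Some(krate) = key.split('/').nth(1) {
            *totals.entry(krate.to_string()).or_insert(0) += size;
        }
    }
    Ok(totals)
}
```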

@jyn514 added the S-waiting-on-review label (Status: This pull request has been implemented and needs to be reviewed) on Jun 26, 2020