Pre-generate and serve simple index metadata #8487

Open
Tracked by #10672
woodruffw opened this issue Aug 28, 2020 · 15 comments

Comments

@woodruffw
Member

What's the problem this feature will solve?

As part of the TUF rollout (#7488), we will need to store hashes for the simple indices that pip and other resolvers use.

These indices are currently generated dynamically from a template when requested, making that difficult. Instead, they should be generated once per relevant event (file upload/release) and stored somewhere (probably GCS). Stale indices should not be deleted from the store, as the TUF metadata may still refer to them.

cc @ewdurbin @dstufft

@di
Member

di commented Aug 28, 2020

For those that aren't as familiar with TUF, a few additional questions about this:

  • what hashes should we store?
  • do we need to include other metadata with the file as well? timestamp?
  • do we need this for /simple or just /simple/projectname?
  • how would TUF metadata refer to an old index?

@woodruffw
Member Author

Yep! Thanks for the clarifying questions.

  • what hashes should we store?

TUF is using BLAKE2 for the other target metadata (i.e., the actual distribution packages), so it probably makes sense to use it here as well.

  • do we need to include other metadata with the file as well? timestamp?

I don't believe so; I think just the file itself should be sufficient. @jku may be able to correct me here, if I'm missing something.

  • do we need this for /simple or just /simple/projectname?

Just /simple/projectname, for TUF purposes. My understanding (again, please correct me if I'm wrong!) is that no resolvers currently use /simple, and TUF won't be using it whatsoever.

  • how would TUF metadata refer to an old index?

This is probably the knottiest part. My first thought for this is that /simple/projectname should be mapped to /simple/{hash}-projectname, where {hash} is the BLAKE2 content hash. The TUF metadata would only ever refer to the {hash}-projectname variant, ensuring that we always fetch a version of the index that's consistent with the other target metadata.
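A minimal sketch of that mapping, assuming the hash is a 32-byte BLAKE2b digest (64 hex characters) computed over the rendered index body. The function name `hashed_index_path` and the sample body are hypothetical and are not taken from Warehouse:

```python
import hashlib


def hashed_index_path(project: str, index_body: bytes) -> str:
    """Return a content-addressed path for one project's simple index."""
    # Assumption: BLAKE2b with a 32-byte digest, i.e. 64 hex characters.
    digest = hashlib.blake2b(index_body, digest_size=32).hexdigest()
    return f"/simple/{digest}-{project}"


# Hypothetical index body, used only to exercise the function.
body = b"<html><body><a href='...'>sampleproject-1.0.tar.gz</a></body></html>"
path = hashed_index_path("sampleproject", body)
other_path = hashed_index_path("sampleproject", body + b"\n")
```

Because the path depends on the content, any change to the index yields a new path, and the old path keeps pointing at the old content for as long as it is retained.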

@jku
Contributor

jku commented Aug 28, 2020

Yeah this all seems correct to me. The TUF metadata for a project index will look roughly like this:

   "sampleproject": {
    "custom": {},
    "hashes": {
     "blake2b": "7a8bc0d0a15f6289b184f86a56d398467fd465c6293535182ba0f6cc2f04e703"
    },
    "length": 3080
   }

A client that sees this metadata will download https://pypi.org/simple/7a8bc0d0a15f6289b184f86a56d398467fd465c6293535182ba0f6cc2f04e703.sampleproject to ensure it's getting exactly the version of the sampleproject index it wants. This should work for all hashes mentioned in the metadata (but I suppose Warehouse will only use blake2b).
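As a rough illustration of what a client could do with the `length` and `hashes` fields, here is a sketch of the check on a downloaded index. It assumes a 32-byte BLAKE2b digest, and `verify_index` is a hypothetical helper, not part of any TUF client library:

```python
import hashlib
import hmac


def verify_index(body: bytes, expected_length: int, expected_blake2b: str) -> bool:
    """Check a downloaded index against the length and hash from the metadata."""
    # Cheap check first: reject bodies of the wrong size.
    if len(body) != expected_length:
        return False
    # Assumption: 32-byte BLAKE2b digest, giving a 64-character hex string.
    actual = hashlib.blake2b(body, digest_size=32).hexdigest()
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, expected_blake2b)


# Hypothetical index body and matching metadata, used only to exercise the check.
body = b"<!DOCTYPE html><html><body></body></html>"
meta = {
    "hashes": {"blake2b": hashlib.blake2b(body, digest_size=32).hexdigest()},
    "length": len(body),
}
ok = verify_index(body, meta["length"], meta["hashes"]["blake2b"])
tampered = verify_index(body + b"x", meta["length"], meta["hashes"]["blake2b"])
```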

I'll mention that we can of course agree on a different path for index files if, e.g., you don't want to pollute the "/simple/*" namespace with so many new items. If we do that, the path should be relative to the PyPI index URL (so that the path can be found on all Warehouse instances without configuration). Something like https://pypi.org/simple/.project-indexes/ would be fine with me.

@dstufft
Member

dstufft commented Aug 30, 2020

Another option would be to do something like: /simple/PROJECT/HASH/.

That would make it easy for mirrors to keep all of the related files colocated, to enable deletion and cleanup work without having to track where those files are for a specific project.

@jku
Contributor

jku commented Aug 30, 2020

Another option would be to do something like: /simple/PROJECT/HASH/.

The reasoning is sound, but the TUF client implementation currently expects the target name to include a filename that will then be prefixed with the hash. This could of course be worked around, but alternatively something like /simple/PROJECT/HASH.index.html or /simple/PROJECT/HASH.PROJECT would work out of the box.

@dstufft
Member

dstufft commented Aug 30, 2020

Yea those are fine with me.

@di
Member

di commented Sep 18, 2020

Chatting with @ewdurbin today, we determined that /simple/PROJECT/HASH.PROJECT is required since we don't actually produce any index.html files.

A first pass at this is in #8586.

@jku
Contributor

jku commented Oct 2, 2020

Something I did not think of when we last discussed this: it might be a good idea not to use the project name in the file name itself because of filename length limits, so I would suggest something like /simple/<PROJECT>/<HASH>.index.html.

This is not a practical issue right now (the blake2b hex digest is 64 bytes and the longest project name on PyPI seems to be 80 bytes, still far from the 255-byte limit), but avoiding the potential problem seems like a good idea if doing so is painless.
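The arithmetic behind that estimate can be checked directly. This sketch assumes a 32-byte BLAKE2b digest rendered as hex and a `<HASH>.<PROJECT>` filename; the constant names are illustrative:

```python
import hashlib

NAME_MAX = 255  # filename length limit mentioned in the comment
LONGEST_PROJECT_NAME = 80  # longest project name quoted in the comment

# Hex length of a 32-byte BLAKE2b digest.
hash_len = len(hashlib.blake2b(b"", digest_size=32).hexdigest())

# "<HASH>.<PROJECT>": the hash, one separator character, then the project name.
filename_len = hash_len + 1 + LONGEST_PROJECT_NAME
headroom = NAME_MAX - filename_len
```

Under these assumptions the longest filename is 145 bytes, which leaves 110 bytes of headroom.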

@di
Member

di commented Oct 2, 2020

Since we can't use index.html, does /simple/<PROJECT>/<HASH> work?

The longest is 80 characters but I'm not sure where the practical limit for this comes from, if any: https://pypi.org/project/Aaaaaaaaaaaaaaaaaaa-aaaaaaaaa-aaaaaaasa-aaaaaaasa-aaaaasaa-aaaaaaasa-bbbbbbbbbbb/

@jku
Contributor

jku commented Oct 3, 2020

The TUF client library by default assumes it's given a URL that has a filename at the end: the client library then prefixes the filename with <HASH>. before doing the request. I think I can work around that assumption in pip if /simple/<PROJECT>/<HASH> is what works for you, but I will have to check that. I'll get back to you on Monday or Tuesday.
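The prefixing behaviour described here can be sketched as follows. This mirrors the description in the comment rather than the actual TUF client source, and `hash_prefixed_target` is a hypothetical name:

```python
import posixpath


def hash_prefixed_target(target_path: str, digest: str) -> str:
    """Prefix the final path component of target_path with '<digest>.'."""
    # Split into the directory part and the trailing "filename" component.
    dirname, basename = posixpath.split(target_path)
    return posixpath.join(dirname, f"{digest}.{basename}")


# "abc" stands in for a real hex digest to keep the examples readable.
with_project = hash_prefixed_target("/simple/sampleproject", "abc")
with_filename = hash_prefixed_target("/simple/sampleproject/index.html", "abc")
```

With these inputs, `with_project` is `/simple/abc.sampleproject` and `with_filename` is `/simple/sampleproject/abc.index.html`. A bare `/simple/<PROJECT>/<HASH>` URL has no trailing name to attach the prefix to, which is why it does not fall out of this scheme.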

@jku
Contributor

jku commented Oct 5, 2020

I think I can theoretically work around /simple/<PROJECT>/<HASH> in the client (pip) code, but only with an awful hack, so I won't do that.

I think the reasonable options are:

  • warehouse provides a URL that ends in /<HASH>.<FILENAME> -- the filename does not have to be index.html, anything will work (even something dynamic like the project name although my earlier comments about filename length stand).
  • if that's not possible then the TUF community needs to redesign the client API to be more flexible

I couldn't quite follow why 'index.html' was problematic so do let me know if the first option is not on the table: I'll have to start a discussion in TUF community in that case.

@woodruffw
Member Author

Bumping the question about index.html -- I think I might have also missed the reason why it can't be used (either as /HASH/index.html or HASH.index.html).

Alternatively, would something like /simple/PROJECT/HASH.detail.html work?

@di
Member

di commented Oct 8, 2020

Bumping the question about index.html -- I think I might have also missed the reason why it can't be used (either as /HASH/index.html or HASH.index.html).

It could be used but it doesn't make much sense as an endpoint within our routes -- there is no index.html file so we would sort of be hacking it in as an endpoint.

The TUF client library by default assumes it's given a URL that has a filename at the end: the client library then prefixes the filename with <HASH>. before doing the request.

As such, this is kind of a poor assumption, because virtually all of our routes don't have "filenames", including the ones in question here (unless you consider the last part of the path a filename).

If we say that project names are constrained to a maximum of 80 characters, is there any reason why /simple/<HASH>.<PROJECT_NAME> wouldn't work? That seems to be more aligned with TUF's desire to take the last piece of the URL and prepend <HASH>. to it, right?

@jku
Contributor

jku commented Oct 9, 2020

The TUF client library by default assumes it's given a URL that has a filename at the end: the client library then prefixes the filename with <HASH>. before doing the request.

As such, this is kind of a poor assumption, because virtually all of our routes don't have "filenames", including the ones in question here (unless you consider the last part of the path a filename).

I totally agree (I can also understand how they ended up with that design -- the focus was on passive systems where the targets and metadata are pre-generated and then served by a dumb fileserver). I'm just pointing out that the URL must end with /<HASH>.<SOMENAME> or we have to do some redesign work in the TUF client API: both options are valid.

If we say that project names are constrained to a maximum of 80 characters, is there any reason why /simple/<HASH>.<PROJECT_NAME> wouldn't work? That seems to be more aligned with TUF's desire to take the last piece of the URL and prepend <HASH>. to it, right?

Sure that works.

@woodruffw
Member Author

That works for me as well! Thanks for the explanation, @di!

@di di mentioned this issue Feb 1, 2022
52 tasks