Use nvcr.io private registry to stage CI internal containers #76

Draft
wants to merge 3 commits into
base: main

Conversation

yhtang
Collaborator

@yhtang yhtang commented May 22, 2023

Pros:

  • No longer need to pay for private container storage on ghcr.io
  • Potentially better confidentiality if builds contain NVIDIA IP not yet authorized for release.
  • Potentially more reliable pulling to/from dlcluster/selene

Cons:

  • (ironically) more difficult to grant access: users need to apply through an internal portal first.
  • docker push/pull is slightly slower than with ghcr.io for CI jobs, presumably because ghcr.io is better colocated with the Actions servers.

This PR is for preview purposes and may not represent the final work to be merged.

A good compromise may be to use ghcr.io for instantaneous access in jobs, but use a periodically scheduled Actions workflow to archive containers older than a certain time to nvcr.io.

UPDATE: clarification and examples in comment below.

@yhtang yhtang requested review from nouiz and mjsML May 22, 2023 17:43
@yhtang yhtang self-assigned this May 22, 2023
@nouiz
Collaborator

nouiz commented May 23, 2023

I don't like the idea of random end users needing to register to get the containers to replicate the CI. I'm fine with keeping containers for a shorter time without registration, e.g. only 30 days.

@terrykong
Contributor

Shortening the keep window to a month seems really short, especially from the perspective of someone adding a new model. They may start developing their model, but after 1.5 months they won't be able to reproduce their numbers because the base container is gone.

If we have to make the window 1 month for ghcr, my vote would be to mirror the registry on nvcr.io and have a longer keep window (maybe 6 months).

@nouiz
Collaborator

nouiz commented May 23, 2023

This is what I propose: have it on both for 1 month, and keep the longer-term storage on nvcr if we need to lower the storage usage on ghcr.

@yhtang
Collaborator Author

yhtang commented May 24, 2023

To clarify, I am only proposing moving containers in the private ghcr.io/nvidia/jax-toolbox-internal repo to nvcr.io/nvidian. This repo is a staging area for containers produced in our CI workflows. Separate jobs exist to copy select containers from this internal repo to the public ones for outside users to check out. These public repos, e.g. ghcr.io/nvidia/jax, will always remain on GitHub.

Here is an example of the current container storage solution:

  1. Job "build-jax" creates ghcr.io/nvidia/jax-toolbox-internal:5061628209-jax (user needs to be member of the NVIDIA org on GitHub to access)
  2. Optional job "publish-jax" copies ghcr.io/nvidia/jax-toolbox-internal:5061628209-jax to ghcr.io/nvidia/jax:nightly-2023-05-23 (anyone can access)
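The tag naming in the two steps above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and reading `5061628209` as a numeric CI run identifier is an assumption, not something stated in this thread.

```python
from datetime import date

# Repository names taken from the example above.
INTERNAL_REPO = "ghcr.io/nvidia/jax-toolbox-internal"
PUBLIC_REPO = "ghcr.io/nvidia/jax"


def internal_ref(run_id: int, component: str) -> str:
    """Hypothetical helper: image reference produced by a build job."""
    return f"{INTERNAL_REPO}:{run_id}-{component}"


def nightly_ref(build_date: date) -> str:
    """Hypothetical helper: public reference that a publish job copies to."""
    return f"{PUBLIC_REPO}:nightly-{build_date.isoformat()}"


src = internal_ref(5061628209, "jax")
dst = nightly_ref(date(2023, 5, 23))
print(f"copy {src} -> {dst}")
```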

Two candidate solutions to use nvcr.io for container storage as discussed above:

  1. Use nvcr.io for the private image repo, and ghcr.io for all public image repos. For example:
    • Job "build-jax" creates nvcr.io/nvidian/jax-toolbox-internal:5061628209-jax (needs NGC registration + nvidian membership to access)
    • Optional job "publish-jax" copies nvcr.io/nvidian/jax-toolbox-internal:5061628209-jax to ghcr.io/nvidia/jax:nightly-2023-05-23 (anyone can access)
  2. Use ghcr.io for all repos, but retain private images for only 30 days before transferring them to nvcr.io for archiving.
    • Job "build-jax" creates ghcr.io/nvidia/jax-toolbox-internal:5061628209-jax
    • Optional job "publish-jax" copies ghcr.io/nvidia/jax-toolbox-internal:5061628209-jax to ghcr.io/nvidia/jax:nightly-2023-05-23
    • Cron job "archive" moves ghcr.io/nvidia/jax-toolbox-internal:* tags older than 30 days to nvcr.io/nvidian/jax-toolbox-internal:*
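As a rough sketch of the selection step in the "archive" cron job from solution 2, the snippet below picks tags older than 30 days and pairs each with its nvcr.io destination. It makes several assumptions: `plan_archive` and the second tag name are hypothetical, the push timestamps are made up for illustration, and how the tag list is obtained or how images are actually copied and deleted is left out entirely.

```python
from datetime import datetime, timedelta, timezone

SRC_REPO = "ghcr.io/nvidia/jax-toolbox-internal"
DST_REPO = "nvcr.io/nvidian/jax-toolbox-internal"
RETENTION = timedelta(days=30)  # keep window on ghcr.io discussed above


def plan_archive(tags, now):
    """Return (source, destination) pairs for tags older than RETENTION.

    `tags` maps a tag name to its push time (timezone-aware datetime).
    Fetching that mapping from the registry is outside this sketch.
    """
    cutoff = now - RETENTION
    return [
        (f"{SRC_REPO}:{tag}", f"{DST_REPO}:{tag}")
        for tag, pushed_at in sorted(tags.items())
        if pushed_at < cutoff
    ]


# Illustrative data only: the timestamps are assumptions.
now = datetime(2023, 6, 30, tzinfo=timezone.utc)
tags = {
    "5061628209-jax": datetime(2023, 5, 23, tzinfo=timezone.utc),
    "hypothetical-recent-tag": datetime(2023, 6, 20, tzinfo=timezone.utc),
}
for src, dst in plan_archive(tags, now):
    print(f"move {src} -> {dst}")
```

Keeping the selection as a pure function of the tag list and the current time makes the retention rule easy to test separately from any registry access.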

@yhtang
Collaborator Author

yhtang commented May 24, 2023

And I agree with @nouiz and @terrykong that solution 2 seems the better choice.

@yhtang
Collaborator Author

yhtang commented May 24, 2023

This is the cost of private repo storage on ghcr.io:

GitHub charges usage to the account that owns the repository where the package is published. If your account's usage surpasses these limits and you have set a spending limit above $0 USD, you will pay $0.008 USD per GB of storage per day and $0.50 USD per GB of data transfer.

So if we accumulate 10 TB of containers after 6 months, the monthly bill will be 10,000 GB × $0.008/GB/day × 31 days/month = $2,480/month.
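The estimate above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the function name is hypothetical, and it ignores any included storage quota and all data transfer charges.

```python
# Storage rate from the GitHub pricing text quoted above.
RATE_USD_PER_GB_DAY = 0.008


def monthly_storage_cost(stored_gb: float, days_in_month: int = 31) -> float:
    """Estimated monthly bill in USD for a constant amount of stored data."""
    return stored_gb * RATE_USD_PER_GB_DAY * days_in_month


# 10 TB is treated as 10,000 GB, as in the estimate above.
print(f"${monthly_storage_cost(10_000):,.2f}/month")
```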

FYI pushing and pulling the containers in a CI job using GITHUB_TOKEN does not incur charges.

@nouiz
Collaborator

nouiz commented May 24, 2023

I'm good with both of those options. Option 1 seems simpler. Option 2 is a little better in that it will speed up the jobs a bit, but I'm not sure it will make an impact in the end.

@yhtang
Collaborator Author

yhtang commented Aug 31, 2023

/assistant summarize the entire discussion

@github-actions

The discussion revolves around the idea of moving containers to the nvcr.io private registry. Yhtang identifies the benefits, such as cost savings and potentially better confidentiality, but acknowledges the process to grant access might be more complicated and potentially slower for CI jobs. Yhtang suggests a compromise of using ghcr.io for instantaneous access in jobs, and a scheduled workflow to transfer older containers to nvcr.io.

Nouiz expresses concern about the inconvenience for end-users to register to access the containers to replicate the CI. He proposes to keep the containers for 30 days.

Terrykong agrees with Nouiz but raises a concern about a 30-day storage limit from a developer's perspective. He suggests mirroring the registry on nvcr.io and setting a longer storage limit, such as six months.

Nouiz then suggests keeping containers on both registries for 1 month and using nvcr.io for longer-term storage if storage usage needs to be reduced on ghcr.io.

Yhtang further explains his proposal by presenting two possible approaches: 1) Using nvcr.io for the private image repository and ghcr.io for all public image repositories. 2) Using ghcr.io for all repositories but retaining private images for 30 days before moving them to nvcr.io for archiving.

In terms of storage costs, Yhtang cites the cost of private repo storage on ghcr.io, illustrating that if they amassed 10 TB of containers over six months, the monthly cost would be high.

In the end, Nouiz is fine with both options presented by Yhtang. He suggests Option 1 is simpler, while Option 2 might speed up jobs slightly, though he doubts it will make a substantial impact.
