T5x loss+perf baselines #256

Merged
merged 18 commits into from
Oct 12, 2023
Conversation

maanug-nv
Contributor

  • print test metrics
  • update baseline scripts to support T5x
  • add t5x test baselines

@terrykong terrykong mentioned this pull request Sep 25, 2023
@yhtang
Collaborator

yhtang commented Sep 26, 2023

@maanug-nv Could you please update the description with an example of the mechanism in effect?

@maanug-nv
Contributor Author

@yhtang Which description do you mean: the PR description, one of the scripts, or the README?
Also, do you just want an example call of create_baselines.sh, or something more? More detail would be helpful.

terrykong
terrykong previously approved these changes Sep 28, 2023
@terrykong
Contributor

@yhtang bumping for @maanug-nv 's Q

@maanug-nv
Contributor Author

Added .github/workflows/baselines/test_t5x_mgmn_metrics.py, and the pytest command.
https://github.com/NVIDIA/JAX-Toolbox/actions/runs/6386266144

@maanug-nv maanug-nv force-pushed the maanug/capture-metrics-t5x branch 4 times, most recently from 4ba1456 to 9987b10 Compare October 9, 2023 22:52
@maanug-nv
Contributor Author

I've tuned the relative tolerances a few times, and all test cases pass in this run: https://github.com/NVIDIA/JAX-Toolbox/actions/runs/6475410065
Please review the T5x loss relative tolerances.
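
For readers unfamiliar with this kind of check, here is a minimal sketch of comparing measured loss values against a stored baseline with a relative tolerance. It is not the code from this PR: the baseline layout, the loss values, the function name `find_loss_mismatches`, and the `rtol` value are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical baseline: training loss keyed by step. The real baseline
# files in this PR may be laid out differently.
BASELINE_LOSS = {100: 5.20, 200: 4.10, 300: 3.65}


def find_loss_mismatches(measured, baseline, rtol):
    """Return (step, measured, expected) for every step outside tolerance."""
    mismatches = []
    for step, expected in sorted(baseline.items()):
        actual = measured.get(step)
        # With only rel_tol set, math.isclose checks
        # |actual - expected| <= rel_tol * max(|actual|, |expected|).
        # A step missing from the measured run also counts as a mismatch.
        if actual is None or not math.isclose(actual, expected, rel_tol=rtol):
            mismatches.append((step, actual, expected))
    return mismatches


# Made-up measurements: steps 100 and 200 are within 2%, step 300 is not.
measured = {100: 5.25, 200: 4.12, 300: 3.90}
print(find_loss_mismatches(measured, BASELINE_LOSS, rtol=0.02))
# prints [(300, 3.9, 3.65)]
```

A looser tolerance hides real regressions and a tighter one makes the test flaky across runs, which is presumably why the tolerances needed several rounds of tuning.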

@terrykong
Contributor

The rtol LGTM. Looks like #292 may have introduced conflicts that you have to resolve in summarize_metrics.py, but assuming the CI passes, I think we're good to merge.

@maanug-nv
Contributor Author

Resolved. If everything passes, feel free to merge.

@terrykong
Contributor

Manual run of t5x MGMN since t5x build is failing due to a known issue: https://github.com/NVIDIA/JAX-Toolbox/actions/runs/6492103243

@terrykong terrykong self-requested a review October 12, 2023 16:29
@terrykong terrykong merged commit 19c3c09 into main Oct 12, 2023
69 of 71 checks passed
@terrykong terrykong deleted the maanug/capture-metrics-t5x branch October 12, 2023 16:31
3 participants