Checkpointing and retrieval #13

Merged
merged 4 commits into from
Apr 16, 2024
Conversation

NicolasRR
Contributor

Implemented checkpointing and retrieval for the model, scheduler, and optimizer state_dicts, as well as the random generator states, for reproducibility.
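Without having seen the actual diff, a checkpoint along these lines would typically bundle the three state_dicts together with the CPU and CUDA RNG states. The function names and checkpoint keys below are hypothetical, not taken from the PR:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    # Hypothetical sketch: persist module/optimizer/scheduler state_dicts
    # plus the RNG states so a resumed run continues deterministically.
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "torch_rng": torch.get_rng_state(),
        "cuda_rng": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    # Restoring the RNG states makes dropout, weight init of new layers,
    # and any direct torch.rand* calls pick up where training left off.
    torch.set_rng_state(ckpt["torch_rng"])
    if ckpt["cuda_rng"] is not None:
        torch.cuda.set_rng_state_all(ckpt["cuda_rng"])
    return ckpt["step"]
```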

@mkrima
Collaborator

mkrima commented Apr 16, 2024

LGTM

@mkrima mkrima merged commit c542d4b into epfml:main Apr 16, 2024
@haeggee
Collaborator

haeggee commented Apr 16, 2024

Quick question: without having gone through the code in detail, does this also ensure that data sampling is deterministic (i.e., that the dataloader state is restored)? What I mean is: does the model see the same data when loading from a checkpoint as it would during an uninterrupted training run?

@NicolasRR
Contributor Author

Good point; I have added those modifications in another PR.

@mkrima
Collaborator

mkrima commented Apr 16, 2024

Sorry, I think I merged this too quickly. I forgot that we switched to using a dataloader in this repo. This PR resets the RNG state, but our dataloader has its own sampler with its own generator, so we have to save and restore that generator's state too. I can do that in another PR at our hackathon tomorrow:
https://stackoverflow.com/questions/60993677/how-can-i-save-pytorchs-dataloader-instance
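For context, the sampler issue can be sketched as follows (not the repo's actual code): if the sampler is given an explicit `torch.Generator`, its state can be saved with the checkpoint and restored on resume, which makes the shuffling order reproducible.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Hypothetical sketch: pass an explicit generator to the sampler so its
# state can be checkpointed alongside the model/optimizer RNG states.
dataset = TensorDataset(torch.arange(10))
g = torch.Generator()
g.manual_seed(42)
sampler = RandomSampler(dataset, generator=g)
loader = DataLoader(dataset, sampler=sampler, batch_size=2)

sampler_state = g.get_state()              # save this in the checkpoint
first_epoch = [batch[0].tolist() for batch in loader]

g.set_state(sampler_state)                 # restore it when resuming
resumed_epoch = [batch[0].tolist() for batch in loader]

# After restoring the generator state, the sampler replays the same order.
assert first_epoch == resumed_epoch
```

Without the explicit generator, `RandomSampler` draws from the global RNG, so restoring `torch.get_rng_state()` alone is not enough once other random ops interleave with sampling.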
