# Instrumenting your run

We assume you're following the structure of the arch-and-scaling template. Go to https://huggingface.co/ and create two models (currently: under your icon on the top right / New model):

- `<YOUR_MODEL_NAME>-checkpoints`
- `<YOUR_MODEL_NAME>-logs`

In your output path (`DATA_OUTPUT_PATH` in the arch-and-scaling template), `git clone` the logs repo and rename the folder to `logs` (`mv <YOUR_MODEL_NAME>-logs logs`); a minimal sketch of this step follows.
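A minimal sketch of the clone-and-rename step, assuming `DATA_OUTPUT_PATH` is already set and `<YOUR_HF_NAMESPACE>` stands for the user or organization you created the repos under (the namespace placeholder is an assumption, not part of the template):

```bash
# a sketch: put the logs repo at <DATA_OUTPUT_PATH>/logs
cd "$DATA_OUTPUT_PATH"
git clone https://huggingface.co/<YOUR_HF_NAMESPACE>/<YOUR_MODEL_NAME>-logs
mv <YOUR_MODEL_NAME>-logs logs
```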

## How to sync your logs with the hub

```bash
python tools/hub-sync.py --repo-path <DATA_OUTPUT_PATH>/logs/tensorboard/ --patterns "*tfevent*"
```
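If you want the logs to go up periodically while training is running, one option is to wrap the same command in a loop from a long-lived shell (a sketch, not part of the provided tooling; the one-hour interval is arbitrary):

```bash
# a sketch: re-sync the tensorboard logs once an hour
while true; do
    python tools/hub-sync.py --repo-path <DATA_OUTPUT_PATH>/logs/tensorboard/ --patterns "*tfevent*"
    sleep 3600
done
```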

## How to sync your checkpoints with the hub

This is the latest version of the process that was used in training 1.

Go to your checkpoints folder, which should contain a bunch of `global_stepXXXXXX` folders. Open a long-running interactive shell:

```bash
srun -p compil --cpus-per-task=40 -A six@cpu --time=6:00:00 --pty bash
```

then convert:

```bash
time find * -maxdepth 0 -type d -name "global_step*" -exec $six_ALL_CCFRWORK/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py --input_folder {} --output_folder ../hf-fixed/{} \;
```
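If the `find -exec` form is hard to read, or you only want to convert a subset, an equivalent loop over the same converter script (a sketch, same arguments as above) is:

```bash
# a sketch: convert each global_step* checkpoint, writing the HF version to ../hf-fixed/
for ckpt in global_step*; do
    [ -d "$ckpt" ] || continue   # only process the checkpoint directories
    $six_ALL_CCFRWORK/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py \
        --input_folder "$ckpt" --output_folder "../hf-fixed/$ckpt"
done
```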

to prepare the target dir:

```bash
#git -c http.extraHeader="Authorization: Basic " clone https://huggingface.co/bigscience/<YOUR_REPO>/
cd <YOUR_REPO>
huggingface-cli lfs-enable-largefiles .
git config --unset user.email
~/prod/code/bigscience/tools/hub-sync.py --repo-path . --patterns '*bogus*'
```

We are going to put each checkpoint into its own branch with the same name.

- If you have added tokenizer files (a more readable loop equivalent of this one-liner is sketched after the list):

  ```bash
  mv ../hf-fixed/global_step* .
  time find * -maxdepth 0 -type d -name "global_step*" -exec git checkout main \; -exec git checkout -b {} \; -exec mv {}/config.json . \; -exec mv {}/pytorch_model.bin . \; -exec git add config.json pytorch_model.bin <TOKENIZER_FILES> \; -exec git commit -m "add {}" \; -exec git push --set-upstream origin {} \; -exec mv config.json {}/ \; -exec mv pytorch_model.bin {}/ \;
  git checkout main
  ```

- If you just want to add the checkpoints, without tokenizer files:

  ```bash
  mv ../hf-fixed/global_step* .
  time find * -maxdepth 0 -type d -name "global_step*" -exec git checkout main \; -exec git checkout -b {} \; -exec mv {}/config.json . \; -exec mv {}/pytorch_model.bin . \; -exec git add config.json pytorch_model.bin \; -exec git commit -m "add {}" \; -exec git push --set-upstream origin {} \; -exec mv config.json {}/ \; -exec mv pytorch_model.bin {}/ \;
  git checkout main
  ```

- If you want to add tokenizer files later:

  ```bash
  time find * -maxdepth 0 -type d -name "global_step*" -exec git checkout main \; -exec git checkout {} \; -exec git add <TOKENIZER_FILES> \; -exec git commit -m "add {}" \; -exec git push --set-upstream origin {} \;
  git checkout main
  ```
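For reference, here is the first variant expressed as a plain loop (a sketch under the same assumptions: the converted checkpoints have already been moved into the repo, and `<TOKENIZER_FILES>` is replaced with your actual tokenizer file names):

```bash
# a sketch: one branch per checkpoint, same steps as the first find one-liner above
for ckpt in global_step*; do
    [ -d "$ckpt" ] || continue             # only process the checkpoint directories
    git checkout main
    git checkout -b "$ckpt"                # new branch named after the checkpoint
    mv "$ckpt"/config.json "$ckpt"/pytorch_model.bin .
    git add config.json pytorch_model.bin <TOKENIZER_FILES>
    git commit -m "add $ckpt"
    git push --set-upstream origin "$ckpt"
    mv config.json pytorch_model.bin "$ckpt"/   # restore the layout for the next iteration
done
git checkout main
```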

## Fast branch switching in case you messed up and want to fix all your checkpoints

What you want is `export GIT_LFS_SKIP_SMUDGE=1`. Here's an example that changes the activation function in the `config.json` files for each branch:

```bash
export GIT_LFS_SKIP_SMUDGE=1   # don't download the large LFS files when switching branches
git clone https://huggingface.co/bigscience/tr3e-1B3-c4-checkpoints
cd tr3e-1B3-c4-checkpoints
~/prod/code/bigscience/tools/hub-sync.py --repo-path . --patterns '*bogus*'
set +H                         # turn off bash history expansion ("!")
git branch -a | sort -V | perl -lne 'm|(global_step\d+)| && print qx[git checkout $1; perl -pi -e "s/gelu(?!_)/gelu_fast/" $1/config.json; git commit -m "gelu_fast is the correct activation_function" .; git push --set-upstream origin $1]'
export GIT_LFS_SKIP_SMUDGE=0
```
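To spot-check that every branch picked up the change, something like this works (a read-only sketch; it assumes `config.json` sits inside the `global_step*` folder on each branch, as in the example above):

```bash
# a sketch: print the activation_function recorded on each global_step branch
for br in $(git branch -a --format='%(refname:short)' | grep -o 'global_step[0-9]*' | sort -uV); do
    echo -n "$br: "
    git show "origin/$br:$br/config.json" | grep -o '"activation_function": *"[^"]*"'
done
```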

And an example that fixes checkpoints in the old format (contained within a `global_step` subfolder, no tokenizer files) to be compatible with `from_pretrained`:

```bash
export GIT_LFS_SKIP_SMUDGE=1   # don't download the large LFS files when switching branches
# called by mapfile below, once per branch line: $1 is the array index, $2 is the branch name
my_callback () {
  INDEX=${1}
  BRANCH=${2}
  if [[ $BRANCH == origin/global_step* ]];
  then
    git checkout "${BRANCH:7}"   # strip the "origin/" prefix
    git mv "${BRANCH:7}"/* .     # move the checkpoint files to the repo root
    cp ../gpt2_tokenizer/tokenizer.json .
    git add tokenizer.json
    git commit -m "fixed checkpoints to be from_pretrained-compatible"
    git push
  fi
}
get_branches () {
  git branch --all --format='%(refname:short)'
}
# mapfile -t -C my_callback -c 1 BRANCHES < <( get_branches ) # if you want the branches that were sent to mapfile in a new array as well
# echo "${BRANCHES[@]}"
mapfile -t -C my_callback -c 1 < <( get_branches )
```
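Once that has run, a quick read-only check that each branch now has the expected files at its root (a sketch, using the same branch listing as above):

```bash
# a sketch: list the key files at the root of each global_step branch
for br in $(git branch --all --format='%(refname:short)' | grep -o 'global_step[0-9]*' | sort -uV); do
    echo "== $br =="
    git ls-tree --name-only "origin/$br" | grep -E '^(config\.json|pytorch_model\.bin|tokenizer\.json)$'
done
```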