pip install torch==2.2.2+cu121 -f https://download.pytorch.org/whl/torch_stable.html
pip install torchvision==0.17.2+cu121 -f https://download.pytorch.org/whl/torch_stable.html
pip install kappamodules
pip install kappaschedules
pip install kappaconfig
pip install kappaprofiler
pip install wandb
pip install einops
pip install torchmetrics
pip install kappadata
Training models with this codebase requires some configuration, such as where the code can find datasets or where to put logs.
The file `static_config.yaml` defines all kinds of paths that are specific to your setup:
- where to store checkpoints/logs (`output_path`)
- from where to load pre-trained models (`model_path`)
- from where to load data (`global_dataset_paths`)
The file also contains some additional configurations:
- `local_dataset_path`: if this is defined, data will be copied to this location before training. This is typically used if there is a "slow" global storage and compute nodes have a fast local storage (such as a fast SSD).
- `default_wandb_mode`: how you want to log to Weights and Biases
  - `disabled`: don't log to W&B
  - `online`: use W&B in the "online" mode, i.e. such that you can see live updates in the web interface
  - `offline`: use W&B in the "offline" mode. This has to be used if compute nodes don't have internet access. You can sync the W&B logs after the run has finished to inspect them via the web interface.
  - if `online` or `offline` is used, you will need to create a `wandb_config` (see below)
To get started, copy `setup/static_configs/prod.yaml`, rename it to `static_config.yaml` and adapt it to your setup.
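As a rough sketch, a `static_config.yaml` could look like the following. Only the field names come from the description above; the concrete paths, the dataset key and the exact structure of `global_dataset_paths` are assumptions, so use your copy of `setup/static_configs/prod.yaml` as the authoritative template.

```yaml
# sketch of a static_config.yaml -- paths and the dataset key are placeholders,
# only the field names are taken from the description above
output_path: /data/checkpoints_and_logs      # where checkpoints/logs are stored
model_path: /data/pretrained_models          # from where pre-trained models are loaded
global_dataset_paths:                        # from where data is loaded
  imagenet1k: /data/datasets/imagenet1k      # dataset key is illustrative
local_dataset_path: /local/ssd/datasets      # optional: data is copied here before training
default_wandb_mode: disabled                 # disabled / online / offline
```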
You can log to W&B by setting a `wandb_mode`. Set it in the `static_config.yaml` via `default_wandb_mode`.
You can define which W&B project you want to log to via a `wandb: <CONFIG_NAME>` field in the yaml file that defines your run.
All provided yamls use the name `v4` as `<CONFIG_NAME>` by default. To use the same config as defined in the provided yamls, create a folder `wandb_configs`, copy the `template_wandb_config.yaml` into this folder, change `entity`/`project` in this file and rename it to `v4.yaml`.
Every run that defines `wandb: v4` will then fetch the details from this file and log your metrics to this W&B project.
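A minimal sketch of such a `wandb_configs/v4.yaml` (the `entity`/`project` field names come from the description above, their values are placeholders for your own W&B entity and project):

```yaml
# wandb_configs/v4.yaml -- replace the values with your own W&B entity/project
entity: my-wandb-entity
project: my-project
```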
This codebase supports runs in SLURM environments. For this, you need to provide some additional configurations.
- Copy the `setup/sbatch_configs/prod.yaml`, rename it to `sbatch_config.yaml` and adjust the values to your setup.
- Copy the `setup/sbatch_templates/nodes.sh`, rename it to `template_sbatch_nodes.sh` and adjust the values to your setup.
- Copy the `setup/sbatch_templates/gpus.sh`, rename it to `template_sbatch_gpus.sh` and adjust the values to your setup.
You can start runs with the `main_train.py` file. For example:
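`python main_train.py --devices 0 --hp yamls/stage2/l16_mae.yaml` starts the run defined in `yamls/stage2/l16_mae.yaml` on the GPU with index 0 (this yaml is just one of the provided example configurations; the `--devices` argument is explained below).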
You can queue up runs in SLURM environments by running `python main_sbatch.py --hp <YAML> --time <TIME> --nodes <NODES>`, which queues up a run on `<NODES>` nodes with a time limit of `<TIME>`, using the hyperparameters from `<YAML>`.
All hyperparameters have to be defined in a yaml file that is passed via the `--hp <YAML>` CLI argument.
You can start runs on "normal" servers or SLURM environments.
Define how many (and which) GPUs you want to use with the `--devices` CLI argument:
- `--devices 0` will start the run on the GPU with index 0
- `--devices 2` will start the run on the GPU with index 2
- `--devices 0,1,2,3` will start the run on 4 GPUs
Examples:
python main_train.py --devices 0,1,2,3 --hp yamls/stage2/l16_mae.yaml
python main_train.py --devices 0,1,2,3,4,5,6,7 --hp yamls/stage3/l16_mae.yaml
To start runs in SLURM environments, you need to set up the configurations for SLURM as outlined above.
Then start runs with the `main_sbatch.py` script.
Example:
python main_sbatch.py --time 24:00:00 --nodes 4 --hp yamls/stage3/l16_mae.yaml
You can run many yamls by creating a folder `yamls_run`, copying all yamls that you want to run into that folder and then running `python main_run_folder.py --devices 0 --folder yamls_run`.
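For instance, copy `yamls/stage2/l16_mae.yaml` and `yamls/stage3/l16_mae.yaml` (two of the provided example yamls) into `yamls_run` and start `python main_run_folder.py --devices 0 --folder yamls_run` to run both configurations.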
Add these flags to your `python main_train.py` or `python main_sbatch.py` command to resume from a checkpoint:
- `--resume_stage_id <STAGE_ID>` resume from `cp=latest`
- `--resume_stage_id <STAGE_ID> --resume_checkpoint E100` resume from epoch 100
- `--resume_stage_id <STAGE_ID> --resume_checkpoint U100` resume from update 100
- `--resume_stage_id <STAGE_ID> --resume_checkpoint S1024` resume from sample 1024
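For example, `python main_train.py --devices 0 --hp yamls/stage3/l16_mae.yaml --resume_stage_id <STAGE_ID> --resume_checkpoint E100` would resume that run from its epoch-100 checkpoint (the yaml and device are just the example values used above).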
Add a resume initializer to the trainer:
trainer:
  ...
  initializer:
    kind: resume_initializer
    stage_id: ???
    checkpoint:
      epoch: 100
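Here, `???` is a placeholder for the stage id of the run you want to resume from (the same value you would pass as `<STAGE_ID>` above), and `epoch: 100` corresponds to resuming from the epoch-100 checkpoint.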