GitHub - 28andrew/DynamicMLTrainer: Automatically partition PyTorch models onto your GPU network topology for full model parallelism & pipelining.

Dynamic ML Trainer

Our project for CPSC 526 - Building Distributed Systems. We model the connections between GPUs as a weighted graph whose edge weights are GPU-to-GPU communication speeds. The intended use case is a data center in which groups of GPUs reside on the same machine, and latency and speed vary widely with interconnect type and network topology. We then partition the components of an arbitrary PyTorch model across these GPUs, as close to optimally as we can, to achieve full model parallelism via pipelining.
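To make the graph model concrete, here is a minimal sketch of such a weighted GPU graph. It is not the code in graph.py: the function name `build_gpu_graph`, the dict-of-dicts layout, and the bandwidth figures are all assumptions chosen for illustration.

```python
import itertools


def build_gpu_graph(machines, intra_bw, inter_bw):
    """Return a symmetric dict-of-dicts of link speeds between GPUs.

    machines: list of lists of GPU ids, one inner list per machine.
    intra_bw / inter_bw: hypothetical speeds (GB/s) for links inside a
    machine and links between machines.
    """
    gpu_to_machine = {g: m for m, gpus in enumerate(machines) for g in gpus}
    graph = {g: {} for g in gpu_to_machine}
    for a, b in itertools.combinations(sorted(gpu_to_machine), 2):
        same_machine = gpu_to_machine[a] == gpu_to_machine[b]
        bw = intra_bw if same_machine else inter_bw
        graph[a][b] = bw
        graph[b][a] = bw
    return graph


# Two machines with two GPUs each; the numbers are made up.
gpu_graph = build_gpu_graph([[0, 1], [2, 3]], intra_bw=300.0, inter_bw=12.5)
```

In a real deployment the weights would come from measurements rather than two constants, so that every GPU pair can have its own speed.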

Andrew

Worked on integrating the algorithms with PyTorch
Wrote GPU benchmarking script
Wrote the PyTorch model splitting/pipelining code + that side of the demo code
Worked on setting up the distributed environment, including an attempt to run NCCL over a virtual Docker network, which failed

Jeffrey

Wrote most sections of the write-up
Helped Andrew with the PyTorch pipelining implementation

Jiakang

Worked on implementing/testing the brute-force partitioning algorithm
Worked on implementing/testing the heuristic partitioning algorithm
Worked on implementing/testing the hierarchical partitioning algorithm
Helped write the algorithm section of the write-up

Demo Link

Demo Video

Instructions

The environment can be set up in a new Conda environment. On an x86_64 machine with CUDA 12.4-compatible GPUs, the following commands install the dependencies:

pip3 install torch torchvision torchaudio &&
apt-get -y install g++ &&
pip install torch-cluster torch-scatter -f https://data.pyg.org/whl/torch-2.5.1+cu124.html &&
pip install torch_geometric torchinfo matplotlib &&
pip install pyg-lib -f https://data.pyg.org/whl/torch-2.5.1+cu124.html

Running the Partitioning Algorithms Separately

Besides implementing the partitioning algorithms, we have also written a few toy examples that you can run separately to see how the various heuristics perform.

You can run

python3 algo_bash.py

to see how well the best partitioning found by the brute-force algorithm performs after running for 2 minutes on a graph with 201 nodes split over 3 GPUs.
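As a rough illustration of a time-budgeted brute-force search, the sketch below enumerates node-to-GPU assignments until it either finishes or runs out of time, and keeps the best one seen. It is not the code in algo_bash.py. The cost metric (slowest GPU's compute load plus cross-GPU transfer time) and the names `partition_cost` and `brute_force` are assumptions.

```python
import itertools
import time


def partition_cost(assign, node_cost, edges, bandwidth):
    """Hypothetical metric: max per-GPU compute load plus cross-GPU transfer time."""
    load = [0.0] * len(bandwidth)
    for node, gpu in enumerate(assign):
        load[gpu] += node_cost[node]
    comm = sum(size / bandwidth[assign[u]][assign[v]]
               for u, v, size in edges if assign[u] != assign[v])
    return max(load) + comm


def brute_force(node_cost, edges, bandwidth, time_limit_s=120.0):
    """Try assignments in lexicographic order until done or out of time."""
    n_gpus = len(bandwidth)
    deadline = time.monotonic() + time_limit_s
    best, best_cost = None, float("inf")
    for assign in itertools.product(range(n_gpus), repeat=len(node_cost)):
        cost = partition_cost(assign, node_cost, edges, bandwidth)
        if cost < best_cost:
            best, best_cost = assign, cost
        if time.monotonic() > deadline:
            break  # budget exhausted: return the best assignment seen so far
    return best, best_cost


# Toy instance: a chain of 4 unit-cost nodes on 2 GPUs (made-up numbers).
node_cost = [1.0, 1.0, 1.0, 1.0]
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]  # (src, dst, data size)
inf = float("inf")
bandwidth = [[inf, 10.0], [10.0, inf]]
best, best_cost = brute_force(node_cost, edges, bandwidth, time_limit_s=5.0)
```

With 201 nodes and 3 GPUs there are 3^201 assignments, so a search like this can only cover a tiny fraction of the space within any realistic budget. That is why the time limit matters.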

Similarly, you can run

python3 algo_heuristic.py

to see how well the best partitioning found by the heuristic algorithm performs, with max_iterations = 10 and 20 different random initial partitions.
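One common way to build a heuristic with these two knobs is local search with random restarts, sketched below. This is an assumed stand-in, not the algorithm in algo_heuristic.py. The cost metric and the names `cost` and `local_search` are hypothetical.

```python
import random


def cost(assign, node_cost, edges, bandwidth):
    """Hypothetical metric: max per-GPU compute load plus cross-GPU transfer time."""
    load = [0.0] * len(bandwidth)
    for node, gpu in enumerate(assign):
        load[gpu] += node_cost[node]
    comm = sum(size / bandwidth[assign[u]][assign[v]]
               for u, v, size in edges if assign[u] != assign[v])
    return max(load) + comm


def local_search(node_cost, edges, bandwidth,
                 max_iterations=10, restarts=20, seed=0):
    rng = random.Random(seed)
    n_gpus = len(bandwidth)
    best, best_cost = None, float("inf")
    for _ in range(restarts):
        # Start from a random partition.
        assign = [rng.randrange(n_gpus) for _ in node_cost]
        current = cost(assign, node_cost, edges, bandwidth)
        for _ in range(max_iterations):
            improved = False
            for node in range(len(assign)):
                keep = assign[node]  # best GPU found so far for this node
                for gpu in range(n_gpus):
                    if gpu == keep:
                        continue
                    assign[node] = gpu
                    trial = cost(assign, node_cost, edges, bandwidth)
                    if trial < current:
                        current, keep, improved = trial, gpu, True
                assign[node] = keep
            if not improved:
                break  # local optimum: no single-node move helps
        if current < best_cost:
            best, best_cost = list(assign), current
    return best, best_cost


# Toy instance: a chain of 4 unit-cost nodes on 2 GPUs (made-up numbers).
node_cost = [1.0, 1.0, 1.0, 1.0]
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
inf = float("inf")
bandwidth = [[inf, 10.0], [10.0, inf]]
best, best_cost = local_search(node_cost, edges, bandwidth)
```

Each restart can get stuck in a different local optimum, which is the usual motivation for trying several random initial partitions and keeping the best result.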

Finally, you can run

python3 algo_hierarchical.py

to observe how the hierarchical algorithm performs on a toy "datacenter" example, again with 201 nodes but now split over 96 GPUs.
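The general idea of a two-level partition can be sketched as follows: first split the model's nodes across machines, then split each machine's share across its GPUs. This is only an even, cost-blind split meant to show the hierarchy. It is not the algorithm in algo_hierarchical.py, and the 12 machines x 8 GPUs layout and the names `split_evenly` and `two_level_partition` are assumptions.

```python
def split_evenly(items, k):
    """Split items into k contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        size = q + (1 if i < r else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def two_level_partition(nodes, machines):
    """machines: list of lists of GPU ids. Returns {gpu_id: [nodes]}."""
    placement = {}
    per_machine = split_evenly(nodes, len(machines))
    for machine_gpus, machine_nodes in zip(machines, per_machine):
        per_gpu = split_evenly(machine_nodes, len(machine_gpus))
        for gpu, gpu_nodes in zip(machine_gpus, per_gpu):
            placement[gpu] = gpu_nodes
    return placement


# Hypothetical layout: 12 machines with 8 GPUs each = 96 GPUs.
machines = [[m * 8 + g for g in range(8)] for m in range(12)]
placement = two_level_partition(list(range(201)), machines)
sizes = [len(v) for v in placement.values()]
```

Keeping contiguous nodes on the same machine means most edges between neighboring nodes stay on fast intra-machine links. A cost-aware version would replace each even split with a partitioner that uses the link weights.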

You can also run the test suite for the algorithms and their helper functions via

pytest test_algos.py

You should see 8 tests passing.

Running the Training

For our method

python main.py

For the baseline PyTorch FSDP

python naive.py

Repository files

README.md
algo_bash.py
algo_bash_andrew.py
algo_heuristic.py
algo_hierarchical.py
at2275_lit_review.pdf
gpu_benchmark.py
gpu_speed.npy
graph.png
graph.py
jc4236_lit_review.pdf
main.py
metric_over_time_bash.csv
metric_over_time_heuristic.csv
metric_over_time_heuristic.png
model.py
naive.py
net-ids.txt
report.pdf
submission (1).tar.gz
submission.tar.gz
test.py
