Dynamic ML Trainer

Our project for CPSC 526 - Building Distributed Systems. We model the connections between GPUs as a weighted graph, with edge weights representing GPU-to-GPU communication speed. The intended use case is a data center environment where groups of GPUs reside on the same machine and latency & bandwidth vary widely depending on interconnect types and network topology. We then partition the components of an arbitrary PyTorch model as optimally as possible and map them onto these machines for full model parallelism via pipelining.
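
As a rough illustration of this setup (hypothetical data structures and names, not the repository's actual code), the GPU topology can be represented as a symmetric matrix of link speeds, and a candidate placement can be scored by the time spent on model edges that cross GPU boundaries:

# Illustrative sketch only; hypothetical names, not the repository's actual API.
# GPU topology as a symmetric matrix of link speeds (higher = faster link).
link_speed = [
    [0.0, 50.0, 5.0],   # GPU 0: fast NVLink-style link to GPU 1, slow network link to GPU 2
    [50.0, 0.0, 5.0],   # GPU 1
    [5.0, 5.0, 0.0],    # GPU 2, on another machine
]

# Model graph: layer i sends traffic[i] units of activations to layer i + 1.
traffic = [4.0, 4.0, 4.0, 4.0]   # 5 layers, 4 inter-layer edges
assignment = [0, 0, 1, 1, 2]     # the GPU each layer is placed on

def comm_cost(assignment, traffic, link_speed):
    """Total transfer time over model edges whose endpoints sit on different GPUs."""
    cost = 0.0
    for i, volume in enumerate(traffic):
        a, b = assignment[i], assignment[i + 1]
        if a != b:
            cost += volume / link_speed[a][b]
    return cost

print(comm_cost(assignment, traffic, link_speed))   # lower is a better placement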

Andrew

  • Worked on integrating the algorithms with PyTorch
  • Wrote the GPU benchmarking script
  • Wrote the PyTorch model splitting/pipelining code and that side of the demo code
  • Worked on setting up the distributed environment, including an unsuccessful attempt to run NCCL over a virtual Docker network

Jeffrey

  • Wrote most sections of the write-up
  • Helped Andrew with the PyTorch pipelining implementation

Jiakang

  • Worked on implementing/testing the brute-force partitioning algorithm
  • Worked on implementing/testing the heuristic partitioning algorithm
  • Worked on implementing/testing the hierarchical partitioning algorithm
  • Helped write the algorithm section of the write-up

Demo Link

Demo Video

Instructions

The environment can be set up in a new Conda environment. On systems with CUDA 12.4-compatible GPUs and an x86_64 architecture, the following commands install the required packages:

pip3 install torch torchvision torchaudio &&
apt-get -y install g++ &&
pip install torch-cluster torch-scatter -f https://data.pyg.org/whl/torch-2.5.1+cu124.html &&
pip install torch_geometric torchinfo matplotlib &&
pip install pyg-lib -f https://data.pyg.org/whl/torch-2.5.1+cu124.html 

Running the Partitioning Algorithms Separately

In addition to implementing the partitioning algorithms, we have written a few toy examples that you can run separately to see how the various heuristics perform.

You can run

python3 algo_bash.py

and see how the best partitioning found by the brute-force algorithm performs after running for 2 minutes on a graph with 201 nodes split over 3 GPUs.
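
For intuition, here is a minimal sketch of a brute-force partition search (hypothetical helper and cost function; the actual cost model, pruning, and 2-minute time limit used by algo_bash.py are not shown):

# Illustrative brute force: try every assignment of nodes to GPUs and keep the cheapest.
import itertools

def brute_force_partition(num_nodes, num_gpus, cost_fn):
    best_assignment, best_cost = None, float("inf")
    for assignment in itertools.product(range(num_gpus), repeat=num_nodes):
        cost = cost_fn(assignment)
        if cost < best_cost:
            best_assignment, best_cost = assignment, cost
    return best_assignment, best_cost

# Toy cost: count edges between consecutive nodes that cross GPU boundaries.
toy_cost = lambda a: sum(a[i] != a[i + 1] for i in range(len(a) - 1))
print(brute_force_partition(num_nodes=6, num_gpus=3, cost_fn=toy_cost))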

Similarly, you can run

python3 algo_heuristic.py

and see how the best partitioning found by the heuristic algorithm performs, using max_iterations = 10 and 20 different initial random partitions.
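
The general flavor of such a heuristic is sketched below as greedy local search with random restarts (hypothetical code; the move set and cost model in algo_heuristic.py may differ):

# Illustrative random-restart hill climbing; not the repository's exact algorithm.
import random

def local_search(num_nodes, num_gpus, cost_fn, max_iterations=10, restarts=20):
    best_assignment, best_cost = None, float("inf")
    for _ in range(restarts):
        assignment = [random.randrange(num_gpus) for _ in range(num_nodes)]
        for _ in range(max_iterations):
            improved = False
            for node in range(num_nodes):
                for gpu in range(num_gpus):
                    old = assignment[node]
                    if gpu == old:
                        continue
                    before = cost_fn(assignment)
                    assignment[node] = gpu
                    if cost_fn(assignment) < before:
                        improved = True          # keep the improving move
                    else:
                        assignment[node] = old   # revert
            if not improved:
                break
        cost = cost_fn(assignment)
        if cost < best_cost:
            best_assignment, best_cost = list(assignment), cost
    return best_assignment, best_cost

# Toy cost: count edges between consecutive nodes that cross GPU boundaries.
toy_cost = lambda a: sum(a[i] != a[i + 1] for i in range(len(a) - 1))
print(local_search(num_nodes=12, num_gpus=3, cost_fn=toy_cost))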

Finally, you can run

python3 algo_hierarchical.py

and observe how the hierarchical algorithm performs on a toy "datacenter" example with 201 nodes split over 96 GPUs.
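
The hierarchical idea, roughly: split the model across machines first, then split each machine's share across its local, fast-linked GPUs. Below is a hedged sketch with hypothetical helpers; the real algo_hierarchical.py may weight both levels by measured link speeds.

# Illustrative two-level partitioning; hypothetical helper names and cost model.
def split_evenly(items, k):
    """Split a list of items into k contiguous chunks of near-equal size."""
    chunk, rem = divmod(len(items), k)
    out, start = [], 0
    for i in range(k):
        end = start + chunk + (1 if i < rem else 0)
        out.append(items[start:end])
        start = end
    return out

def hierarchical_partition(layers, machines):
    """machines: list of per-machine GPU counts, e.g. [8, 8, ...]."""
    placement = {}
    # Level 1: coarse split of the layer sequence across machines.
    per_machine = split_evenly(layers, len(machines))
    for m, (machine_layers, gpu_count) in enumerate(zip(per_machine, machines)):
        # Level 2: split each machine's share across its local GPUs.
        for g, gpu_layers in enumerate(split_evenly(machine_layers, gpu_count)):
            for layer in gpu_layers:
                placement[layer] = (m, g)
    return placement

# Toy "datacenter": 201 layers over 12 machines with 8 GPUs each (96 GPUs total).
layers = list(range(201))
print(hierarchical_partition(layers, machines=[8] * 12))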

You can also run the test suite for the algorithms and their helper functions via

pytest test_algos.py

You should see 8 tests passing.

Running the Training

For our method, run:

python main.py

For the baseline (PyTorch FSDP), run:

python naive.py
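
For reference, a minimal sketch of what an FSDP baseline along the lines of naive.py typically looks like (illustrative only; the actual model, optimizer, and launch setup may differ). A script like this would be launched with torchrun, one process per GPU:

# Illustrative single-node FSDP baseline sketch; not the actual naive.py.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # single-node sketch: global rank doubles as local GPU index

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
    ).cuda()
    model = FSDP(model)  # shard parameters, gradients, and optimizer state across ranks

    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()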

About

Automatically partition PyTorch models onto your GPU network topology for full model parallelism & pipelining.
