# Training time benchmark for Machine Learning algorithms
- Clone the repo locally:

  ```shell
  git clone git@github.com:Ludecan/ml_benchmark.git
  ```
- Install pyenv following the instructions here: https://github.com/pyenv/pyenv
- Install poetry following the instructions here: https://python-poetry.org/docs/#installation
- Install Python 3.10.11 using `pyenv` and activate it:

  ```shell
  pyenv install 3.10.11
  pyenv shell 3.10.11
  ```
- Tell poetry to use the freshly installed Python version:

  ```shell
  pyenv which python | xargs poetry env use
  ```
- Configure poetry to keep venvs locally (if you use VSCode, it will detect them out of the box):

  ```shell
  poetry config virtualenvs.in-project true
  ```
- Install dependencies (this will also create the venv):

  ```shell
  poetry install
  ```
Inside the poetry virtualenv, run:

```shell
python regression_benchmark.py
```

Alternatively, `poetry run python regression_benchmark.py` runs the script without activating the venv first.
The main purpose of this benchmark is to compare the training time of different CPU implementations of ML regression algorithms, for datasets of varying sizes, under different hardware configurations. It creates random datasets with varying row and column counts, with all feature values drawn uniformly from the [-10, 10] range, and builds a synthetic target from them using the following generalization of the Rosenbrock function to multiple input dimensions (thanks for it, ChatGPT!):
```
f(x) = Σ [ c * (x[i+1] - x[i]^2)^2 + (1 - x[i])^2 ]   for i in [0, N-2]
```
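As a sketch (not the repo's actual implementation — the function name `rosenbrock_nd` and the default `c = 100.0` from the classic Rosenbrock function are assumptions), the target above can be computed with NumPy as:

```python
import numpy as np

def rosenbrock_nd(X: np.ndarray, c: float = 100.0) -> np.ndarray:
    """Generalized Rosenbrock target, summed over consecutive feature pairs.

    X has shape (n_rows, n_features); returns one target value per row.
    """
    x_i = X[:, :-1]    # x[i]   for i in [0, N-2]
    x_next = X[:, 1:]  # x[i+1] for i in [0, N-2]
    return np.sum(c * (x_next - x_i ** 2) ** 2 + (1.0 - x_i) ** 2, axis=1)
```

As with the classic 2-D Rosenbrock, the global minimum f(x) = 0 sits at x = (1, ..., 1), and every consecutive pair of features interacts through the `(x[i+1] - x[i]^2)^2` term.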
Note that the function was requested to be non-linear and to have interactions between all pairs of consecutive features, allowing non-linear models to show their strengths (and utterly defeating linear models). The accuracy metrics (ME, MAE, RMSE, R^2) are provided for reference, but generalizing these results to other datasets is not advised without proper testing. It is relatively simple to swap the random datasets used in this benchmark for your own dataset if you want to try these models yourself.
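For example, a random dataset matching the description above could be generated roughly like this (a minimal sketch, not the benchmark's own generation code; the sizes, seed, and constant `c` are assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n_rows, n_cols = 1_000, 20
# All features drawn uniformly from the [-10, 10] range
X = rng.uniform(-10.0, 10.0, size=(n_rows, n_cols))

# Synthetic target from the generalized Rosenbrock over consecutive features
c = 100.0
y = np.sum(c * (X[:, 1:] - X[:, :-1] ** 2) ** 2 + (1.0 - X[:, :-1]) ** 2, axis=1)
```

To benchmark your own data instead, replace `X` and `y` with your feature matrix and target vector before fitting the models.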
I'm doing this to compare performance on an i9-13900K system with DDR4 and DDR5 RAM at different speeds, in order to find out how sensitive these algorithms are to RAM bandwidth. You can find the results so far here: https://docs.google.com/spreadsheets/d/1gF8VUrR7Kc7Hc54cRcXFrOKSi_L02epqipe0k0pAlp8/edit?usp=sharing
TODO:
- Dockerize installation to ensure common base libs
- Profile maximum memory usage during execution of each model
- Different targets/noise?