ML benchmark

Training time benchmark for Machine Learning algorithms

Installation steps

  1. Clone the repo locally:
git clone git@github.com:Ludecan/ml_benchmark.git
  2. Install pyenv following the instructions here: https://github.com/pyenv/pyenv
  3. Install poetry following the instructions here: https://python-poetry.org/docs/#installation
  4. Install Python 3.10.11 using pyenv and activate it for the current shell:
pyenv install 3.10.11
pyenv shell 3.10.11
  5. Tell poetry to use the newly installed Python version:
pyenv which python | xargs poetry env use
  6. Configure poetry to keep venvs in the project directory (if you use VSCode, it will detect the venv out of the box):
poetry config virtualenvs.in-project true
  7. Install dependencies (this will also create the venv):
poetry install
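
As a quick sanity check (an extra step, not part of the list above), you can confirm the venv picked up the expected interpreter:

poetry run python -c "import sys; print(sys.version)"

This should print a version string starting with 3.10.11.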

Running the benchmark

Inside the poetry virtualenv run:

python regression_benchmark.py
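
Alternatively, you can let poetry handle the venv activation for you:

poetry run python regression_benchmark.py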

Considerations about results

The main purpose of this benchmark is to compare the training time of different CPU implementations of ML regression algorithms on datasets of varying sizes under different hardware configurations. It creates random datasets with varying numbers of rows and columns, all features being floats in the [-10, 10] range, and builds a synthetic target from them using the following generalization of the Rosenbrock function to multiple input dimensions (thanks for it, ChatGPT!):

f(x) = Σ[ c * (x[i+1] - x[i]^2)^2 + (1 - x[i])^2 ] for i in [0, N-2]
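
In code, a minimal NumPy sketch of this target could look like the following (the function name and the default value of c are illustrative assumptions, not necessarily what regression_benchmark.py uses):

import numpy as np

def generalized_rosenbrock(X, c=100.0):
    # X: (n_samples, n_features) array of features.
    # c: curvature constant; 100 is the classic Rosenbrock value and an
    # assumption here, the benchmark may use a different one.
    # Sums c * (x[i+1] - x[i]^2)^2 + (1 - x[i])^2 over i in [0, N-2].
    return np.sum(
        c * (X[:, 1:] - X[:, :-1] ** 2) ** 2 + (1 - X[:, :-1]) ** 2,
        axis=1,
    )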

Notice the function was requested to be non-linear and to have interactions between every pair of consecutive features, allowing non-linear models to show their strengths (and utterly defeating linear models). The accuracy metrics (ME, MAE, RMSE, R^2) are provided for reference, but generalizing these results to other datasets is not advised without proper testing. It is relatively simple to swap the random datasets used in this benchmark for your own dataset if you want to try these models yourself; see the sketch below.
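
For example, loading a CSV in place of the random data might look like this (the file name, the "target" column name, and the assumption that the script consumes plain X/y NumPy arrays are all hypothetical):

import numpy as np
import pandas as pd

df = pd.read_csv("my_dataset.csv")                # your own dataset
y = df.pop("target").to_numpy(dtype=np.float64)   # hypothetical target column
X = df.to_numpy(dtype=np.float64)                 # remaining columns as features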

Raw results

I'm doing this to compare performance on an i9-13900K system with DDR4 and DDR5 RAM at different speeds, in order to find out how sensitive these algorithms are to RAM bandwidth. You can find the results so far here: https://docs.google.com/spreadsheets/d/1gF8VUrR7Kc7Hc54cRcXFrOKSi_L02epqipe0k0pAlp8/edit?usp=sharing

TODO:

  • Dockerize the installation to ensure a common set of base libraries
  • Profile maximum memory usage during the execution of each model
  • Try different targets/noise levels?
