MVA2016 Data Science Game 2016
Code to train and test models.
'Leaderboard.py' lets you track the best submissions in real time, with the standard deviation, mean, and number of submissions per team.
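As an illustration of the statistics listed above, here is a minimal sketch of a per-team summary. It does not reproduce 'Leaderboard.py': the function name `summarize_by_team`, the (team, score) input format, and the assumption that a higher score is better are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_team(submissions):
    """Return {team: (best, mean, stdev, count)} from (team, score) pairs.

    Assumes a higher score is better (an assumption of this sketch).
    """
    scores = defaultdict(list)
    for team, score in submissions:
        scores[team].append(score)

    summary = {}
    for team, values in scores.items():
        # stdev needs at least two values; report 0.0 for a single submission.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[team] = (max(values), mean(values), spread, len(values))
    return summary


# Hypothetical submissions, for illustration only.
example = [("team_a", 0.60), ("team_b", 0.55), ("team_a", 0.64)]
leaderboard = summarize_by_team(example)
```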
Run the Python notebook 'profiling.ipynb'. This will:
- give insight and statistics about the data
- compute 'fold_train.csv', a 5-fold split that takes the user into account
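One plausible reading of a fold split that takes the user into account is that all rows of a given user land in the same fold, so a model is never validated on a user it was trained on. The sketch below shows that idea with a simple round-robin assignment. This is an assumption, not necessarily what 'profiling.ipynb' does, and `assign_user_folds` is a hypothetical name. scikit-learn's `GroupKFold` provides a comparable grouped split.

```python
def assign_user_folds(user_ids, n_folds=5):
    """Map each row to a fold so that all rows of one user share a fold."""
    fold_of_user = {}
    folds = []
    for user in user_ids:
        if user not in fold_of_user:
            # Round-robin over users, in order of first appearance.
            fold_of_user[user] = len(fold_of_user) % n_folds
        folds.append(fold_of_user[user])
    return folds


# Hypothetical user ids, one per data row.
rows = ["u1", "u2", "u1", "u3", "u4", "u5", "u6", "u2"]
folds = assign_user_folds(rows)
```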
Run the Python notebook 'features.ipynb' on the train, validation, and test values. This will:
- compute the augmented features 'AugmentedFeatures{Train|Test|Priv}.csv'
Run 'lda_topics_learn.ipynb' to get:
- 'lda_alex_5_topics.p'
- 'topics_alex.dict'
Use 'lda_features_generator_traintest.ipynb' or 'lda_features_generator_priv.ipynb' to get the LDA features in the file 'lda_features_5_{train|test|priv}_topics_df.csv'.
Run 'Xgboost_v7.ipynb' several times, changing the parameter 'fold_value' from 0 to 4, to get several models (change the output file for each run).
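The loop over 'fold_value' can be pictured as follows. This sketch assumes that the fold equal to 'fold_value' is held out for validation and the other four are used for training; the notebook may do this differently. The helper `split_by_fold` and the output file names are hypothetical, and no model is actually trained here.

```python
def split_by_fold(rows, folds, fold_value):
    """Return (train_rows, validation_rows) for one held-out fold."""
    train = [r for r, f in zip(rows, folds) if f != fold_value]
    valid = [r for r, f in zip(rows, folds) if f == fold_value]
    return train, valid


# Hypothetical rows and their fold labels.
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]
folds = [0, 1, 2, 3, 4, 0]

runs = {}
for fold_value in range(5):
    train, valid = split_by_fold(rows, folds, fold_value)
    # Hypothetical output name: use a distinct file for every run
    # so that one model does not overwrite another.
    output_file = f"model_fold_{fold_value}.bin"
    runs[output_file] = (train, valid)
```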
Run 'Test_final_prediction.ipynb', specifying the model file, to get 'Y_{train|test|priv}.predict'.
Run 'Bagging_vfinal.ipynb', choosing the result files of the 5 folds.
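A common way to combine the predictions of several fold models is to average them row by row. The sketch below shows that, under the assumption that the bagging step is a plain unweighted mean; 'Bagging_vfinal.ipynb' may weight or combine the models differently. The name `average_predictions` is hypothetical.

```python
def average_predictions(prediction_lists):
    """Average several equally long prediction lists element by element."""
    if not prediction_lists:
        raise ValueError("need at least one list of predictions")
    length = len(prediction_lists[0])
    if any(len(p) != length for p in prediction_lists):
        raise ValueError("all prediction lists must have the same length")
    n_models = len(prediction_lists)
    return [sum(values) / n_models for values in zip(*prediction_lists)]


# Hypothetical predictions from 5 fold models on 2 rows.
fold_predictions = [
    [0.0, 1.0],
    [0.5, 1.0],
    [1.0, 1.0],
    [0.5, 0.0],
    [0.5, 0.5],
]
bagged = average_predictions(fold_predictions)
```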
Run 'PearsonCorrelations.ipynb' on all models (we trained 11 different ones) to choose the 4 least correlated, then run 'Bagging_vfinal.ipynb' to get results using the chosen models.
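The selection step can be sketched as an exhaustive search over model subsets. This is only one possible criterion (lowest mean pairwise Pearson correlation) and an assumption about how "least correlated" is defined; the notebook may select differently. The names `pearson` and `least_correlated` are hypothetical. With 11 models and k=4 there are only 330 subsets, so brute force is cheap.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)


def least_correlated(predictions, k=4):
    """Pick the k model names with the lowest mean pairwise correlation."""
    def mean_corr(names):
        pairs = list(combinations(names, 2))
        total = sum(pearson(predictions[a], predictions[b]) for a, b in pairs)
        return total / len(pairs)

    return min(combinations(sorted(predictions), k), key=mean_corr)


# Hypothetical predictions of 3 models on 4 rows; k=2 keeps the example small.
example = {
    "model_a": [1, 2, 3, 4],
    "model_b": [1, 2, 3, 5],
    "model_c": [1, -1, -1, 1],
}
chosen = least_correlated(example, k=2)
```

The chosen names can then be used to pick the prediction files passed to the bagging step.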