- This repository contains a summary of the code used for our submissions to the SemEval-2022 Task 8 challenge.
- For more details about our approach, see the submitted paper "DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity".
- The "text" field used below corresponds exactly to the "body" field used in the paper.

This repository contains the code to train Single-Field and Multiple-Fields LMs and to get predictions from the fine-tuned models. The notebook stacking.ipynb contains a draft of how to apply stacking to the predictions generated by different fine-tuned models; a sketch of the idea follows below.
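As a hedged illustration of the stacking idea only (the meta-model choice and the synthetic data below are ours, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# stand-ins for per-pair predictions from three fine-tuned models (features)
base_preds = rng.uniform(-1, 1, size=(100, 3))
# stand-in for the gold similarity scores on the same pairs (target)
gold = base_preds.mean(axis=1) + rng.normal(0, 0.05, size=100)

# the meta-model learns how to weight the base models' predictions
meta_model = Ridge().fit(base_preds, gold)
final_scores = meta_model.predict(base_preds)
```

In practice the features would be the per-model prediction columns produced by predict.py on a held-out split, and the target the corresponding gold "Overall" scores.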
- Install the sentence-transformers package
- Install the Trafilatura package
- Install the cChardet package
- Install the EasyNMT package, used for translation (a one-line install for all four packages is sketched below)
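Assuming the packages' current PyPI names, the whole setup is:

```
$ pip install sentence-transformers trafilatura cchardet easynmt
```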
- Download the raw training and evaluation datasets (containing the article URL pairs) from https://competitions.codalab.org/competitions/33835
- Use the official downloader script, available from the task organizers, to scrape the URLs in the train and test datasets. Be sure to keep the two directories separate (pass a different folder as --dump_dir for each run). For each scraped page, one HTML file and one JSON file are created. A sketch of the invocation follows below.
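If the organizers' pip-installable downloader is used (package and module names as distributed for the task; verify them against the official instructions), the invocation looks roughly like:

```
$ pip install semeval_8_2022_ia_downloader
$ python -m semeval_8_2022_ia_downloader.cli --links_file=train.csv --dump_dir=html_train/
$ python -m semeval_8_2022_ia_downloader.cli --links_file=test.csv --dump_dir=html_test/
```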
- From the downloaded HTML files, extract the relevant fields (title, description and body of the article) using Trafilatura. A brief example of how to get the content into a Python dictionary with Trafilatura:

```python
import trafilatura

def extract_content(file_path):
    tmp_diz = {}
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            html_content = f.read()
        # bare_extraction returns the extracted fields as a Python dictionary
        tmp_diz = trafilatura.bare_extraction(html_content, output_format='json', favor_precision=True)
    except ValueError:
        pass  # manage extraction errors
    except UnicodeDecodeError:
        pass  # manage the case in which the file is not encoded in UTF-8
    return tmp_diz
```
See https://trafilatura.readthedocs.io/en/latest/index.html for more details. Consider using cChardet to extract content from files encoded with legacy standards like Windows-1254 or Windows-1256, as sketched below.
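A minimal sketch of combining cChardet with the extraction above (the helper name is ours; cchardet.detect returns a dictionary with the guessed encoding and a confidence score):

```python
import cchardet
import trafilatura

def extract_content_any_encoding(file_path):
    # read raw bytes and let cChardet guess the encoding
    with open(file_path, 'rb') as f:
        raw = f.read()
    detected = cchardet.detect(raw)  # e.g. {'encoding': 'WINDOWS-1256', 'confidence': 0.99}
    encoding = detected['encoding'] or 'utf-8'
    html_content = raw.decode(encoding, errors='replace')
    return trafilatura.bare_extraction(html_content, favor_precision=True)
```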
- Starting from the datasets containing the URLs and the data extracted with Trafilatura, create the actual train and test datasets. They need to have the following columns: "pair_id", "url1_lang", "url2_lang", "title1", "title2", "description1", "description2", "text1", "text2". The training dataset will also have an "Overall" column representing the gold similarity.
- Create a translated version of the datasets using the translate.ipynb notebook. A minimal sketch of the underlying EasyNMT call follows below.
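The notebook's internals are not reproduced here; this is only a sketch of EasyNMT usage for translating article fields into English (the choice of the 'opus-mt' model is an assumption):

```python
from easynmt import EasyNMT

model = EasyNMT('opus-mt')
# translate a batch of fields to English; the source language is auto-detected
translated_titles = model.translate(["Guten Morgen", "Buongiorno"], target_lang='en')
```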
- Manage empty/NaN fields by replacing missing descriptions with texts and missing titles with descriptions, e.g. as sketched below.
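For instance, with pandas (column names as defined above; the file path is illustrative):

```python
import pandas as pd

df = pd.read_csv("datasets/train/multilingual_train.csv")
for side in ("1", "2"):
    # fall back from description to body, then from title to description
    df[f"description{side}"] = df[f"description{side}"].fillna(df[f"text{side}"])
    df[f"title{side}"] = df[f"title{side}"].fillna(df[f"description{side}"])
```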
- Run the split_datasets.ipynb notebook to split the data into training and validation sets.
- To train Single-Field and Multiple-Fields LMs, use the training.py script, which takes the following parameters:
  - file_name (file to use for training; it must be located under the /datasets/train/ folder; the split versions created by the split step above are assumed to exist)
  - l (language of the input data: "en" for English, "multi" for multilingual)
  - bs (batch size to use; default is 8)
  - epochs (training epochs; default is 3)
  - max_len (max length used for tokenization; default is 256)
  - ws (warmup steps, expressed as a fraction of the training iterations; default is 10% (0.10))
  - tf_1 (first training field)
  - tf_2 (second training field, optional)
  - tf_3 (third training field, optional)
An example:

```
$ python3 ./training.py -file_name file_name.csv -l en -tf_1 title -tf_2 description -bs 8 -epochs 1 -max_len 256 -ws 0.1
```
Trained models are saved under /output/file_name/sentence-transformers/model_folder, where model_folder is marked with a timestamp and the fields used for training.
- To create predictions from the fine-tuned models, run the predict.py script, which takes the following parameters:
  - predict_on (complete path to the CSV file on which to make predictions)
  - model_dir (complete path to the directory of the trained model to be used for the predictions)
An example:

```
$ python3 ./predict.py -predict_on datasets/test/multilingual_test.csv -model_dir output/translated_train/sentence-transformers/all-mpnet-base-v2_28-02-2022__14:22:05_title_description
```
Predictions are saved under ./predictions as CSV files named after the input file and the fields used for training (e.g. translated_test_title_description.csv). Predicted values are in the range [-1, 1]. Be sure to use the right model for the predictions (i.e., the multilingual or the English fine-tuned model, according to the language of the input file you provide).
An example of how the file structure should look:

```
.
├── project directory
│   ├── datasets
│   │   ├── train
│   │   │   ├── multilingual_train.csv
│   │   │   ├── translated_train.csv
│   │   │   ├── multilingual_train_X_train.csv
│   │   │   ├── multilingual_train_X_val.csv
│   │   │   ├── multilingual_train_y_train.csv
│   │   │   ├── multilingual_train_y_val.csv
│   │   │   ├── translated_train_X_train.csv
│   │   │   ├── translated_train_X_val.csv
│   │   │   ├── translated_train_y_train.csv
│   │   │   └── translated_train_y_val.csv
│   │   └── test
│   │       ├── multilingual_test.csv
│   │       └── translated_test.csv
│   ├── output
│   │   ├── multilingual_train
│   │   │   └── sentence-transformers
│   │   │       └── paraphrase-multilingual-mpnet-base-v2_28-02-2022__18:32:36_description
│   │   └── translated_train
│   │       └── sentence-transformers
│   │           └── all-mpnet-base-v2_28-02-2022__14:22:05_title_description
│   ├── predictions
│   │   └── translated_test_title_description.csv
│   ├── training.py
│   ├── predict.py
│   ├── split_datasets.py
│   ├── translate.py
│   ├── stacking.ipynb
│   ├── SbertTrainer.py
│   ├── MultipleFieldsNet.py
│   ├── ExtendedEmbeddingSimilarityEvaluator.py
│   ├── MultipleFieldsExtendedEmbeddingSimilarityEvaluator.py
│   ├── MultipleFieldsCosineSimilarityLoss.py
│   └── README.md
```
- SbertTrainer.py is where the actual training is done.
- ExtendedEmbeddingSimilarityEvaluator.py and MultipleFieldsExtendedEmbeddingSimilarityEvaluator.py are custom evaluators used to evaluate performance during training.
- MultipleFieldsCosineSimilarityLoss.py is the loss used for Multiple-Fields models.
training.py, predict.py, translate.py and split_datasets.py contain a configurable parameter:
- PROJECT_PATH --> absolute path to the main directory of the project. Set it to /gdrive/My Drive/project_folder if you are using Google Colab.
training.py also contains:
- SCORE_MIN --> rescaled minimum value used for similarity (we use -0.1)
- SCORE_MAX --> rescaled maximum value used for similarity (we use 1)
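As an illustration only, a generic min-max rescaling of a gold score into [SCORE_MIN, SCORE_MAX] (the source range and any inversion of the scale applied inside training.py are assumptions, not shown here):

```python
SCORE_MIN, SCORE_MAX = -0.1, 1.0

def rescale(overall, old_min=1.0, old_max=4.0):
    # linear map from [old_min, old_max] to [SCORE_MIN, SCORE_MAX]
    return SCORE_MIN + (overall - old_min) * (SCORE_MAX - SCORE_MIN) / (old_max - old_min)
```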
Coming soon