- This repository contains a summary of the code used for our submissions to the SemEval-2022 Task 8 challenge.
- For more details about our approach, see the submitted paper "DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity".
- The "text" field used below corresponds exactly to the "body" field used in the paper.

This repository contains the code to train Single-Field and Multiple-Fields LMs and to get predictions from the fine-tuned models. The notebook stacking.ipynb contains a draft of how to apply stacking to the predictions generated by different fine-tuned models; a sketch of the idea follows below.
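As a hedged illustration of the stacking idea only (the meta-model choice and the synthetic data below are ours, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# stand-ins for per-pair predictions from three fine-tuned models (features)
base_preds = rng.uniform(-1, 1, size=(100, 3))
# stand-in for the gold similarity scores on the same pairs (target)
gold = base_preds.mean(axis=1) + rng.normal(0, 0.05, size=100)

# the meta-model learns how to weight the base models' predictions
meta_model = Ridge().fit(base_preds, gold)
final_scores = meta_model.predict(base_preds)
```

In practice the features would be the per-model prediction columns produced by predict.py on a held-out split, and the target the corresponding gold "Overall" scores.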
- Install the sentence-transformers package
- Install the Trafilatura package
- Install the cChardet package
- Install the EasyNMT package, used for translation (a one-line install for all four packages is sketched below)
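Assuming the packages' current PyPI names, the whole setup is:

```
$ pip install sentence-transformers trafilatura cchardet easynmt
```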
- Download the raw training and evaluation datasets (containing the article URL pairs) from https://competitions.codalab.org/competitions/33835
- Use the official downloader script, available from the task organizers, to scrape the URLs in the train and test datasets. Be sure to keep the two directories separate (pass a different folder as --dump_dir for each run). For each scraped page, one HTML file and one JSON file are created. A sketch of the invocation follows below.
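If the organizers' pip-installable downloader is used (package and module names as distributed for the task; verify them against the official instructions), the invocation looks roughly like:

```
$ pip install semeval_8_2022_ia_downloader
$ python -m semeval_8_2022_ia_downloader.cli --links_file=train.csv --dump_dir=html_train/
$ python -m semeval_8_2022_ia_downloader.cli --links_file=test.csv --dump_dir=html_test/
```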
- From the downloaded HTML files, extract the relevant fields (title, description and body of the article) using Trafilatura. A brief example of how to get the content into a Python dictionary with Trafilatura:

```python
import trafilatura

def extract_content(file_path):
    tmp_diz = {}
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            html_content = f.read()
        # bare_extraction returns the extracted fields as a Python dictionary
        tmp_diz = trafilatura.bare_extraction(html_content, output_format='json', favor_precision=True)
    except ValueError:
        pass  # manage extraction errors
    except UnicodeDecodeError:
        pass  # manage the case in which the file is not encoded in UTF-8
    return tmp_diz
```
See https://trafilatura.readthedocs.io/en/latest/index.html for more details. Consider using cChardet to extract content from files encoded with legacy standards like Windows-1254 or Windows-1256, as sketched below.
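A minimal sketch of combining cChardet with the extraction above (the helper name is ours; cchardet.detect returns a dictionary with the guessed encoding and a confidence score):

```python
import cchardet
import trafilatura

def extract_content_any_encoding(file_path):
    # read raw bytes and let cChardet guess the encoding
    with open(file_path, 'rb') as f:
        raw = f.read()
    detected = cchardet.detect(raw)  # e.g. {'encoding': 'WINDOWS-1256', 'confidence': 0.99}
    encoding = detected['encoding'] or 'utf-8'
    html_content = raw.decode(encoding, errors='replace')
    return trafilatura.bare_extraction(html_content, favor_precision=True)
```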
- Starting from the datasets containing the URLs and the data extracted with Trafilatura, create the actual train and test datasets. They need to have the following columns: "pair_id", "url1_lang", "url2_lang", "title1", "title2", "description1", "description2", "text1", "text2". The training dataset will also have an "Overall" column representing the gold similarity.
- Create a translated version of the datasets using the translate.ipynb notebook. A minimal sketch of the underlying EasyNMT call follows below.
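The notebook's internals are not reproduced here; this is only a sketch of EasyNMT usage for translating article fields into English (the choice of the 'opus-mt' model is an assumption):

```python
from easynmt import EasyNMT

model = EasyNMT('opus-mt')
# translate a batch of fields to English; the source language is auto-detected
translated_titles = model.translate(["Guten Morgen", "Buongiorno"], target_lang='en')
```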
- Manage empty/NaN fields by replacing missing descriptions with texts and missing titles with descriptions, e.g. as sketched below.
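For instance, with pandas (column names as defined above; the file path is illustrative):

```python
import pandas as pd

df = pd.read_csv("datasets/train/multilingual_train.csv")
for side in ("1", "2"):
    # fall back from description to body, then from title to description
    df[f"description{side}"] = df[f"description{side}"].fillna(df[f"text{side}"])
    df[f"title{side}"] = df[f"title{side}"].fillna(df[f"description{side}"])
```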
- Run the split_datasets.ipynb notebook to split the data into training and validation sets.
- To train Single-Field and Multiple-Fields LMs, use the training.py script, which takes the following parameters:
  - file_name (file to use for training; it must be located under the /datasets/train/ folder; the split versions created by the split step above are assumed to exist)
  - l (language of the input data: "en" for English, "multi" for multilingual)
  - bs (batch size to use; default is 8)
  - epochs (training epochs; default is 3)
  - max_len (max length used for tokenization; default is 256)
  - ws (warmup steps, expressed as a fraction of the training iterations; default is 10% (0.10))
  - tf_1 (first training field)
  - tf_2 (second training field, optional)
  - tf_3 (third training field, optional)
An example:

```
$ python3 ./training.py -file_name file_name.csv -l en -tf_1 title -tf_2 description -bs 8 -epochs 1 -max_len 256 -ws 0.1
```
Trained models are saved under /output/file_name/sentence-transformers/model_folder, where model_folder is marked with a timestamp and the fields used for training.
- To create predictions from the fine-tuned models, run the predict.py script, which takes the following parameters:
  - predict_on (complete path to the CSV file on which to make predictions)
  - model_dir (complete path to the directory of the trained model to be used for the predictions)
An example:

```
$ python3 ./predict.py -predict_on datasets/test/multilingual_test.csv -model_dir output/translated_train/sentence-transformers/all-mpnet-base-v2_28-02-2022__14:22:05_title_description
```
Predictions are saved under ./predictions as CSV files named after the input file and the fields used for training (e.g. translated_test_title_description.csv). Predicted values are in the range [-1, 1]. Be sure to use the right model for the predictions (i.e., the multilingual or the English fine-tuned model, according to the language of the input file you provide).
An example of how the file structure should look:

```
.
├── project directory
│   ├── datasets
│   │   ├── train
│   │   │   ├── multilingual_train.csv
│   │   │   ├── translated_train.csv
│   │   │   ├── multilingual_train_X_train.csv
│   │   │   ├── multilingual_train_X_val.csv
│   │   │   ├── multilingual_train_y_train.csv
│   │   │   ├── multilingual_train_y_val.csv
│   │   │   ├── translated_train_X_train.csv
│   │   │   ├── translated_train_X_val.csv
│   │   │   ├── translated_train_y_train.csv
│   │   │   └── translated_train_y_val.csv
│   │   └── test
│   │       ├── multilingual_test.csv
│   │       └── translated_test.csv
│   ├── output
│   │   ├── multilingual_train
│   │   │   └── sentence-transformers
│   │   │       └── paraphrase-multilingual-mpnet-base-v2_28-02-2022__18:32:36_description
│   │   └── translated_train
│   │       └── sentence-transformers
│   │           └── all-mpnet-base-v2_28-02-2022__14:22:05_title_description
│   ├── predictions
│   │   └── translated_test_title_description.csv
│   ├── training.py
│   ├── predict.py
│   ├── split_datasets.py
│   ├── translate.py
│   ├── stacking.ipynb
│   ├── SbertTrainer.py
│   ├── MultipleFieldsNet.py
│   ├── ExtendedEmbeddingSimilarityEvaluator.py
│   ├── MultipleFieldsExtendedEmbeddingSimilarityEvaluator.py
│   ├── MultipleFieldsCosineSimilarityLoss.py
│   └── README.md
```
- SbertTrainer.py is where the actual training is done.
- ExtendedEmbeddingSimilarityEvaluator.py and MultipleFieldsExtendedEmbeddingSimilarityEvaluator.py are custom evaluators used to evaluate performance during training.
- MultipleFieldsCosineSimilarityLoss.py is the loss used for Multiple-Fields models.
training.py, predict.py, translate.py and split_datasets.py contain a configurable parameter:
- PROJECT_PATH --> absolute path to the main directory of the project. Set it to /gdrive/My Drive/project_folder if you are using Google Colab.
training.py also contains:
- SCORE_MIN --> rescaled minimum value used for similarity (we use -0.1)
- SCORE_MAX --> rescaled maximum value used for similarity (we use 1)
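As an illustration only, a generic min-max rescaling of a gold score into [SCORE_MIN, SCORE_MAX] (the source range and any inversion of the scale applied inside training.py are assumptions, not shown here):

```python
SCORE_MIN, SCORE_MAX = -0.1, 1.0

def rescale(overall, old_min=1.0, old_max=4.0):
    # linear map from [old_min, old_max] to [SCORE_MIN, SCORE_MAX]
    return SCORE_MIN + (overall - old_min) * (SCORE_MAX - SCORE_MIN) / (old_max - old_min)
```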
Coming soon