
MultilingualNewsArticleSimilarity

  • This repository contains a summary of the code used for our submissions to the SemEval-2022 Task 8 challenge.
  • For more details about our approach, see our paper DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity.
  • The "text" field used below is exactly the "body" field used in the paper.

Index

  • About
  • Installation
  • Usage
  • File Structure
  • Details
  • Credits

About

This repository contains the code to train Single-Field and Multiple-Fields LMs and to get predictions from the resulting fine-tuned models. The notebook stacking.ipynb contains a draft of how to apply stacking starting from the predictions generated by different fine-tuned models; a minimal sketch of the idea follows.
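A minimal sketch of the stacking idea (the csv paths, the "pred" column name, and the Ridge meta-learner are illustrative assumptions, not the exact setup of stacking.ipynb):

import pandas as pd
from sklearn.linear_model import Ridge

# Predictions produced by different fine-tuned models (hypothetical paths);
# each csv is assumed to hold one prediction per article pair in a "pred" column.
base_files = ['predictions/translated_train_title.csv', 'predictions/translated_train_description.csv']
X = pd.concat(
    [pd.read_csv(f)['pred'].rename(f'model_{i}') for i, f in enumerate(base_files)],
    axis=1,
)
# Gold similarity scores aligned with the rows of the prediction files
y = pd.read_csv('datasets/train/translated_train.csv')['Overall']

meta_model = Ridge()               # illustrative meta-learner
meta_model.fit(X, y)               # learn how to combine the base predictions
stacked_predictions = meta_model.predict(X)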

Installation

Usage

  1. Download the raw training and test datasets available at https://competitions.codalab.org/competitions/33835
  2. Run the official script to scrape the urls in the train and test datasets. Be sure to keep the two directories separate (use a different folder as --dump_dir for each). For each scraped page, one html and one json file are created.
  3. From the downloaded html files, extract the relevant fields (such as the title, description, and body of the article) using Trafilatura. A brief example of how to get the content into a Python dictionary with Trafilatura:
import trafilatura

def extract_content(file_path):
    tmp_diz = {}
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            html_content = f.read()
            try:
                tmp_diz = trafilatura.bare_extraction(html_content, output_format='json', favor_precision=True)
            except ValueError:
                # manage extraction errors
                pass
    except UnicodeDecodeError:
        # manage the case in which the file is not encoded in utf-8
        # (see the cChardet note below)
        pass
    return tmp_diz

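For example (the file path here is hypothetical), the returned dictionary exposes the fields used in the next steps:

content = extract_content('dump_dir_train/page_0.html')
if content:
    title = content.get('title')
    description = content.get('description')
    body = content.get('text')  # the "text"/body field used in the paper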
See https://trafilatura.readthedocs.io/en/latest/index.html for more details.

Consider using cChardet to extract content from files encoded with legacy standards like Windows-1254 or Windows-1256, as in the sketch below.
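A minimal sketch of how the encoding could be detected with cChardet before handing the HTML to Trafilatura (the helper name is ours):

import cchardet

def read_with_detected_encoding(file_path):
    # Read raw bytes, detect the encoding, then decode accordingly
    with open(file_path, 'rb') as f:
        raw = f.read()
    detected = cchardet.detect(raw)  # e.g. {'encoding': 'WINDOWS-1256', 'confidence': 0.99}
    encoding = detected.get('encoding') or 'utf-8'
    return raw.decode(encoding, errors='replace')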

  4. Starting from the datasets containing the urls and the data extracted with Trafilatura, create the actual train and test datasets. They must have the following columns: "pair_id", "url1_lang", "url2_lang", "title1", "title2", "description1", "description2", "text1", "text2". The training dataset will also have an "Overall" column containing the gold similarity score.
  5. Create a translated version of the datasets using the translate.py script.
  6. Manage empty/NaN fields by replacing missing descriptions with texts and missing titles with descriptions (see the sketch after this list).
  7. Run the split_datasets.py script to split the data into training/validation sets.
  8. To train Single-Field LMs and Multiple-Fields LMs, use the training.py script, which takes the following parameters:
    • file_name (file to use for training. It must be located under the /datasets/train/ folder. The existence of the split versions created at step 7 is assumed)
    • l (Language of the input data: "en" for English, "multi" for multilingual)
    • bs (Batch size to use. Default is 8)
    • epochs (Training epochs. Default is 3)
    • max_len (Max length used for tokenization. Default is 256)
    • ws (Warmup steps expressed as a percentage of the train iterations. Default is 10% (0.10))
    • tf_1 (First training field)
    • tf_2 (Second training field, optional)
    • tf_3 (Third training field, optional)
      An example:
$ python3 ./training.py -file_name file_name.csv -l en -tf_1 title -tf_2 description -bs 8 -epochs 1 -max_len 256 -ws 0.1
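
The sketch referenced in step 6 (assuming the datasets are loaded as pandas DataFrames with the columns listed in step 4):

import pandas as pd

def fill_missing_fields(df: pd.DataFrame) -> pd.DataFrame:
    # Replace missing descriptions with the article bodies,
    # then missing titles with the (possibly filled) descriptions
    for i in ('1', '2'):
        df['description' + i] = df['description' + i].fillna(df['text' + i])
        df['title' + i] = df['title' + i].fillna(df['description' + i])
    return df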

Trained models are saved under /output/file_name/sentence-transformers/model_folder, where model_folder is marked with a timestamp and the fields used for training.

  9. To create the predictions starting from the fine-tuned models, run the predict.py script, which takes the following parameters:

  • predict_on (complete path to the csv file on which to make predictions)
  • model_dir (complete path to the directory of the trained model to be used for the predictions)

    An example:
$ python3 ./predict.py -predict_on datasets/test/multilingual_test.csv -model_dir output/translated_train/sentence-transformers/all-mpnet-base-v2_28-02-2022__14:22:05_title_description

Predictions are saved under ./predictions as csv files named file_name_fields_used_for_training.csv (e.g. translated_test_title_description.csv).
Predicted values are in the range [-1, 1].
Be sure to use the right model for the predictions (i.e. the multilingual or English fine-tuned model, according to the language of the input file you provide). A sketch of how predictions can be scored follows.
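
A hedged sketch of scoring predictions against gold labels with Pearson correlation, the official metric of SemEval-2022 Task 8 (the "pred" column name and the presence of gold "Overall" scores in the test csv are assumptions):

import pandas as pd
from scipy.stats import pearsonr

preds = pd.read_csv('predictions/translated_test_title_description.csv')
gold = pd.read_csv('datasets/test/multilingual_test.csv')
merged = preds.merge(gold, on='pair_id')  # align predictions and gold scores by pair id
r, _ = pearsonr(merged['pred'], merged['Overall'])
print(f'Pearson correlation: {r:.3f}')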

File Structure

An example of how the file structure should look:

.
├── project directory
│   ├── datasets
│   │   ├── train
│   │   │   ├── multilingual_train.csv
│   │   │   ├── translated_train.csv
│   │   │   ├── multilingual_train_X_train.csv
│   │   │   ├── multilingual_train_X_val.csv
│   │   │   ├── multilingual_train_y_train.csv
│   │   │   ├── multilingual_train_y_val.csv
│   │   │   ├── translated_train_X_train.csv
│   │   │   ├── translated_train_X_val.csv
│   │   │   ├── translated_train_y_train.csv
│   │   │   ├── translated_train_y_val.csv
│   │   ├── test
│   │   │   ├── multilingual_test.csv
│   │   │   └── translated_test.csv
│   │   ├── output
│   │   │   ├── multilingual_train
│   │   │   │   ├── sentence-transformers
│   │   │   │   │   ├── paraphrase-multilingual-mpnet-base-v2_28-02-2022__18:32:36_description
│   │   │   ├── translated_train
│   │   │   │   ├── sentence-transformers
│   │   │   │   │   ├── all-mpnet-base-v2_28-02-2022__14:22:05_title_description
│   │   ├── predictions
│   │   │   ├── translated_test_title_description.csv
├── training.py
├── predict.py
├── split_datasets.py
├── translate.py
├── stacking.ipynb
├── SbertTrainer.py
├── MultipleFieldsNet.py
├── ExtendedEmbeddingSimilarityEvaluator.py
├── MultipleFieldsExtendedEmbeddingSimilarityEvaluator.py
├── MultipleFieldsCosineSimilarityLoss.py
└── README.md
  • SbertTrainer.py is where the actual training is done.
  • ExtendedEmbeddingSimilarityEvaluator.py and MultipleFieldsExtendedEmbeddingSimilarityEvaluator.py are custom evaluators used to evaluate performance during training.
  • MultipleFieldsCosineSimilarityLoss.py is the loss used for MultipleFields models.

Details

training.py, predict.py, translate.py and split_datasets.py contain a configurable parameter:

  • PROJECT_PATH --> absolute path to the main directory of the project. Set it to /gdrive/My Drive/project_folder if you are using Google Colab.
    training.py also contains:
  • SCORE_MIN --> rescaled minimum value used for similarity (we use -0.1)
  • SCORE_MAX --> rescaled maximum value used for similarity (we use 1)
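
For illustration, one possible linear rescaling from the 1-4 "Overall" scale into [SCORE_MIN, SCORE_MAX] (the direction of the mapping, i.e. whether 1 or 4 lands on SCORE_MAX, is an assumption and may differ in training.py):

SCORE_MIN, SCORE_MAX = -0.1, 1.0

def rescale(overall, old_min=1.0, old_max=4.0):
    # Map a gold score linearly from [old_min, old_max] into [SCORE_MIN, SCORE_MAX]
    normalized = (overall - old_min) / (old_max - old_min)
    return SCORE_MIN + normalized * (SCORE_MAX - SCORE_MIN)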

Credits

Coming soon
