
SILO Language Models

This repository contains the original implementation of SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore by Sewon Min🌟, Suchin Gururangan🌟, Eric Wallace, Hannaneh Hajishirzi, Noah A. Smith, and Luke Zettlemoyer (🌟: equal contribution).

If you find our code, data, models, or the paper useful, please cite the paper:

@article{silo,
    title={{SILO} Language Models: Isolating Legal Risk in a Nonparametric Datastore},
    author={Min, Sewon and Gururangan, Suchin and Wallace, Eric and Hajishirzi, Hannaneh and Smith, Noah and Zettlemoyer, Luke},
    year={2023},
    journal={arXiv preprint arXiv:2308.04430},
    url={https://arxiv.org/abs/2308.04430}
}

For any questions about the code, data, models, or the paper, please open an issue or contact the first authors.

Contents

  1. Quick links
  2. Training
  3. Inference
    1. Preparation
    2. Parametric-only language model
    3. kNN-LM
    4. Retrieve-in-context LM

Quick links

You can access the Open License Corpus and the pretrained SILO models on Hugging Face 🤗.

Training

We use OpenLM, a new model training library, to train SILO LMs. Stay tuned for a link to that repo!

Inference

We provide a script to run the parametric LM, kNN-LM, and RIC-LM using the HF models. By default, all outputs are saved under out.

Preparation

Installation

The code was tested with python 3.9.

conda create -n silo python=3.9
conda activate silo
pip install -r requirements.txt

To run kNN-LM, run the following as well.

conda install -c pytorch -c nvidia faiss-gpu=1.7.4 mkl=2021

To run RIC-LM, run the following as well.

pip install pyserini
conda install -c conda-forge openjdk=11

Download Data

Note: in order to process MIMIC_III, you first need to go through the official approval process on its website. Once approved, you will have access to NOTEEVENTS.csv, which you should place under data/.

python scripts/download_data.py --subset the-pile
python scripts/download_data.py --subset cc-news
python scripts/download_data.py --subset MIMIC_III
python scripts/download_data.py --subset amazon

You can optionally specify --split to indicate which of train, val, and test to download, separated by commas. The default value is train,val,test.

The training data is only required for building a datastore, so if you are mainly interested in parametric LMs, you do not need to download it. In that case, simply specify --split val,test, as in the example below.
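For instance, combining the flags above, a download command for evaluating the parametric LM only might look like:

python scripts/download_data.py --subset the-pile --split val,test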

For the Pile, in order to release the minimal data needed to reproduce the experiments in the paper, we release the training data for the five domains used in the paper, i.e., Wikipedia_(en), NIH_ExPorter, Books3, Github, and Enron_Emails. Reproducing experiments on other domains requires a deduplication script, because we additionally filtered the training data against the val/test data of the Pile; we will release this script in the future.

Tokenization

You should first tokenize the data by running the following.

PYTHONPATH=. python scripts/main.py \
    --task tokenize \
    --split {val|test|train} \
    --subset "FreeLaw,Gutenberg_(PG_19),HackerNews,Github,NIH_ExPorter,PhilPapers,Wikipedia_(en),cc-news,BookCorpus2,Books3,OpenWebText2,Enron_Emails,amazon,MIMIC_III" \
    --lm pythia-1.4B

Pass multiple subsets (domains) to --subset by separating them with commas. You do not need to tokenize the training data if you are mainly interested in evaluating the parametric LM; see the example below.
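For example, a minimal tokenization run over only the validation split of a single domain (using the same flags as above) could look like:

PYTHONPATH=. python scripts/main.py \
    --task tokenize \
    --split val \
    --subset Enron_Emails \
    --lm pythia-1.4B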

Parametric-only language model

Note: Please specify --lm pythia-1.4B for the Pythia baseline, and --lm silo-pd-1.3b, --lm silo-pdsw-1.3b, or --lm silo-pdswby-1.3b for SILO trained on the PD subset, the PDSW subset, or all data (PDSWBY), respectively. From here on, we use --lm silo-pdsw-1.3b in the example commands, but the same commands work for all models. All four models share the same tokenizer, so it is not necessary to run tokenization separately for each model.

To encode & get LM perplexity, run the following:

PYTHONPATH=. python scripts/main.py \
    --task encode \
    --split {val|test} \
    --max_seq_length 1024 \
    --stride 512 \
    --subset "FreeLaw,Gutenberg_(PG_19),HackerNews,Github,NIH_ExPorter,PhilPapers,Wikipedia_(en),cc-news,BookCorpus2,Books3,OpenWebText2,Enron_Emails,amazon,MIMIC_III" \
    --lm silo-pdsw-1.3b

Once perplexity computation is done, the values are saved to a file. If you run the same command again, the script simply reads that file instead of re-running the model.

  • Specify --batch_size to adjust the batch size. Note that the maximum batch size that fits on the GPU may differ between Pythia and SILO (SILO can usually take a 2-3x larger batch size).
  • You can also evaluate your own HF model: place it under ckpt/{your_hf_model_name} and specify --lm {your_hf_model_name}.
  • By default, the script saves both LM perplexity and embeddings from the model, which are needed for the nonparametric LMs. If you do not plan to use a nonparametric LM, you can skip saving embeddings by specifying --skip_embed; see the example after this list.
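For illustration, a parametric-only evaluation that adjusts the batch size and skips saving embeddings could be run as follows (all flags are described above; the batch size value is just an example):

PYTHONPATH=. python scripts/main.py \
    --task encode \
    --split val \
    --max_seq_length 1024 \
    --stride 512 \
    --subset Enron_Emails \
    --lm silo-pdsw-1.3b \
    --batch_size 16 \
    --skip_embed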

kNN-LM

Here is a quick tutorial using Enron_Emails as an example domain. The following commands build a datastore from the Enron_Emails training data and then run kNN-LM evaluation.

# First, get embeddings from the training data.
PYTHONPATH=. python scripts/main.py \
    --task encode \
    --split train \
    --max_seq_length 1024 \
    --stride 512 \
    --subset Enron_Emails \
    --lm silo-pdsw-1.3b
# Then, build the FAISS index and run evaluation.
PYTHONPATH=. python scripts/main.py \
    --task inference \
    --split train \
    --val_split {val|test} \
    --max_seq_length 1024 \
    --stride 512 \
    --subset Enron_Emails \
    --lm silo-pdsw-1.3b
  • You can specify --approximate for approximate nearest-neighbor search, which significantly speeds up inference with little drop in performance. In the paper, we did not use --approximate for Enron_Emails, but we used it for the other domains, whose datastores are much larger.
  • You can specify --probe (default: 8). A smaller value linearly speeds up inference with little drop in performance. In the paper, we use 8.
  • You can specify --do_subset to evaluate on a subset of 1.024M tokens, which is often enough to give stable PPL values. An example command combining these flags follows this list.
  • If you run the same command again, it will load the saved result file and display the results without re-running kNN-LM.
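For illustration, an inference command combining these options (all flags are described above; the --probe value is just an example) could look like:

PYTHONPATH=. python scripts/main.py \
    --task inference \
    --split train \
    --val_split val \
    --max_seq_length 1024 \
    --stride 512 \
    --subset Enron_Emails \
    --lm silo-pdsw-1.3b \
    --approximate \
    --probe 8 \
    --do_subset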

For other datasets