This repository contains the code for various hackathon efforts to detect personally identifiable information in large language datasets, and in particular BigScience's datasets.
- Clone this repo and cd into it
- Install conda or miniconda
- Run
conda env create -f environment.yml
This may take several minutes to install all dependencies and download models.
- Run
conda activate pii
to activate the conda environment.
- Now you can run
python3 test_regex.py -target_lang en
to test the regex for English. Other commands forthcoming!
- You can define your own regex rules in your Python code as follows:
from test_regex import apply_rules
infile = "<your infile such as en.jsonl>"
outfile = "<your outputfile>"
rulebase = [...] # see description below
target_lang = "<your lang>"
right, wrong = apply_rules(infile, outfile, rulebase, target_lang)
A rulebase is an ordered list of rule groups, together with the number of times to apply each rule group. A rule group is an ordered list of rules of the form (new_label, regex, old_label, before text, after text). A rule matches only if regex, old_label, before text, and after text all match.
- new_label is the label used to tag the matching text.
- regex is the regular expression to match.
- old_label is the label that this text was previously tagged with. None means this test is skipped.
- before text is some text that must appear before the matching pattern. None means this test is skipped.
- after text is some text that must appear after the matching pattern. None means this test is skipped.
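To make the rule format concrete, here is a minimal, self-contained sketch of how such a rulebase could be written and evaluated. It does not use the repository's apply_rules. The function apply_rulebase, the window size, the labels, and the patterns are all hypothetical, and pairing each rule group with its repeat count as a (rule_group, times) tuple is an assumption about the layout.

```python
import re

# Hypothetical rules in the (new_label, regex, old_label, before_text, after_text) form.
# Labels and patterns are illustrative only, not taken from the repository.
email_rule = ("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.-]+", None, None, None)
phone_rule = ("PHONE", r"\+?\d[\d\s-]{7,}\d", None, "call", None)

# Assumed layout: each entry pairs a rule group with the number of times to apply it.
rulebase = [([email_rule, phone_rule], 1)]


def apply_rulebase(text, rulebase, window=20):
    """Return a dict mapping (start, end) character spans to labels.

    Sketch of the matching semantics described above: a rule fires only
    if the regex matches and every non-None test (old_label, before text,
    after text) also passes. The context window size is an assumption.
    """
    spans = {}
    for rule_group, times in rulebase:
        for _ in range(times):
            for new_label, regex, old_label, before, after in rule_group:
                for m in re.finditer(regex, text):
                    key = (m.start(), m.end())
                    # old_label test: the span must already carry that label.
                    if old_label is not None and spans.get(key) != old_label:
                        continue
                    # before/after tests: look in a small window around the match.
                    left = text[max(0, m.start() - window):m.start()]
                    right = text[m.end():m.end() + window]
                    if before is not None and before not in left:
                        continue
                    if after is not None and after not in right:
                        continue
                    spans[key] = new_label
    return spans


text = "Please call +1 415-555-0100 or write to jane@example.com today."
for (start, end), label in apply_rulebase(text, rulebase).items():
    print(label, text[start:end])
```

Because later rule groups can test old_label, an earlier group can assign a coarse label that a later group refines, which is one plausible reason the groups are ordered and can be applied more than once.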
- Run
docker build -t aisc-pii .
to build the Docker image.
- Run
docker run aisc-pii
to run the container. Currently it calls python3 test_regex.py -target_lang=en
- You will see the output after a minute or two!
[LEADS - PUT YOUR GITHUB HANDLES HERE]
- Hindi
- Farsi
- Mandarin: @ianyu93
- Vietnamese
- Russian
- Portuguese
- English
- Swahili
- Yoruba
- Arabic
- Spanish
- French
[TBD: Put in the current status of the data tagging, which is partially completed. Reference LightTag and a Big THANKS to them!]
Please put your name next to the regex you would like to work on in this spreadsheet: https://docs.google.com/spreadsheets/d/1rX_bH72CgLMwH5wxwakCAq-gbGsK4wFL4cGVqIZpvCQ/edit#gid=1934842843
This code is meant to be used in conjunction with the data pipeline developed under the bigscience github repository:
[TBD] Put names of all contributors here.