Data Extraction, Validation, and Cleaning


The CODI PPRL process depends on information pulled from a database or translated from a .csv file structured to match the CODI Data Model. extract.py either connects to a database and extracts information, or reads from a provided .csv file, cleaning and validating it to prepare it for the PPRL process. The script will output a temp-data/pii-TIMESTAMP.csv file that contains the PII ready for garbling.

To extract from a database, extract.py requires a database connection string. Consult the SQLAlchemy documentation to determine the exact string for the database in use.
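
SQLAlchemy connection strings generally take the form dialect://username:password@host:port/database; for instance, the example run below uses postgresql://codi:codi@localhost/codi to connect to a local PostgreSQL database.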

(To translate from a .csv file instead, extract.py requires a .json configuration file, the path to which must be specified with the --csv_config flag. The requirements of this configuration file are described below.)

By default, extract.py expects a schema named cdm containing the CODI Data Model PRIVATE_DEMOGRAPHIC and PRIVATE_ADDRESS_HISTORY tables. The --schema_name option can be used to provide the name of the schema containing these tables in the source system, if other than cdm.
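
For example, if the tables live in a (hypothetical) schema named my_cdm:

python extract.py postgresql://codi:codi@localhost/codi --schema_name my_cdm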

Because CODI PPRL currently operates with only a single address per individual and the PRIVATE_ADDRESS_HISTORY table allows for multiple addresses, the extraction process must select a single one. By default, the logic is to select the most recent address (based on ADDRESS_PERIOD_START) with ADDRESS_PREFERRED='Y'; however, this logic can be particularly slow on large datasets. The --address_selection flag can be used to specify alternate approaches:

  • --address_selection full : The default approach; selects the most recent address with ADDRESS_PREFERRED='Y'.
  • --address_selection preferred : Bypasses the date logic and selects all rows where ADDRESS_PREFERRED='Y'. This is intended for data owners who may have multiple addresses per PATID but can guarantee that only one per PATID has ADDRESS_PREFERRED='Y'.
  • --address_selection single : Bypasses all filters and selects all addresses. This is intended for data owners whose data is guaranteed to contain only a single address per PATID.
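
For example, to extract using the preferred approach:

python extract.py postgresql://codi:codi@localhost/codi --address_selection preferred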

IMPORTANT: If using the preferred or single approach described above, it is critical to ensure the extracted pii.csv contains exactly one row per PATID. Using the data characterization script described below, review the total number of rows to ensure it matches your expectations (i.e., if it is low, then certain PATIDs may not have a PREFERRED address, or some other selection error may have occurred) and ensure the number of duplicates is low, preferably 0 (if it is high, then there may be multiple addresses per individual).
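
As a quick supplementary check, a few lines of Python can count duplicate IDs in the extracted file. This is only a sketch: it assumes the output has a header row with a record_id column (as suggested by the example report below); adjust the path to match your actual timestamped file.

import csv
from collections import Counter

# Count how many rows share each record_id in the extracted PII file.
with open("temp-data/pii-TIMESTAMP.csv", newline="") as f:
    counts = Counter(row["record_id"] for row in csv.DictReader(f))

# Any record_id appearing more than once indicates multiple rows per individual.
duplicates = {rid: n for rid, n in counts.items() if n > 1}
print("Total rows:", sum(counts.values()))
print("Distinct record_ids:", len(counts))
print("Duplicated record_ids:", len(duplicates))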

A note on performance: even with the preferred or single approaches, selecting all records may still be slow. If possible, adding the following database indices is highly recommended:

  • Unique index on PRIVATE_DEMOGRAPHIC (PATID)
  • Index on PRIVATE_ADDRESS_HISTORY (PATID) -- (potentially a unique index if your specific setup guarantees this)
  • Index on PRIVATE_ADDRESS_HISTORY (PATID, ADDRESS_PREFERRED) -- (again, potentially a unique index if your setup guarantees this expectation)

Finally, if you specify the --verbose flag (-v in the example below), the script will print a report to the terminal documenting various issues it found while extracting the data. An example execution of the script is included below:

$ python extract.py postgresql://codi:codi@localhost/codi -v
Total records exported: 5476

record_id
--------------------

given_name
--------------------
Contains Non-ASCII Characters: 226
Contains Non-printable Characters: 3

family_name
--------------------
Contains Non-ASCII Characters: 2

DOB
--------------------

sex
--------------------

phone_number
--------------------

household_street_address
--------------------
Contains Non-ASCII Characters: 1
Contains Non-printable Characters: 1

household_zip
--------------------
NULL Value: 9

CSV Translation Configuration File

The configuration file used to extract and translate data for PPRL from a .csv file must be a .json file, the path to which is specified with the --csv_config flag.

python extract.py my_data.csv --csv_config my_config.json

The .json file must contain the following fields:

  • A "date_format" field which specifies the string date representation of dates within the .csv for DOB extraction. The string must conform to the 1989 C Date format standard. See Python's datetime documentation for the most relevant information.
  • A "translation_map" field that contains an object mapping standardized column names used in the PPRL process to the column names within the .csv file. The following columns require a mapping or a default value:
    • "given_name"
    • "family_name"
    • "DOB"
    • "sex"
    • "phone"
    • "address"
    • "zip"

Default values are provided as an object within the "translation_map" under the field "default_values". A default value is applied to an entire column if that column is not present within the .csv, and to any individual record that has no value for that field.

The configuration file may also optionally contain the following fields:

  • A "value_mapping_rules" field which contains an dictionary which maps values found within a given column to values expected for the PPRL process (e.g. mapping "Male" and "Female" within the "sex" column to "M" and "F", respectively)
  • An "initial_id" field which gives an integer to be used as the ID number for each of the records if a suitable column is not present in the .csv file or if a value is not present for that field in a given record. IDs are generated sequentially counting up from the provided "initial_id" if used.

See testing-and-tuning/sample_conf.json for an example of the configuration file.
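
To make the shape concrete, a minimal hypothetical configuration might look like the following. The source column names, the default phone value, and the exact nesting of "value_mapping_rules" are assumptions for this sketch; defer to sample_conf.json for the authoritative structure.

{
  "date_format": "%m/%d/%Y",
  "translation_map": {
    "given_name": "first_name",
    "family_name": "last_name",
    "DOB": "birth_date",
    "sex": "gender",
    "phone": "phone_number",
    "address": "street_address",
    "zip": "zip_code",
    "default_values": {
      "phone": "000-000-0000"
    }
  },
  "value_mapping_rules": {
    "sex": {
      "Male": "M",
      "Female": "F"
    }
  },
  "initial_id": 1
}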

Data Quality and Characterization

A data characterization script is provided to assist in identifying data anomalies or quality issues. This script can be run against the pii-TIMESTAMP.csv generated by extract.py or directly against the database used by extract.py. It is recommended that data_analysis.py at least be run against the generated pii-TIMESTAMP.csv file to help ensure that extraction completed successfully.

python data_analysis.py --csv temp-data/pii-TIMESTAMP.csv

python data_analysis.py --db postgresql://username:password@host:port/database

data_analysis.py produces two JSON text files containing summary statistics, including: the total number of records, the number of individuals of each gender, zip code and phone number formats, the most common zip codes, and address lengths at various percentiles, among others. Any aberrant results should be investigated and rectified within the data set before proceeding.

Notes:

  1. DB-related command-line arguments from extract.py listed above, such as specifying a different schema, are also available here. Run python data_analysis.py -h to see the full list of available options.
  2. The data_analysis.py script does not currently support CSV translation the way the extract.py script does. If you are using the CSV translation option for extract.py, then you can only run data_analysis.py against the extracted pii.csv file.