juanfrcaliz/Impute-NAs-Better

 
 


Impute-NAs-Better

Impute missing values while minimizing distortion of each variable's overall distribution by:

  1. Using the columns available in each row to build a bagged model.
  2. Applying that model to the non-NA rows to obtain the distribution of residuals.
  3. Adding variation to the model's output by adding a randomly drawn residual to each prediction.
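The residual step above can be sketched with NumPy alone. This is a simplified illustration, not the repository's code: it fits a single least-squares model rather than a bagged one, it assumes the predictor columns are already complete, and the function name `impute_with_residual_noise` is hypothetical.

```python
import numpy as np


def impute_with_residual_noise(X, y, rng):
    """Fill NaNs in y with a linear prediction plus a randomly drawn residual.

    Hypothetical helper for illustration only: one least-squares fit,
    no bagging, and X is assumed to contain no missing values.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)

    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(len(y)), X])

    # Fit on the rows where the target is observed.
    coef, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)

    # Residuals of the fitted model on the observed rows.
    residuals = y[~missing] - A[~missing] @ coef

    # Prediction for each missing row, plus one residual sampled with replacement.
    noise = rng.choice(residuals, size=int(missing.sum()), replace=True)
    filled = y.copy()
    filled[missing] = A[missing] @ coef + noise
    return filled


# Synthetic example: y depends linearly on x, and the first 40 targets are removed.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
y_missing = y.copy()
y_missing[:40] = np.nan

y_filled = impute_with_residual_noise(x.reshape(-1, 1), y_missing, rng)
```

Without the sampled residual, every imputed value would lie exactly on the fitted line, which shrinks the variance of the imputed column. Adding residuals drawn from the observed rows keeps the spread of the filled values closer to that of the original data.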

As designed, this imputer takes a dataframe whose categorical variables are encoded as strings and imputes all missing values. It starts with the columns that have the fewest NAs, then uses the newly NA-free columns in the subsequent imputations.

The regression estimator is linear regression, and the classifier is a random forest.
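As a rough sketch of the column ordering and estimator choice described above, the snippet below sorts columns by NA count and labels each one by dtype. The dataframe and the names `order` and `estimator_kind` are made up for illustration, and the assumption that non-numeric (string) columns go to the classifier while numeric columns go to the regressor is inferred from the description rather than taken from the repository's code.

```python
import pandas as pd

# Toy dataframe (illustrative data only); the categorical column is stored as strings.
df = pd.DataFrame({
    "age": [31.0, None, 45.0, None, None],
    "city": ["Lima", "Quito", None, "Lima", "Quito"],
    "income": [52.0, 48.0, 61.0, 39.0, None],
})

# Count NAs per column and keep only the columns that need imputing.
na_counts = df.isna().sum()
needs_imputing = na_counts[na_counts > 0]

# Fewest NAs first; mergesort is stable, so ties keep their original column order.
order = needs_imputing.sort_values(kind="mergesort").index.tolist()

# Assumption: numeric columns use the regressor, everything else the classifier.
estimator_kind = {
    col: "regressor" if pd.api.types.is_numeric_dtype(df[col]) else "classifier"
    for col in order
}
```

Here `city` and `income` each have one NA and `age` has three, so `age` is imputed last and can draw on the other two columns once they are complete.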

This imputer is an implementation of a technique described in the following paper:

Joseph L. Schafer & Maren K. Olsen (1998) Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective, Multivariate Behavioral Research, 33:4, 545-571, DOI: 10.1207/s15327906mbr3304_5
