GitHub - wtsi-npg/plinktools: Process Plink genotyping files in Python

wtsi-npg / plinktools Public

forked from iainrb/plinktools

Notifications You must be signed in to change notification settings
Fork 0
Star 1

Process Plink genotyping files in Python

GPL-3.0 license

1 star 2 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README		README
__init__.py		__init__.py
checksum.py		checksum.py
comparison.py		comparison.py
diff.py		diff.py
het_by_maf.py		het_by_maf.py
merge_bed.py		merge_bed.py
plink.py		plink.py
test.py		test.py

Repository files navigation

Plinktools: Python code to process Plink genotyping files

Author: Iain Bancarz, ib5@sanger.ac.uk

1. INTRODUCTION

Plinktools is a Python package for processing Plink format genotyping data. 
It was developed for the following applications not supported by the standard 
Plink executable.

1.1 Fast binary merge

The standard Plink merge function performs extended cross-checking and 
recoding of data. This is rather slow, and does not scale well with an 
increasing number of inputs.

To address this issue, Plinktools implements a fast binary merge for two 
important special cases: Congruent SNP sets with disjoint samples, and 
congruent samples with disjoint SNPs. In both cases, input and output is in 
the default SNP-major format. In the case of congruent samples, the fast 
merge strips off headers and concatenates the input .bed files with no need 
for recoding. For congruent SNPs, a fast merge can be done if the number of 
samples in each input is divisible by 4; otherwise recoding is necessary and 
the merge will be slowed, although still somewhat faster than Plink.

1.2 Heterozygosity calculation by high/low MAF

As part of the Wellcome Trust Oxford SOP for exome chip QC, heterozygosity 
for each sample is calculated separately for SNPs with minor allele frequency 
above and below 1%. This test has been implemented in Plinktools.

1.3 Extended diff

A wrapper for Plink's capability to compare two datasets. It constructs and 
executes the appropriate command, computes some additional statistics, and 
records the results in machine-readable JSON format. The script reports
all differing calls, and those which differ only by transposition of major
and minor alleles. (Such a transposition may be introduced by certain Plink
recoding operations.)

The extended diff also supports a simple equivalence test on two datasets,
via the --brief option.  The standard Plink executable does not provide a 
simple yes/no diff.

The extended diff obtains its input solely from the output of the Plink
executable. Specifically, it parses text output from Plink 1.0.7 and is not 
guaranteed to work with any other version. It does not use any of the 
plinktools Python classes to parse .bed files directly, although it could be 
adapted to do so.

2. USAGE

2.1 Contents

Plinktools includes three front-end scripts:

merge_bed.py to merge two or more Plink binary datasets
het_by_maf.py to compute heterozygosity for high/low MAF
diff.py for extended diff

Run any script with --help for more information.

2.2 Prerequisites

Python >= 2.7
Plink executable should be on the PATH environment variable: Required for 
tests and diff functionality.

3. SEE ALSO

The Plink data format was created by Shaun Purcell et al. See: 
http://pngu.mgh.harvard.edu/~purcell/plink

Plinktools was created to support the WTSI genotyping pipeline: 
https://github.com/wtsi-npg/genotyping

Plinktools is a prerequisite for the WTSI extension of the zCall genotype 
caller: https://github.com/wtsi-npg/zCall