
For instructions on using deepTools 2.0 or newer, please go here. This page only applies to deepTools 1.5.

deepTools contains 3 tools for the normalization of BAM files:

  1. correctGCbias: if you would like to normalize your read distributions to match the expected GC values, you can use the output from computeGCBias to produce a GC-corrected BAM file.
  2. bamCoverage: this tool converts a single BAM file into a bigWig file, enabling you to normalize for sequencing depth.
  3. bamCompare: like bamCoverage, this tool produces a normalized bigWig file, but it takes 2 BAM files, normalizes them for sequencing depth and subsequently performs a mathematical operation of your choice, e.g. it can output the ratio or log2 ratio of the read coverages in the two files, or their difference.

Here you can download slides that we used for teaching. They contain additional details about how the coverage files are generated and normalized.


correctGCbias

What it does

This tool requires the output from computeGCBias to correct the given BAM files according to the method proposed by Benjamini and Speed.

correctGCbias will remove reads from regions with too high coverage compared to the expected values (typically GC-rich regions) and will add reads to regions where too few reads are seen (typically AT-rich regions).

The resulting BAM files can be used in any downstream analyses, but be aware that you should not filter out duplicates from here on (duplicate removal would eliminate those reads that were added to reach the expected number of reads for GC-depleted regions).

output

  • GC-normalized BAM file

Usage

correctGCbias is based on the calculations done by computeGCBias and requires that you have generated a "GC-bias frequency file". This file is a table indicating the expected number of reads per GC content. Once you have run computeGCBias and wish to correct your read distributions to match the expected values, correctGCbias can be run as follows (--effectiveGenomeSize and --genome should be the same as for computeGCBias):

$ /deepTools-1.5/bin/correctGCBias --bamfile myReads.bam \
--effectiveGenomeSize 2150570000 --genome mm9.2bit \
--GCbiasFrequenciesFile frequencies.txt \
--correctedFile myReads_GCcorrected.bam
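
The frequencies.txt file used above is the table produced beforehand by computeGCBias. A minimal sketch of that step, assuming computeGCBias accepts the same --bamfile, --effectiveGenomeSize, --genome and --GCbiasFrequenciesFile parameters plus a --fragmentLength (check computeGCBias --help for the exact names in your version):

$ /deepTools-1.5/bin/computeGCBias --bamfile myReads.bam \
--effectiveGenomeSize 2150570000 --genome mm9.2bit \
--fragmentLength 200 \
--GCbiasFrequenciesFile frequencies.txt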

For more information about the individual parameters, see our page about All command line options.

bamCoverage

What it does

Given a BAM file, this tool generates a bigWig or bedGraph file of fragment or read coverages. The method works by first counting the number of reads (either extended to match the fragment length or not) that overlap each bin in the genome. Bins with zero counts are skipped, i.e. not added to the output file. The resulting [read][] counts can be normalized using either a given scaling factor, the RPKM formula, or to obtain a 1x depth of coverage (RPGC). In the case of paired-end mapping, each read mate is treated independently to avoid a bias when a mixture of concordant and discordant pairs is present; this means that each end will be extended to match the fragment length.

  • RPKM:
    • reads per kilobase per million mapped reads
    • The formula is: RPKM (per bin) = number of reads per bin / ( number of mapped reads (in millions) * bin length (kb) )
  • RPGC:
    • reads per genomic content
    • used to normalize the coverage to 1x depth of coverage
    • sequencing depth is defined as: (total number of mapped reads * fragment length) / effective genome size (see the worked example below)
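
As a small worked example (numbers invented purely for illustration): with a bin size of 50 bp (0.05 kb), 20 reads falling into a bin and 10 million mapped reads in total, RPKM = 20 / (10 * 0.05) = 40. For RPGC, 20 million mapped reads extended to 200 bp fragments on a genome with an effective size of 2,150,570,000 bp correspond to a sequencing depth of (20,000,000 * 200) / 2,150,570,000 ≈ 1.86x; to reach 1x coverage, every bin count is therefore multiplied by roughly 1/1.86 ≈ 0.54.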

output

  • coverage file, either in bigWig or bedGraph format

Usage

Here's an example command to generate a single bigWig file out of a single BAM file via the command line:

$ /deepTools-1.5/bin/bamCoverage --bam corrected_counts.bam \
--binSize 10 --normalizeTo1x 2150570000 --fragmentLength 200 \
-o Coverage.GCcorrected.SeqDepthNorm.bw --ignoreForNormalization chrX

  • The bin size (--binSize/-bs) can be freely chosen; the smaller it is, the larger the resulting file will be.
  • This was a mouse sample, so once it was decided to normalize the file to 1x coverage, the effective genome size for mouse had to be supplied (an RPKM-based alternative is sketched after this list).
  • Chromosome X was excluded from the regions sampled for normalization because the sample came from a male mouse, which carries pairs of autosomes but only a single X chromosome.
  • The fragment length of 200 bp serves only as a fall-back here, because the sample was sequenced paired-end; bamCoverage resorts to the user-specified fragment length only for singletons.
  • --ignoreDuplicates: important! If you normalized for GC bias using correctGCbias, you should absolutely NOT set this parameter (duplicate removal would discard the reads that were added to GC-depleted regions).
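
If you prefer RPKM normalization over 1x coverage, the same command can be run with --normalizeUsingRPKM (also mentioned in the bamCompare section below) instead of --normalizeTo1x. A sketch under that assumption, with a placeholder output file name:

$ /deepTools-1.5/bin/bamCoverage --bam corrected_counts.bam \
--binSize 10 --normalizeUsingRPKM --fragmentLength 200 \
-o Coverage.GCcorrected.RPKMnorm.bw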

Using deepTools Galaxy, you would have chosen the equivalent settings in the bamCoverage tool form (pay attention to the hints shown next to the individual parameters as well).

bamCompare

What it does

This tool compares two BAM files based on the number of mapped reads. To compare the BAM files, the genome is partitioned into bins of equal size, the number of reads falling into each bin is counted for each BAM file, and finally a summary value is reported for every bin. This value can be the ratio of the number of reads per bin, the log2 of the ratio, or the difference. The tool can normalize the read numbers of the two BAM files using the signal extraction scaling (SES) method proposed by Diaz et al.; normalization based on total read counts is also available. The output is either a bedGraph or a bigWig file containing the bin locations and the resulting comparison values. By default, if reads are mated, the fragment length reported in the BAM file is used. In the case of paired-end mapping, each read mate is treated independently to avoid a bias when a mixture of concordant and discordant pairs is present; this means that each end will be extended to match the fragment length. bamCompare only uses the chromosomes common to both BAM files; the --verbose option shows which chromosomes these are.
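
For example, a simple log2 ratio written out as a bedGraph, with --verbose to print the common chromosomes, could look like the sketch below. Note that the --outFileFormat option is an assumption here (it may not exist or may be named differently in deepTools 1.5), so check bamCompare --help; the other parameters mirror the Usage example further down.

$ /deepTools-1.5/bin/bamCompare --bamfile1 ChIP.bam --bamfile2 Input.bam \
--binSize 50 --fragmentLength 200 --ratio log2 \
--outFileFormat bedgraph --verbose -o log2ratio_ChIP_vs_Input.bedgraph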

output file

  • same as for bamCoverage, except that you now obtain 1 coverage file that is based on 2 BAM files.

Usage

Here's an example command that generates the log2(ChIP/Input) values via the command line:

$ /deepTools-1.5/bin/bamCompare --bamfile1 ChIP.bam --bamfile2 Input.bam \
--binSize 25 --fragmentLength 200 --missingDataAsZero no \
--ratio log2 --scaleFactorsMethod SES -o log2ratio_ChIP_vs_Input.bw

The Galaxy equivalent:

Note that the option "missing Data As Zero" can be found within the "advanced options" (default: no).

  • like for bamCoverage, the bin size is completely up to the user
  • the fragment length (-f) will only be taken into consideration for reads without mates
  • the SES method (see below) was used for normalization because the ChIP sample targeted a histone mark with highly localized enrichments (similar to the left-most plot of the fingerprint examples)
Some (more) parameters to pay special attention to
--scaleFactorsMethod (in Galaxy: "Method to use for scaling the largest sample to the smallest")

Here, you can choose how you would like to normalize to account for variation in sequencing depths. We provide:

  • the simple normalization based on the total [read][] count (see the sketch below)
  • the more sophisticated signal extraction scaling (SES) method proposed by Diaz et al. for the normalization of ChIP-seq samples. We recommend using SES only for cases where the distinction between [input][] and ChIP is very clear in the bamFingerprint plots. This is usually the case for transcription factors and sharply defined histone marks such as H3K4me3.
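
A sketch of switching the Usage example above from SES to the simple read-count scaling; the option value name readCount is an assumption (it is the name used in later deepTools versions), so check bamCompare --help:

$ /deepTools-1.5/bin/bamCompare --bamfile1 ChIP.bam --bamfile2 Input.bam \
--binSize 25 --fragmentLength 200 --ratio log2 \
--scaleFactorsMethod readCount -o log2ratio_ChIP_vs_Input.readCount.bw
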
--ratio (in Galaxy: "How to compare the two files")

Here, you can choose how the two input files should be compared, e.g. by taking the ratio or by subtracting the second BAM file from the first. If you want to subtract one sample from the other, you will additionally have to choose whether to normalize to 1x coverage (--normalizeTo1x) or to Reads Per Kilobase per Million mapped reads (--normalizeUsingRPKM; similar to RNA-seq normalization schemes).
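
For instance, a depth-normalized difference track could be generated as sketched below; the value name subtract for --ratio is an assumption based on the options described above, and the effective genome size shown is the mouse value used earlier on this page:

$ /deepTools-1.5/bin/bamCompare --bamfile1 ChIP.bam --bamfile2 Input.bam \
--binSize 25 --fragmentLength 200 --ratio subtract \
--normalizeTo1x 2150570000 -o subtracted_ChIP_minus_Input.bw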


[read]: https://github.com/fidelram/deepTools/wiki/Glossary#terminology "the DNA piece that was actually sequenced ('read') by the sequencing machine (usually between 30 to 100 bp long, depending on the read-length of the sequencing protocol)"
[input]: https://github.com/fidelram/deepTools/wiki/Glossary#terminology "confusing, albeit commonly used name for the 'no-antibody' control sample for ChIP experiments"