Skip to content

Latest commit

 

History

History
43 lines (23 loc) · 2.59 KB

README.md

File metadata and controls

43 lines (23 loc) · 2.59 KB

bigbed-jaccard

MIT License

Overview

bigbed-jaccard is a tool designed to quickly estimate pairwise Jaccard similarities between large numbers of bigBed files by using minhashing. If you only want to compare a couple of bed files, it will probably be faster to download them and compute the Jaccard distances directly using bedtools jaccard.

See Details for more information on the algorithm.

Installation

Clone this repo, cd into it, and run cargo build --release. The binary will then be available at target/release/bigbed-jaccard-similarity-matrix

This will require that you have cargo installed, I would recommend using rustup to install the Rust toolchain. Build has been tested on rustc version 1.42.0

Usage

Here, infile is simply a file containing a list of file URLs, one per line, and outfile is the name of the file to write the comparisons to.

$ target/release/bigbed-jaccard-similarity-matrix [infile] [outfile]

There is some utility code in this repository to facilitate data aquisition from the ENCODE portal. Namely, this script will allow you to flatten ENCODE file report TSVs containing JSON objects in the cloud_metadata field, requires pandas>=1.0. You can then convert the output of that script into a file list by extracting the url field and deleting the header.

Benchmarks

Running pairwise comparisons on the 2,677 bigBed files in this file took ~13 minutes on a Mac with 2.6 GHz 6-Core Intel i7 processor and an SSD, resulting in 3,581,827 output comparisons. Memory usage hovered around 1.5 GB.

Details

The minhash algorithm variant used here is bottom-k minhashing, using a single 32-bit hash function described in [1].

Future Work

I'd like to make the underlying code available as a Rust library and also generate Python (possibly NumPy) bindings for the Rust functions.

References

  1. Mikkel Thorup. 2013. Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In Proceedings of the forty-fifth annual ACM symposium on Theory of Computing (STOC ’13). Association for Computing Machinery, New York, NY, USA, 371–380.

  2. Anshumali Shrivastava and Ping Li. 2014. Improved densification of one permutation hashing. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI’14). AUAI Press, Arlington, Virginia, USA, 732–741.