Sequencing Data

HiFi Data

Sequences used for the v0.7 assembly were generated by HPRC and GIAB and are available from AWS. HiFi Revio data were used in the creation of v0.8, v0.9, and v1.0 and are available from PacBio as HG002-rep1, HG002-rep2, and HG002-rep3, and on AWS.

A newly available HiFi Revio dataset for HG002 has been provided by PacBio and is posted on AWS.

Oxford Nanopore Data

Nanopore sequencing was performed by HPRC and GIAB. The fastq data used for the v0.7 assembly are available from AWS. The raw fast5 files are available from the original data sources at HPRC, GIAB, and ONT (Sept and Nov 2020 releases). For v0.8, v0.9, and v1.0, we made use of R10 duplex data available from AWS. In December 2023, EPI2ME announced the release of roughly 40x coverage of ultra-long R10 reads with a read length N50 of 91 kbp and a median accuracy of Q26.4. These reads, often referred to as "Q28", are available from EPI2ME.

Element Biosciences Data

Element Biosciences whole genome data from PCR-free libraries for the entire HG002 trio was used in creating v0.9 and v1.0, and is available on AWS and on the Element Biosciences website.

PacBio Onso Sequence Data

Onso sequencing data was provided by Mark Fleharty at the Broad Institute and is available on AWS. Additional Onso sequencing data was provided by Chris Mason at Weill Cornell Medicine and is also available on AWS.

StrandSeq

Strand-seq data is available from the HPRC.

HiC

Hi-C data is available from the HPRC. The assembly used HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz and HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R2_001.fastq.gz.

Illumina PCRFree Data

For polishing, we made use of 2x250 whole-genome sequence data from Illumina, available from NCBI.

Notes on downloading files

Files are generously hosted by Amazon Web Services. Although they are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface (AWS CLI). To download with the CLI, amend references to use the s3:// addressing scheme; that is, replace https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/ with s3://human-pangenomics/T2T/.
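
The URL rewrite described above can be sketched as a small shell helper. This is only an illustration: the function name https_to_s3 and the file path in the example are hypothetical, and the download command is left commented out because it needs the AWS CLI and network access.

```shell
#!/usr/bin/env bash

# Convert an HTTPS link for the us-west-2 S3 endpoint into its s3:// form.
# https_to_s3 is a hypothetical helper name, not part of the AWS CLI.
https_to_s3() {
  local url="$1"
  local prefix="https://s3-us-west-2.amazonaws.com/"
  case "$url" in
    # Strip the HTTPS endpoint prefix and prepend the s3:// scheme.
    "$prefix"*) printf 's3://%s\n' "${url#"$prefix"}" ;;
    *) echo "not an s3-us-west-2 HTTPS link: $url" >&2; return 1 ;;
  esac
}

# Hypothetical file path, for illustration only; substitute a real link.
url="https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/path/to/example.fastq.gz"
https_to_s3 "$url"
# prints: s3://human-pangenomics/T2T/path/to/example.fastq.gz

# Anonymous download with the AWS CLI (uncomment to run):
# aws s3 cp --no-sign-request "$(https_to_s3 "$url")" .
```

The --no-sign-request option, also used in the listing commands in this section, lets the CLI access the public bucket without AWS credentials.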

The aws s3 command can also be used to get information on the dataset, for example to report the size of every file in human-readable format:

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/scratch/HG002/

Or, to obtain sizes for a specific subdirectory, such as sequencing or assemblies:

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/scratch/HG002/sequencing
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/scratch/HG002/assemblies
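
Recursive listings of large prefixes can run to many lines. When only the totals matter, the output can be filtered down to the two summary lines that --summarize appends. A minimal sketch, assuming the summary lines are labelled "Total Objects:" and "Total Size:"; the function name summarize_prefix is hypothetical.

```shell
#!/usr/bin/env bash

# Print only the object count and total size for one S3 prefix.
# summarize_prefix is a hypothetical helper name.
summarize_prefix() {
  local prefix="$1"
  echo "== $prefix"
  # List everything under the prefix, then keep just the summary lines.
  aws s3 --no-sign-request ls --recursive --human-readable --summarize "$prefix" \
    | grep -E 'Total (Objects|Size):'
}

# Example usage (uncomment to run; requires the AWS CLI and network access):
# summarize_prefix s3://human-pangenomics/T2T/scratch/HG002/sequencing
# summarize_prefix s3://human-pangenomics/T2T/scratch/HG002/assemblies
```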

Adjusting max_concurrent_requests and the related AWS CLI S3 settings, as described in this guide, will improve download performance further.
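
As a rough sketch of what such tuning can look like, the AWS CLI S3 transfer settings can be set with aws configure set. The values below are illustrative assumptions rather than recommendations from the guide, and the function name tune_s3_transfers is hypothetical; suitable values depend on your bandwidth and machine.

```shell
#!/usr/bin/env bash

# Write S3 transfer settings to the default AWS CLI profile.
# tune_s3_transfers is a hypothetical helper; all values are assumptions.
tune_s3_transfers() {
  local concurrency="${1:-20}"   # assumed value; the CLI default is 10
  # Number of transfer requests allowed to run at the same time.
  aws configure set default.s3.max_concurrent_requests "$concurrency"
  # Number of tasks that may wait in the transfer queue.
  aws configure set default.s3.max_queue_size 10000
  # Size of each part when a large file is transferred in parts.
  aws configure set default.s3.multipart_chunksize 64MB
}

# Example usage (uncomment to run; this modifies ~/.aws/config):
# tune_s3_transfers 32
```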

You can also browse all the files available on S3 via the web interface.