Tool to identify concepts in the OMOP Genomic Vocabulary from VCF and other files, as well as from HGVS notations

Overview

KOIOS is an open-source tool developed and supported by the OHDSI Oncology WG. It lets users combine their variant data with the OMOP Genomic Vocabulary to generate a set of standard genomic concept IDs from raw patient-level genomic data.

Installation

KOIOS can presently be installed directly from GitHub:

# install.packages("devtools")
devtools::install_github("odyOSG/KOIOS")

Usage

userScript.R

The file userScript.R can be loaded as a default workflow in which only the reference genome and the VCF file (or directory of VCF files) need to be specified.

Manual

Users must provide at least one valid VCF file in either .vcf or .vcf.gz format. This may be in the form of a single file, or a directory containing a set of .vcf or .vcf.gz files.
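
Before loading, it can help to confirm which files at a given path actually carry these extensions. The following is a minimal base-R sketch; `findVcfInputs` is a hypothetical helper written for illustration and is not part of KOIOS.

```r
# Hypothetical helper (not part of KOIOS): list the .vcf / .vcf.gz files
# found at a path, which may be a single file or a directory.
findVcfInputs <- function(path) {
  if (dir.exists(path)) {
    files <- list.files(path, full.names = TRUE)
  } else if (file.exists(path)) {
    files <- path
  } else {
    stop("Path does not exist: ", path)
  }
  # Keep only names ending in .vcf or .vcf.gz (case-insensitive)
  files[grepl("\\.vcf(\\.gz)?$", files, ignore.case = TRUE)]
}

# Example: fail early if a directory holds no usable input
# inputs <- findVcfInputs("SomeDirectory/")
# if (length(inputs) == 0) stop("No .vcf or .vcf.gz files found")
```

The check looks only at file names, so it does not guarantee that the contents are valid VCF.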

KOIOS can be run with the following pipeline:

library(KOIOS)

#Load the OMOP Genomic Vocabulary into R
concepts <- loadConcepts()

#Specify input file or directory
vcf <- loadVCF(userVCF = "Input.vcf")

#Specify and load human reference genome, if known
ref <- "hg19"
ref.df <- loadReference(ref)

#Process VCF and generate all relevant HGVSG identifiers for input records
vcf.df <- processVCF(vcf)
vcf.df <- generateHGVSG(vcf = vcf.df, ref = ref.df)

vcf.df <- processClinGen(vcf.df, ref = ref, progressBar = FALSE)

#Combine this output data with the OMOP Genomic vocab to produce a DF containing a list of concept codes
vcf.df <- addConcepts(vcf.df, concepts, returnAll = TRUE)

If the reference genome used to generate a given VCF file is unknown, the following commands can be run to infer it by checking the VCF variants against known ClinGen variants.

vcf <- loadVCF(userVCF = "Input.vcf")

ref <- "auto"

ref <- findReference(vcf)
ref.df <- loadReference(ref)

Multi-VCF Pipeline

Multiple VCF files within a single directory may be submitted simultaneously within a single command:

#Load the VCF directory
vcf <- loadVCF(userVCF = "SomeDirectory/")

#Set ref to hg19
ref <- "hg19"

concepts.df <- multiVCFPipeline(vcf, ref, generateTranscripts, concepts)

While it is possible to use the automatic reference finder for multiple files, it is not recommended due to the long runtime.

Other Data Formats

It is also possible to run KOIOS on VCF-like data formats, with examples detailed below. An appropriate reference is required, as with VCF data.

cBioPortal mutations data

mutations <- read.csv("data_mutations.txt", sep = "\t")

#reference information is likely stored in mutations$NCBI_Build

mut_vcf <- processcBioPortal(mutations)
mut_vcf <- processClinGen(mut_vcf, ref = ref, progressBar = FALSE)
mut_vcf <- addConcepts(mut_vcf, concepts)
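
The snippet above expects `ref` to be set already. One way to derive it from the NCBI_Build column is sketched below in base R. `buildToRef` is a hypothetical helper, not part of KOIOS. GRCh37 corresponds to hg19 and GRCh38 to hg38, but whether KOIOS accepts "hg38" as a reference name is an assumption, since only "hg19" appears in this README.

```r
# Hypothetical helper (not part of KOIOS): translate cBioPortal NCBI_Build
# values into UCSC-style reference names.
buildToRef <- function(build) {
  key <- toupper(trimws(as.character(build)))
  lookup <- c("GRCH37" = "hg19", "37" = "hg19", "HG19" = "hg19",
              "GRCH38" = "hg38", "38" = "hg38", "HG38" = "hg38")
  ref <- unname(lookup[key])
  if (anyNA(ref)) {
    stop("Unrecognised NCBI_Build value(s): ",
         paste(unique(build[is.na(ref)]), collapse = ", "))
  }
  ref
}

# A mutations table should normally carry a single build;
# with real data this would be buildToRef(mutations$NCBI_Build)
builds <- c("GRCh37", "GRCh37", "37")
ref <- unique(buildToRef(builds))
stopifnot(length(ref) == 1)
```

Stopping when more than one build is present avoids silently annotating a mixed table against the wrong reference.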

HGVSg

HGVSg data can be directly read into KOIOS and submitted via the processClinGen function. A minimal HGVSg dataframe input requires a column named “hgvsg”.

hgvsg <- read.csv("hgvsg.csv", sep = "\t")
hgvsg <- processClinGen(hgvsg, ref = ref)
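
A minimal input can also be assembled by hand rather than read from a file. The sketch below builds a one-row data frame with the required "hgvsg" column and runs a loose sanity check on it. The example variant (BRAF V600E in hg19 coordinates) and the RefSeq accession style are illustrative assumptions; the exact notation KOIOS expects is not specified in this README.

```r
# Build a minimal HGVSg input by hand; the only required column is "hgvsg".
# The variant is BRAF V600E in hg19 (GRCh37) coordinates, used for illustration.
hgvsg <- data.frame(
  hgvsg = c("NC_000007.13:g.140453136A>T"),
  stringsAsFactors = FALSE
)

# Check the required column and a loose "<sequence>:g.<change>" shape
stopifnot("hgvsg" %in% names(hgvsg))
looksLikeHgvsg <- grepl("^[^:]+:g\\.", hgvsg$hgvsg)
stopifnot(all(looksLikeHgvsg))

# The data frame can then be passed on as shown above, for example:
# hgvsg <- processClinGen(hgvsg, ref = "hg19")
```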

HGVSc and transcript/protein data

Data already formatted in transcript (HGVSc) or protein (HGVSp) notation, such as cBioPortal input data (as below), may also be submitted to KOIOS.

These data are simply matched directly with the extended concepts object, derived from the OMOP Genomic vocabulary.

transcript_data <- read.csv("data_transcripts.txt", sep = "\t")
transcript_merge <- merge(transcript_data, concepts_ext, by.x = "hgvsc", by.y = "concept_synonym_name")

#The following is an optional step to remove version information from input transcript HGVSc.
#This allows a wide range of older data to be matched against the vocabulary, but has a small chance of generating false positive matches.
#The dot is escaped so that only a literal "." followed by version digits is removed.

#transcript_data$match_hgvs <- gsub("\\.[0-9]+:", ":", transcript_data$HGVSc)
#concepts_ext$match_hgvs <- gsub("\\.[0-9]+:", ":", concepts_ext$concept_synonym_name)
#transcript_merge <- merge(transcript_data, concepts_ext, by = "match_hgvs")
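
To make the effect of version stripping concrete, the sketch below applies it to single strings in base R. An unescaped `.` in a regular expression matches any character, so the pattern here escapes it and requires at least one digit. `stripVersion` is a hypothetical name used for illustration, and the transcript identifiers are examples only.

```r
# Strip the ".<version>" that precedes the colon in a transcript-level HGVS string.
# The dot is escaped so that only a literal "." followed by digits is removed.
stripVersion <- function(x) sub("\\.[0-9]+:", ":", x)

stripVersion("NM_004333.4:c.1799T>A")  # "NM_004333:c.1799T>A"
stripVersion("NM_004333:c.1799T>A")    # unchanged: no version present

# Two different versions collapse to the same key, which is where
# the false-positive risk comes from.
stripVersion("NM_004333.4:c.1799T>A") == stripVersion("NM_004333.6:c.1799T>A")  # TRUE
```

Because the version is discarded, a variant described against one transcript version can match a concept defined on another, even if the two versions differ in sequence at that position.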

Fusions

KOIOS may also be used to match gene fusion data with the relevant concept_ids, such as cBioPortal gene fusion data (as below).

concepts_fusion <- loadConcepts_fusions()

fusions_data <- read.csv("data_sv.txt", sep = "\t")
fusions_data <- generateFusions_cBioPortal(fusions_data, concepts_fusion)

Getting help

If you encounter a clear bug, please file an issue with a minimal reproducible example at the GitHub issues page.
