Readme.Rmd

---
title: "README"
output: md_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

<p float="left">
<img src="./img/koios.png" style="vertical-align: center;" width="100"/><img src="./img/ods_logo.jpg" style="vertical-align: center;" width="100"/>
</p>

## Overview

KOIOS is a tool developed by [Odysseus Data Services Inc](https://odysseusinc.com/) that allows users to combine their variant data with the OMOP Genomic Vocabulary in order to generate a set of genomic standard concept IDs from raw patient-level genomic data.

## Installation

KOIOS can presently be installed directly from GitHub:

``` r
# install.packages("devtools")
devtools::install_github("odyOSG/KOIOS")
```

## Usage
### userScript.R

The file userScript.R may be loaded as a default workflow wherein only the initial reference genome and VCF file or VCF files directory need be specified.

### Manual

Users must provide at least one valid VCF file in either .vcf or .vcf.gz format. This may be in the form of a single file, or a directory containing a set of .vcf or .vcf.gz files.

Users may simply run KOIOS according to the following simple pipeline:

``` r

library(KOIOS)

#Load the OMOP Genomic Vocabulary into R
concepts <- loadConcepts()

#Specify input file or directort
vcf <- loadVCF(userVCF = "Input.vcf")

#Specify and load human reference genome, if known
ref <- "hg19"
ref.df <- loadReference(ref)

#Process VCF and generate all relevant HGVSG identifiers for input records
vcf.df <- processVCF(vcf)
vcf.df <- generateHGVSG(vcf = vcf.df, ref = ref.df)

vcf.df <- processClinGen(vcf.df, ref = ref, progressBar = F)

#Combine this output data with the OMOP Genomic vocab to produce a DF containing a list of concept codes
vcf.df <- addConcepts(vcf.df, concepts, returnAll = T)

```

If the user is unaware of the reference genome used to generate a given VCF file they may run the following command, which checks their VCF variants against known ClinGen variants.

``` r
vcf <- loadVCF(userVCF = "Input VCF")

ref <- "auto"

ref <- findReference(vcf)
ref.df <- loadReference(ref)
```

### Multi-VCF Pipeline

Multiple VCF files within a single directory may be submitted simultaneously within a single command:

```r
#Load the VCF directory
vcf <- loadVCF(userVCF = "SomeDirectory/")

#Set ref to hg19
ref <- "hg19"

concepts.df <- multiVCFPipeline(vcf, ref, generateTranscripts, concepts)

```

While it is possible to use the automatic reference finder for multiple files, it is not recommended due to the long runtime.

### Other Data Formats

It is also possible to run KOIOS on VCF-like data formats, with examples detailed below. An appropriate reference is required, as with VCF data.

#### cBioPortal mutations data
```r
mutations <- read.csv("data_mutations.txt", sep = "\t")

#reference information is likely stored in mutations$NCBI_Build

mut_vcf <- processcBioPortal(mutations)
mut_vcf <- processClinGen(mut_vcf, ref = ref, progressBar = F)
mut_vcf <- addConcepts(mut_vcf,concepts)

```

#### HGVSG

HGVSg data can be directly read into KOIOS and submitted via the processClinGen function. A minimal HGVSg dataframe input requires a column named "hgvsg".

```r

hgvsg <- read.csv("hgvsg.csv", sep = "\t")
hgvsg <- processClingen(hgvsg,ref=ref)

```

#### HGVSc and transcript/protein data

Data already formatted into transcript (HGVSc) or protein (HGVSp) formats, such as with cBioPortal input data (As below), may also be submitted to KOIOS.

These data are simply matched directly with the extended concepts object, derived from the OMOP Genomic vocabulary.

```r

transcript_data <- read.csv("data_transcripts.txt", sep = "\t")
transcript_merge <- merge(mut_transcripts,concepts_ext,by.x="hgvsc",by.y="concept_synonym_name)

#The following is an optional step to remove version information from input transcript HGVSc. 
#This allows for a wide range of older data to be submitted to the vocabulary, but has a small chance of generating false positive matches.

#transcript_data$match_hgvs <- gsub(".[0-9]*:",":",mut_transcripts$HGVSc)
#concepts_ext$match_hgvs <- gsub(".[0-9]*:",":",concepts_ext$concept_synonym_name)
#transcript_merge <- merge(mut_transcripts,concepts_ext,by="match_hgvs")

```

#### Fusions

KOIOS may also be used to match gene fusion data with the relevant concept_ids, such as with cBioPortal gene fusion data (As below).

```r

concepts_fusion <- loadConcepts_fusions()

fusions_data <- read.csv("data_sv.txt", sep = "\t")
fusions_data <- generateFusions_cBioPortal(fusions_data,concepts_fusion)

```

## Getting help

If you encounter a clear bug, please file an issue with a minimal [reproducible example](https://reprex.tidyverse.org/) at the [GitHub issues page](https://github.com/OdyOSG/KOIOS/issues).