---
title: "README"
output: md_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
<p float="left">
<img src="./img/koios.png" style="vertical-align: center;" width="100"/><img src="./img/ods_logo.jpg" style="vertical-align: center;" width="100"/>
</p>
## Overview
KOIOS is a tool developed by [Odysseus Data Services Inc](https://odysseusinc.com/) that allows users to combine their variant data with the OMOP Genomic Vocabulary in order to generate a set of genomic standard concept IDs from raw patient-level genomic data.
## Installation
KOIOS can presently be installed directly from GitHub:
``` r
# install.packages("devtools")
devtools::install_github("odyOSG/KOIOS")
```
## Usage
### userScript.R
The file `userScript.R` provides a default workflow in which only the reference genome and the path to a VCF file, or to a directory of VCF files, need to be specified.
### Manual
Users must provide at least one valid VCF file in either .vcf or .vcf.gz format. This may be in the form of a single file, or a directory containing a set of .vcf or .vcf.gz files.
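Before loading, it can help to confirm which files at a given path carry one of the accepted extensions. The following is a minimal base-R sketch. `listVcfFiles` is a hypothetical helper written for this illustration and is not part of KOIOS.

```r
# Hypothetical helper (not part of KOIOS): return the .vcf / .vcf.gz files
# found at `path`, which may be a single file or a directory.
listVcfFiles <- function(path) {
  vcfPattern <- "\\.vcf(\\.gz)?$"
  if (dir.exists(path)) {
    files <- list.files(path, pattern = vcfPattern, full.names = TRUE)
  } else if (file.exists(path) && grepl(vcfPattern, path)) {
    files <- path
  } else {
    files <- character(0)
  }
  files
}

# Demonstration on a temporary directory holding empty placeholder files
demoDir <- file.path(tempdir(), "vcf_demo")
dir.create(demoDir, showWarnings = FALSE)
file.create(file.path(demoDir, c("a.vcf", "b.vcf.gz", "notes.txt")))
basename(listVcfFiles(demoDir))
```

An empty result suggests that the path contains nothing KOIOS could read as VCF input.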
Users may then run KOIOS with the following pipeline:
``` r
library(KOIOS)
#Load the OMOP Genomic Vocabulary into R
concepts <- loadConcepts()
#Specify input file or directory
vcf <- loadVCF(userVCF = "Input.vcf")
#Specify and load human reference genome, if known
ref <- "hg19"
ref.df <- loadReference(ref)
#Process VCF and generate all relevant HGVSG identifiers for input records
vcf.df <- processVCF(vcf)
vcf.df <- generateHGVSG(vcf = vcf.df, ref = ref.df)
vcf.df <- processClinGen(vcf.df, ref = ref, progressBar = FALSE)
#Combine this output data with the OMOP Genomic vocab to produce a DF containing a list of concept codes
vcf.df <- addConcepts(vcf.df, concepts, returnAll = TRUE)
```
If the reference genome used to generate a given VCF file is unknown, the following commands infer it by checking the VCF variants against known ClinGen variants.
``` r
vcf <- loadVCF(userVCF = "Input VCF")
ref <- "auto"
ref <- findReference(vcf)
ref.df <- loadReference(ref)
```
### Multi-VCF Pipeline
Multiple VCF files in a single directory may be processed with a single command:
```r
#Load the VCF directory
vcf <- loadVCF(userVCF = "SomeDirectory/")
#Set ref to hg19
ref <- "hg19"
#concepts is the object returned by loadConcepts(); generateTranscripts must be defined beforehand
concepts.df <- multiVCFPipeline(vcf, ref, generateTranscripts, concepts)
```
While it is possible to use the automatic reference finder for multiple files, it is not recommended due to the long runtime.
### Other Data Formats
It is also possible to run KOIOS on VCF-like data formats, with examples detailed below. An appropriate reference is required, as with VCF data.
#### cBioPortal mutations data
```r
mutations <- read.csv("data_mutations.txt", sep = "\t")
#reference information is likely stored in mutations$NCBI_Build
mut_vcf <- processcBioPortal(mutations)
mut_vcf <- processClinGen(mut_vcf, ref = ref, progressBar = FALSE)
mut_vcf <- addConcepts(mut_vcf, concepts)
```
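The comment in the block above notes that the reference build is likely stored in `mutations$NCBI_Build`. A minimal sketch of turning that column into a `ref` value might look as follows. `buildToRef` is a hypothetical helper, not part of KOIOS, and the `"hg38"` label is an assumption: only `"hg19"` appears in this README, so check which names `loadReference()` accepts.

```r
# Hypothetical helper (not part of KOIOS): map NCBI_Build values to
# UCSC-style reference names. GRCh37 corresponds to hg19 and GRCh38 to hg38;
# whether KOIOS accepts "hg38" as a reference name is an assumption.
buildToRef <- function(build) {
  lookup <- c("GRCh37" = "hg19", "37" = "hg19",
              "GRCh38" = "hg38", "38" = "hg38")
  builds <- unique(as.character(build[!is.na(build)]))
  if (length(builds) != 1) {
    stop("Expected exactly one genome build, found: ",
         paste(builds, collapse = ", "))
  }
  ref <- unname(lookup[builds])
  if (is.na(ref)) stop("Unrecognised genome build: ", builds)
  ref
}

# Made-up example column, standing in for mutations$NCBI_Build
buildToRef(c("GRCh37", "GRCh37", NA))
```

Stopping on mixed or unrecognised builds is a deliberate choice in this sketch, since processing a file against the wrong reference would silently produce incorrect identifiers.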
#### HGVSg
HGVSg data can be directly read into KOIOS and submitted via the processClinGen function. A minimal HGVSg dataframe input requires a column named "hgvsg".
```r
hgvsg <- read.csv("hgvsg.csv", sep = "\t")
hgvsg <- processClinGen(hgvsg, ref = ref)
```
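Because the column name is matched exactly, a quick check before submission can catch a mis-cased header. The sketch below uses illustrative HGVSg strings and a hypothetical helper, `hasHgvsgColumn`, which is not part of KOIOS.

```r
# Illustrative HGVSg strings, used here only to build a minimal input
hgvsg <- data.frame(
  hgvsg = c("NC_000007.13:g.140453136A>T", "NC_000017.10:g.7578406C>T"),
  stringsAsFactors = FALSE
)

# Hypothetical check (not part of KOIOS): confirm the required column exists.
# Column names in R are case-sensitive, so "HGVSg" would not qualify.
hasHgvsgColumn <- function(df) {
  is.data.frame(df) && "hgvsg" %in% names(df)
}

hasHgvsgColumn(hgvsg)
hasHgvsgColumn(data.frame(HGVSg = "NC_000007.13:g.140453136A>T"))
```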
#### HGVSc and transcript/protein data
Data already formatted as transcript (HGVSc) or protein (HGVSp) identifiers, such as cBioPortal input data (as below), may also be submitted to KOIOS.
These data are simply matched directly with the extended concepts object, derived from the OMOP Genomic vocabulary.
```r
transcript_data <- read.csv("data_transcripts.txt", sep = "\t")
transcript_merge <- merge(transcript_data, concepts_ext, by.x = "hgvsc", by.y = "concept_synonym_name")
#The following is an optional step to remove version information from input transcript HGVSc.
#This allows for a wide range of older data to be submitted to the vocabulary, but has a small chance of generating false positive matches.
#transcript_data$match_hgvs <- gsub("\\.[0-9]+:", ":", transcript_data$HGVSc)
#concepts_ext$match_hgvs <- gsub("\\.[0-9]+:", ":", concepts_ext$concept_synonym_name)
#transcript_merge <- merge(transcript_data, concepts_ext, by = "match_hgvs")
```
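The effect of the optional version-stripping step can be seen on toy data. The sketch below uses only base R. The data frames, the `concept_id` value and the helper `stripVersion` are made up for illustration and do not come from KOIOS or the OMOP Genomic vocabulary.

```r
# Remove the version suffix from the accession part of an HGVSc string,
# e.g. "NM_000546.5:c.524G>A" becomes "NM_000546:c.524G>A".
# The dot is escaped so that only a literal ".<digits>:" is matched.
stripVersion <- function(x) sub("\\.[0-9]+:", ":", x)

# Made-up input and vocabulary rows; the concept_id value is a placeholder
transcripts <- data.frame(
  hgvsc = c("NM_000546.5:c.524G>A", "NM_004333.4:c.1799T>A"),
  stringsAsFactors = FALSE
)
vocab <- data.frame(
  concept_synonym_name = "NM_000546.6:c.524G>A",
  concept_id = 1L,
  stringsAsFactors = FALSE
)

# Exact matching finds nothing because the versions differ (.5 vs .6)
nrow(merge(transcripts, vocab, by.x = "hgvsc", by.y = "concept_synonym_name"))

# Matching on version-free identifiers recovers the row
transcripts$match_hgvs <- stripVersion(transcripts$hgvsc)
vocab$match_hgvs <- stripVersion(vocab$concept_synonym_name)
merged <- merge(transcripts, vocab, by = "match_hgvs")
merged$match_hgvs
```

The same mechanism is the source of the false-positive risk mentioned above: two transcript versions that differ in sequence would be treated as identical once the version is removed.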
#### Fusions
KOIOS may also be used to match gene fusion data with the relevant concept_ids, as with cBioPortal gene fusion data (as below).
```r
concepts_fusion <- loadConcepts_fusions()
fusions_data <- read.csv("data_sv.txt", sep = "\t")
fusions_data <- generateFusions_cBioPortal(fusions_data,concepts_fusion)
```
## Getting help
If you encounter a clear bug, please file an issue with a minimal [reproducible example](https://reprex.tidyverse.org/) at the [GitHub issues page](https://github.com/OdyOSG/KOIOS/issues).