Fix links
tyamaguchi-ucla committed May 2, 2024
1 parent f9d49d8 commit 728b25f
Showing 3 changed files with 4 additions and 4 deletions.
2 changes: 1 addition & 1 deletion docs/source/doc_check.md
@@ -6,7 +6,7 @@ All checks can be run simultaneously via `hatchet check`, or an individual comma

The check for `compute-cn` runs the step on a set of small data files (.bbc/.seg) pre-packaged with HATCHet, and is a quick way to verify if your solver is working correctly.
If you are unable to run this command, it likely indicates a licensing issue with default (Gurobi) solver. To use alternative solvers, see the
-[Using a different Pyomo-supported solver](README.html#usingasolver_other) section of the README for more details.
+[Using a different Pyomo-supported solver](README.md#usingasolver_other) section of the README for more details.

## Input

2 changes: 1 addition & 1 deletion docs/source/recommendation_clustering.md
@@ -4,7 +4,7 @@ The global clustering performed along the genome and jointly across samples is a

The module `cluster-bins` incorporates genomic position to improve clustering using a Gaussian hidden Markov model (GHMM), as opposed to the position-agnostic Gaussian mixture model (GMM) used in `cluster-bins-gmm` and described in the original HATCHet publication. This page describes how to tune the parameters of `cluster-bins` -- for recommendations on `cluster-bins-gmm`, see [this page](recommendation_old_clustering.md) instead.

-The user should validate the results of the clustering, especially in noisy or suspicious cases, through the cluster figures produced by [plot-bins](doc_plot_bins.html) and [plot-bins-1d2d](doc_plot_bins_1d2d.html). More specifically, we suggest the following criteria to evaluate the clustering:
+The user should validate the results of the clustering, especially in noisy or suspicious cases, through the cluster figures produced by [plot-bins](doc_plot_bins.md) and [plot-bins-1d2d](doc_plot_bins_1d2d.md). More specifically, we suggest the following criteria to evaluate the clustering:

1. Every pair of clusters should be clearly distinct in terms of RDR or BAF in at least one sample, and
2. Each cluster should contain regions with similar values of RDR and BAF in all samples
4 changes: 2 additions & 2 deletions docs/source/recommendation_datatype.md
@@ -2,11 +2,11 @@

The default values in the complete pipeline of HATCHet are typically used for analyzing whole-genome sequencing (WGS) data. However, when considering different type of data, as those from whole-exome sequencing (WES) data, users should adjust some of the parameters due to the different features of this kind of data. More specifically, there are 4 main points to consider when analyzing WES data:

-- *Bin sizes*. One can use the plots from [plot-bins](https://github.com/raphael-group/hatchet/blob/master/doc/doc_plot_bins.html) to test different parameters (`--mtr` and `--msr` for variable-width, bin size for fixed width) and inspect the amount of variance and/or the separation between apparent clusters.
+- *Bin sizes*. One can use the plots from [plot-bins](https://github.com/raphael-group/hatchet/blob/master/doc/doc_plot_bins.md) to test different parameters (`--mtr` and `--msr` for variable-width, bin size for fixed width) and inspect the amount of variance and/or the separation between apparent clusters.
* **Variable-width** Having a sufficient number of germline SNPs is needed to have good estimations with low variances for RDR and, especially, for the B-allele frequency (BAF) of each bin. Variable-width binning attempts to account for this by adjusting bin widths to ensure enough total and SNP-covering reads in each bin. You can tune the average bin width using the `--msr` (min. SNP-covering reads, default 5000) and `--mtr` (min. total reads, default 5000) parameters to `combine-counts`. Generally, `--msr` is more important because a bin with enough SNP-covering reads to get a good BAF estimate will almost certainly have enough total reads to get a good RDR estimate. Increasing these parameters produces larger bins (on average) with lower variance, while decreasing these values produces smaller bins (on average) with higher variance.
* **Fixed-width (legacy)** While a size of 50kb is standard for CNA analysis when considering whole-genome sequencing (WGS) data, data from whole-exome sequencing (WES) generally require to use large bin sizes in order to guarantee that each bin contains a sufficient number of heterozygous germline SNPs. As such, more appropriate bin sizes to consider may be 200kb or 250k when analyzing WES data; even larger bin sizes, e.g. `500kb`, may be needed for noisy WES data.

-- *Read-count thresholds*. As suggested in the GATK best practices, `count-alleles` requires two parameters -c (the minimum coverage for SNPs) and -C (the maximum coverage for SNPs) to reliably call SNPs and exclude those in regions with artifacts. GATK suggests to consider a value of -C that is at least twice larger than the average coverage and -c should be large enough to exclude non-sequenced regions. For example, `-c 6` and `-C 300` are values previously used for WGS data whose coverage is typically between 30x and 90x. However, WES data are generally characterized by a much larger average coverage and thus require larger values, e.g. `-c 20` and `-C 600`. These values are also very usefule to discard off-target regions. In any case, the user should ideally pick values according to the considered data.
+- *Read-count thresholds*. As suggested in the GATK best practices, `count-alleles` requires two parameters -c (the minimum coverage for SNPs) and -C (the maximum coverage for SNPs) to reliably call SNPs and exclude those in regions with artifacts. GATK suggests to consider a value of -C that is at least twice larger than the average coverage and -c should be large enough to exclude non-sequenced regions. For example, `-c 6` and `-C 300` are values previously used for WGS data whose coverage is typically between 30x and 90x. However, WES data are generally characterized by a much larger average coverage and thus require larger values, e.g. `-c 20` and `-C 600`. These values are also very useful to discard off-target regions. In any case, the user should ideally pick values according to the considered data.

- *Bootstrapping for clustering*. (legacy `cluster-bins-gmm` only) Occasionally, WES may have very few points and much less data points than WGS. Only in these special cases with very few data points, the global clustering of cluster-bins-gmm may generally benefit from the integrated bootstrapping approach. This approach allow to generate a certain number of synthetic bins from the real ones to increase the power of the clustering. For example, the following cluster-bins-gmm parameters `-u 20 -dR 0.002 -dB 0.002` allow to activate the bootstraping which introduces 20 synthetic bins for each real bin with low variances.
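
The relative-link part of this change (`.html` targets rewritten to `.md`, with any `#anchor` preserved) could also be scripted instead of edited by hand. The following is only a minimal sketch under stated assumptions: `LINK_RE` and `html_links_to_md` are hypothetical names that are not part of HATCHet, and the regex handles only simple inline Markdown links of the form `[text](target)`.

```python
import re

# Hypothetical helper, not part of HATCHet.
# Matches the target of a simple inline Markdown link: [text](target)
LINK_RE = re.compile(r"\]\(([^)\s]+)\)")


def html_links_to_md(text: str) -> str:
    """Rewrite relative `.html` link targets to `.md`, keeping any #anchor."""

    def fix(match: re.Match) -> str:
        target = match.group(1)
        # Assumption: absolute URLs are left untouched, since an external
        # page may legitimately end in .html.
        if "://" in target:
            return match.group(0)
        # Split off the anchor so only the file extension is changed.
        path, sep, anchor = target.partition("#")
        if path.endswith(".html"):
            path = path[: -len(".html")] + ".md"
        return "](" + path + sep + anchor + ")"

    return LINK_RE.sub(fix, text)
```

Because absolute URLs are skipped, the `github.com/.../doc_plot_bins.html` link in `recommendation_datatype.md` would still need the manual edit shown in this commit.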
