Add euclidean distance calculation option for bitwise.dist #172

Closed
4 tasks done
zkamvar opened this issue Feb 4, 2018 · 2 comments

Comments

@zkamvar
Member

zkamvar commented Feb 4, 2018

The current bitwise.dist() does not have an option to calculate Euclidean distance. This can currently be done by converting the data to a matrix and using R's dist() function, but that may be prohibitive for larger data sets because of memory use. It may be possible to implement this relatively easily. If there is no missing data, then for haploids we simply need to take the square root of the distance. For diploids we do the same, but we also have to ensure that the distance at each locus is squared before summing, which means counting every per-locus distance of 2 as 4. This can be done in get_distance_custom():

poppr/src/bitwise_distance.c

Lines 2130 to 2145 in 3994ed3

int get_distance_custom(char sim_set, struct zygosity *z1, struct zygosity *z2)
{
    int dist = 0;
    char Hor;
    char S;
    char ch_dist;
    S = sim_set;
    Hor = z1->ch | z2->ch;
    ch_dist = Hor | S;          // Force ones everywhere they are the same
    dist = get_zeros(S);        // Add one distance for every non-shared zygosity
    dist += get_zeros(ch_dist); // Add another one for every difference that has no heterozygotes
    return dist;
}

On line 2142, if we multiply the result of get_zeros(ch_dist) by 3, the total becomes the sum of squared per-locus distances, because each locus with a distance of 2 then contributes 1 + 3 = 4. Here's a small proof of this in R:

M <- matrix(sample(0:2, 200, replace = TRUE), nrow = 2)
dist(M)
#>          1
#> 2 12.12436
d <- apply(M, 2, dist)
sqrt( sum(d ^ 2) )
#> [1] 12.12436
any_different <- sum(d > 0)
all_different <- sum(d > 1)
sqrt( any_different + (3 * all_different) )
#> [1] 12.12436

Created on 2018-02-04 by the reprex package (v0.1.1.9000).
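The same idea can be sketched in C. This is only an illustration under stated assumptions, not the code that was merged: the struct below is a simplified stand-in that keeps just the `ch` mask, `get_zeros()` is re-implemented here as a plain bit counter, and the names `get_distance_sketch` and `euclidean` are hypothetical. It uses `unsigned char` rather than `char` so the shifts behave the same on every platform.

```c
/* Hypothetical, simplified stand-in for poppr's struct zygosity:
 * only the heterozygosity mask used below is modelled. */
struct zygosity {
    unsigned char ch; /* bit set where the locus is heterozygous */
};

/* Stand-in for get_zeros(): count the zero bits in one 8-locus byte. */
static int get_zeros(unsigned char x)
{
    int zeros = 0;
    for (int i = 0; i < 8; i++) {
        if (((x >> i) & 1u) == 0) {
            zeros++;
        }
    }
    return zeros;
}

/* Sketch of get_distance_custom() with an assumed `euclidean` flag.
 * sim_set has a 1 wherever the two samples share the same zygosity. */
static int get_distance_sketch(unsigned char sim_set,
                               const struct zygosity *z1,
                               const struct zygosity *z2,
                               int euclidean)
{
    unsigned char Hor = z1->ch | z2->ch;   /* either sample heterozygous */
    unsigned char ch_dist = Hor | sim_set; /* zeros: differ, no heterozygote */
    int any_different = get_zeros(sim_set); /* loci with distance 1 or 2 */
    int all_different = get_zeros(ch_dist); /* loci with distance 2 */

    /* Plain sum: 1 + 1 = 2 per fully different locus.
     * Squared sum: 1 + 3 = 4 per fully different locus. */
    return any_different + (euclidean ? 3 : 1) * all_different;
}
```

For example, with `sim_set = 0xF0` and heterozygotes at the two lowest loci, four loci differ and two of those differ by 2, so the plain distance is 6 and the squared distance is 10. The square root would still be taken afterwards, once the byte-level results have been summed.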

Missing data

The challenge is how to handle missing data. To match R's dist() function, the sum for a comparison that involves missing data must be re-scaled by n/(n - x), where n is the number of sites and x is the number of sites with missing data in that comparison. This requires counting the missing sites while the distance is being accumulated, which should only involve one extra variable.
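A minimal C sketch of that re-scaling step, assuming the squared distance and the missing-site count have already been accumulated for one pair of samples. The function name, its parameters, and the negative sentinel are hypothetical and not part of poppr; the square root is left to the caller.

```c
/* Scale a sum of squared per-locus distances up to the full number of
 * sites, as R's dist() does when some columns are excluded.
 *
 *   sq_dist   : squared distance summed over the sites that were compared
 *   n_sites   : total number of sites (n)
 *   n_missing : sites skipped because either sample was missing (x)
 *
 * Returns sq_dist * n / (n - x), or -1.0 as a hypothetical sentinel when
 * no site could be compared (R's dist() reports NA in that case). */
double scale_squared_distance(int sq_dist, int n_sites, int n_missing)
{
    int n_used = n_sites - n_missing;
    if (n_used <= 0) {
        return -1.0;
    }
    return (double)sq_dist * n_sites / n_used;
}
```

With 10 sites, 2 of them missing, and a squared distance of 8 over the remaining 8 sites, the scaled value is 8 * 10 / 8 = 10, so the reported Euclidean distance would be the square root of 10.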

Tasks

  • add euclidean argument to bitwise.dist()
  • add euclidean argument to get_distance_custom()
  • take the square root of the result (in bitwise.dist())
  • add counter for missing data in bitwise_dist_haploid() and bitwise_dist_diploid()
@zkamvar
Member Author

zkamvar commented Feb 4, 2018

Note: here is the stackoverflow question that informed me about the scaling for missing data: https://stackoverflow.com/q/18117174/2752888

@zkamvar
Member Author

zkamvar commented Apr 6, 2018

This was fixed in #176

@zkamvar zkamvar closed this as completed Apr 6, 2018