Skip to content
Ludwig Schmidt edited this page Dec 8, 2015 · 25 revisions

Frequently Asked Questions

Q: You say that FALCONN uses LSH. What is LSH?

A: LSH stands for Locality-Sensitive Hashing, which is a way to partition a metric space into nicely shaped cells. We have written a short introduction to LSH, which also contains pointers to further literature.


Q: I read that LSH requires a lot of memory and is thus impractical. Is that true?

A: Yes and no. As defined originally, "vanilla" LSH can indeed sometimes require a lot of memory, and for large datasets this can become a problem. However, researchers have come up with multiprobe LSH, which is a simple and incredibly efficient way to reduce the space used by LSH-based data structures. FALCONN supports multiprobe LSH and works very well in the regime where the entire LSH data structure is about as large or even smaller than the dataset itself. For instance, if you have data set of size 4GB and your machine has 8GB of RAM, FALCONN should not run into memory issues.


Q: FALCONN provides LSH for cosine similarity. How can it be used for the general Euclidean distance?

A: LSH for cosine similarity is good enough if your dataset is "nice". What it means formally is that there is a decent correlation between an angle between two vectors and the corresponding Euclidean distance. In this case, you can partition the dataset using LSH for cosine similarity, and then use the genuine Euclidean distance to find the k-nearest neighbors among the candidates. In the beginning, it is good to center your dataset so that the mean is the origin. It often makes the dataset more isotropic and easier to partition. FALCONN can do this preprocessing for you automatically if desired.


Q: I have heard that pairwise distances in high-dimensional datasets tend to have the same value, which renders LSH useless. Is that true?

A: First, there are quite a few datasets for which the above is not the case. Second, if this is nevertheless the case, and there is no hidden low-dimensional structure, your best bet would be to do the (optimized) linear scan, since it is believed that such instances are hard for the Nearest Neighbor Search in general, not only for the LSH-based methods.


Q: What's the advantage of FALCONN over my favorite data-dependent method for the nearest neighbor search?

A: TBD

Clone this wiki locally