Skip to content
Gautam Kamath edited this page Dec 10, 2015 · 25 revisions

Frequently Asked Questions

If your question is not answered here, just send us an email to falconn.lib@gmail.com. We'd be happy to help!


Q: You say that FALCONN uses LSH. What is LSH?

A: LSH stands for Locality-Sensitive Hashing, which is a way to partition a metric space into nicely shaped cells. We have written a short introduction to LSH, which also contains pointers to further literature.


Q: I read that LSH requires a lot of memory and is thus impractical. Is that true?

A: Yes and no. As defined originally, "vanilla" LSH can indeed sometimes require a lot of memory, and for large datasets this can become a problem. However, researchers have come up with multiprobe LSH, which is a simple and efficient way to reduce the space used by LSH-based data structures. FALCONN supports multiprobe LSH and works very well in the regime where the entire LSH data structure is about as large or even smaller than the dataset itself. For instance, if you have a data set of size 4GB and your machine has 8GB of RAM, FALCONN should not run into memory issues.


Q: FALCONN provides LSH for cosine similarity. How can it be used for the general Euclidean distance?

A: LSH for cosine similarity is good enough if your dataset is "nice". What it means formally is that there is a decent correlation between an angle between two vectors and the corresponding Euclidean distance. In this case, you can partition the dataset using LSH for cosine similarity, and then use the genuine Euclidean distance to find the k-nearest neighbors among the candidates. In the beginning, it is good to center your dataset so that the mean is the origin. It often makes the dataset more isotropic and easier to partition.


Q: I have heard that pairwise distances in high-dimensional datasets tend to have the same value, which renders LSH useless. Is that true?

A: The above statement has been formally proven for random data sets. Most applications involve data sets that are not random but instead posses some structure. As a result there are many "real" datasets to which this statement does not apply. Second, if this is nevertheless the case and your data is essentially random, your best bet would likely be to use the (optimized) linear scan. It is conjectured that such instances are hard for the Nearest Neighbor Search in general (not only for the LSH-based methods), i.e., that they require linear time to answer queries.


Q: Will FALCONN work well on my data?

A: That depends on the dataset. One property that enables fast nearest neighbor search is that a typical distance to a nearest neighbor is way smaller than a typical distance between two data points. If you have trouble getting FALCONN to work well, send us an email.


Q: How should I set the parameters?

A: See the corresponding section in the LSH Primer. We also provide a function set_up_parameters that sets the parameters meaningfully for nice enough datasets (see How to use FALCONN for details).


Q: How long does it take to set up a FALCONN data structure?

A: Usually, building the data structure is very quick and almost always it's way faster than what it takes for other similarity search software. Eventually, we plan to publish more detailed benchmarks, but let's just say that for several million points in several hundred dimensions the set-up time typically takes around several minutes.

Clone this wiki locally