Skip to content
Ilya Razenshteyn edited this page Dec 8, 2015 · 25 revisions

Frequently Asked Questions

  • Q: You say that FALCONN uses LSH. What is LSH anyway?
  • A: LSH is locality-sensitive hashing, a way to partition a metric space into nicely-shaped cells. See a brief LSH primer for an introduction.
  • Q: My favorite paper about the nearest neighbor search claims that LSH requires lots of memory and is thus impractical. Is that true?
  • A: Yes and no. As defined originally, LSH indeed can require lots of memory and for large datasets that may be infeasible. However, in 2007, researchers came up with a simple and incredibly efficient way to reduce the space used by the LSH-based data structure, called multiprobe LSH. FALCONN supports multiprobe LSH and works very well with the whole data structure being as small as the dataset itself or even smaller.
  • Q: FALCONN provides LSH for cosine similarity. How come you claim it can be used for the general Euclidean distance?
  • A: LSH for cosine similarity is good enough if your dataset is reasonable. What it means formally is that there is a decent correlation between an angle between two vectors and the corresponding Euclidean distance. In this case, you can partition the dataset using LSH for cosine similarity, and then use the genuine Euclidean distance to find the K-nearest neighbors among the candidates. In the beginning, it is good to re-center your dataset so that the center of mass is the origin. It often makes the dataset more isotropic and easier to partition. FALCONN provides both of these features.
  • Q: There is a word on the street that in most of the interesting high-dimensional datasets all the pairwise distances are well-concentrated around the same value, which renders LSH useless. Is that true?
  • A: (To our surprise, this is quite a common question!) First, there are quite a few datasets for which the above is not the case. Second, if this is nevertheless the case, and there is no hidden low-dimensional structure, your best bet would be to do the (optimized) linear scan, since we believe that such instances are hard for the Nearest Neighbor Search in general, not only for the LSH-based methods.
  • Q: What's the advantage of FALCONN over my favorite data-dependent method for the nearest neighbor search?
  • A: TBD
  • Q: How are FALCONN and LSHKIT related?
  • A: TBD
Clone this wiki locally