
Ideas & Feature proposals


A list of ideas for new functionality and projects in Gensim, topic modelling for humans, a scientific Python package for efficient, large-scale topic modeling.

This page contains initial short descriptions of the projects. For longer, more academic descriptions, see our Student Projects page. Any of the projects below is suitable for an Incubator project or for Google Summer of Code; we just haven't had the time yet to expand every description into a longer one.

Gensim's design philosophy builds on data streaming to process very large datasets (larger than RAM; potentially infinite). Data points are processed one at a time, in constant RAM.

This places stringent requirements on the internal algorithms used (online learning, single pass methods) as well as their implementation, to achieve top performance and robustness.

If you'd like to work on any of the topics below, or have your own ideas, get in touch on the gensim mailing list.


Visualizations

Background: Gensim's motto is "topic modelling for humans". Humans like visualizations, and topic modelling lends itself to pie charts and document graphs very naturally.

To do: survey the ways people visualize topic models. Integrate a visualization package, or implement our own. Build on top of modern technologies like d3.js or Continuum's Bokeh.

The scope of this project can range from a few visualization routines to a full HTTP client/server framework, à la Termite.

Example plots: display how topics are composed of words, how documents are composed of topics, and how the corpus is composed of topics. Make the visualization interactive -- go from words to topics, explore the model.
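As a flavour of the simplest possible routine, here is a minimal sketch (not gensim code) that plots the top words of one topic with matplotlib, assuming `lda` is a trained gensim LdaModel and that show_topic() returns (word, weight) pairs:

```python
import matplotlib.pyplot as plt

def plot_topic(lda, topic_id, topn=10):
    """Horizontal bar chart of the top words in one topic."""
    pairs = lda.show_topic(topic_id, topn=topn)
    words = [word for word, weight in pairs]
    weights = [weight for word, weight in pairs]
    plt.barh(range(len(words)), weights)
    plt.yticks(range(len(words)), words)
    plt.xlabel("word probability")
    plt.title("Topic #%d" % topic_id)
    plt.show()
```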

Resources: Jason Chuang's Termite package. Allison Chaney's TMVE, the Topic Model Visualization Engine. pyLDAvis, which has been ported from the R package LDAvis.

Sanity checks

Background: Gensim newcomers sometimes mistakenly pass a single document where a whole corpus is expected, or pass a string where a list of tokens is expected, etc.

This results in runtime errors, which confuse novices. Even worse, where gensim expects a sequence (~list of tokens) and the user passes a string, there is no error (a string is also an iterable!) but the individual characters are silently treated as whole tokens, leading to unexpected results...

To do: Collect/survey common newbie errors and then implement sanity checks that catch them early. This could be a method or set of methods that accept an object and check whether it's really a streamed corpus, whether it's empty, contains strange ids, ids in the right order, strange feature weights... In case there's anything fishy, log a warning.

Resources: utils.is_corpus() checks whether the input is a corpus, without destroying the data, in case the content is streamed and non-repeatable. Can serve as a starting point template.
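As an illustration of the kind of check meant here, a minimal sketch (the helper name `check_document` is hypothetical, not an existing gensim function):

```python
import logging

logger = logging.getLogger(__name__)

def check_document(doc):
    """Warn about the most common newbie mistakes for a single tokenized document."""
    if isinstance(doc, str):
        logger.warning(
            "expected a list of tokens but got a plain string; "
            "each character would be silently treated as a separate token"
        )
    elif not doc:
        logger.warning("got an empty document")
    elif not all(isinstance(token, str) for token in doc):
        logger.warning("document contains non-string tokens")
```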

Model selection

Background: Gensim strives for reasonable defaults/automatically tuned parameters, but there are usually a few knobs that can be turned for each model (such as the number of topics). Currently there is no support for automatically evaluating model quality while trying different values for these parameters. Users must write an explicit for-loop, going through the desired values and recording the best-performing combination.
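The kind of explicit loop users write today looks roughly like this (a minimal sketch; the choice of LdaModel and the perplexity-based score are illustrative, and `corpus`/`dictionary` are assumed to already exist):

```python
from gensim.models import LdaModel

best_score, best_model = float("-inf"), None
for num_topics in (20, 50, 100, 200):
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    score = model.log_perplexity(corpus)  # per-word likelihood bound; higher is better
    if score > best_score:
        best_score, best_model = score, model
```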

To do: Design an API that automates going through desired parameter values and records the best combination. Ideally, the evaluation happens in parallel, to speed things up and make better use of multicore/cluster setups.

The API must support streamed input (iterables) and piping several transformations one after another, each with a different set of available parameters (e.g. Dictionary's filter_extremes followed by TF-IDF's normalization followed by LSI's num_topics).

Resources: Scikit-learn has a good API for parameter grid search, including parallelization. Maybe integrate the scikit-learn search methods directly if they're powerful enough to accept streamed data; otherwise reimplement the loops directly.

Parallelization resources: multiprocessing, joblib, celery.

Distributed computing

Background: Gensim contains distributed implementations of several algorithms. The implementations use Pyro4 for network communication and are fairly low-level.

To do: Investigate + integrate one of the higher level frameworks for distributed computation, so that gensim can plug into them without reinventing the wheel. Implement one of the algorithms in gensim in this framework: for example, adapt the distributed online Latent Semantic Analysis or online LDA that are already in gensim.

The solution must support online computation (data coming as a non-repeatable stream of documents); batch processing is not enough.

Integration with a framework that plays well with Python (i.e. avoids Spark's serialisation to disk) would be better, so Ibis is a good candidate.

Resources: Ibis, Celery, Spark, Disco, Storm, Samza. Get in touch on the mailing list/@radimrehurek/@ogrisel.

Distributed similarity queries

Background: gensim implements fast routines for similarity retrieval ("give me documents similar to this one, using Latent Semantic Analysis"). The routines can make use of multiple cores (using BLAS), but not multiple machines. For large datasets, it is desirable to store shards in a distributed manner, across a cluster of computers. During querying, collect and merge results from all shards.

To do: Extend the sharding already present in gensim, so that different shards can reside on different computers. Design an API to make the concept of "shards" flexible, so that similarity classes that use different implementations (see k-NN above) can plug into it easily.
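For reference, a minimal sketch of the single-machine sharded index that already exists in gensim and that this project would distribute across machines (the path and shard size are illustrative; `corpus`, `dictionary` and `query_vector` are assumed to exist):

```python
from gensim.similarities import Similarity

# one logical index, stored as multiple shard files under the given prefix
index = Similarity("/tmp/shard", corpus, num_features=len(dictionary), shardsize=32768)
sims = index[query_vector]  # queries all shards and merges the results
```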

The network communication must use a fast protocol (Pyro? ØMQ?), so as to not increase query latency too much.

Resources: gensim mailing list.

Online NNMF

Background: Non-negative matrix factorization is an algorithm similar to Latent Semantic Analysis/Latent Dirichlet Allocation. It falls into the family of matrix factorization methods and can be phrased as an online learning algorithm.

To do: Implement NNMF in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.
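For orientation, a minimal batch (non-streamed) sketch of NMF with the classic Lee & Seung multiplicative updates, built only from the NumPy primitives mentioned above; the online version described in this project would replace the full-matrix passes with streamed mini-batches:

```python
import numpy as np

def nmf(V, num_topics, iterations=200, eps=1e-9):
    """Factorize a non-negative (terms x documents) matrix V into W.dot(H)."""
    num_terms, num_docs = V.shape
    W = np.random.rand(num_terms, num_topics)
    H = np.random.rand(num_topics, num_docs)
    for _ in range(iterations):
        # multiplicative updates; all heavy lifting is BLAS matrix products
        H *= W.T.dot(V) / (W.T.dot(W).dot(H) + eps)
        W *= V.dot(H.T) / (W.dot(H).dot(H.T) + eps)
    return W, H
```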

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgements included in gensim, or evaluate in some other way.

Resources: Online NMF. Gensim github issue #132.

Supervised LDA

Background: Users have requested "supervised Latent Dirichlet Allocation" (sLDA) in gensim.

To do: Implement sLDA in a scalable manner in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgements included in gensim, or evaluate in some other way.

Resources: Wang, Blei & Fei's sLDA paper.

Ramage et al.'s Labeled LDA.

Jagarlamudi's Seeded LDA.

Implementation of sLDA on David Blei's page.

Gensim github issue #121.

Explicit Semantic Analysis, ESA

Background: ESA is a different method of unsupervised document analysis, using Wikipedia as a resource. It uses a different type of analysis from the [Bayesian/linear algebra/numerical] models already in gensim, making for a refreshing addition.
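The core idea, sketched (in simplified form) with existing gensim building blocks: each Wikipedia article acts as a "concept", and a document is represented by its TF-IDF similarity to every concept article. Here `concept_corpus` (bag-of-words vectors of the concept articles) and `dictionary` are assumed to exist; a real implementation would add the pruning and inverted-index tricks of the original paper.

```python
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity

tfidf = TfidfModel(concept_corpus)
index = MatrixSimilarity(tfidf[concept_corpus], num_features=len(dictionary))

def esa_vector(doc_bow):
    """ESA representation of a bag-of-words document: one weight per concept article."""
    return index[tfidf[doc_bow]]
```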

To do: Implement ESA in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgements included in gensim, or evaluate in some other way.

Resources: Explicit Semantic Analysis.

There's a tentative pull request on github #107, which is not up to gensim standards and needs further polishing/engineering.

Dynamic Topic Models improvements

Background: Dynamic topic models are generative models that can be used to analyze the evolution of (unobserved) topics of a collection of documents over time. This family of models was proposed by David Blei and John Lafferty and is an extension to Latent Dirichlet Allocation (LDA) that can handle sequential documents. DTM has already been implemented in Gensim by Google Summer of Code student @bhargavvader.

To do: Implement a distributed cluster version of DTM, and a version that can use multiple cores on the same machine. Implement DIM (the Document Influence Model) in gensim and evaluate.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Gensim doesn't include any support for "timed streams", or time tags, at the moment. So part of this project will be engineering a clean API for this new functionality.
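For reference, a minimal sketch of driving the DTM implementation already in gensim (LdaSeqModel); the `time_slice` list, giving the number of documents in each consecutive period, is currently the closest thing to "time tags" the API supports. The slice counts are illustrative and `corpus`/`dictionary` are assumed to exist.

```python
from gensim.models.ldaseqmodel import LdaSeqModel

# time_slice: how many documents fall into each consecutive time period
ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary,
                     time_slice=[100, 120, 90], num_topics=10)
```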

Resources: Dynamic Topic Models.

Original Blei&Lafferty article PDF.

Wang&Blei&Heckerman article on Continuous Time Dynamic Topic Model PDF.

Wang&McCallum: "Topics over time" PDF.

Academic implementation of DTM on David Blei's page.

Gensim implementation of DTM.

Author-Topic Models

Already being worked on by @olavurmortensen.

Background: A generative model for documents that extends LDA to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words.

Applications include computing similarity between authors and the entropy of each author's output.

To do: Implement the author-topic model in gensim and evaluate. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.
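A sketch of how the in-progress model might eventually be driven from gensim; the AuthorTopicModel class name, the author2doc mapping and the get_author_topics call reflect the work in progress and may well change before it lands.

```python
from gensim.models import AuthorTopicModel

author2doc = {"alice": [0, 1, 2], "bob": [2, 3]}  # author -> indices of their documents
model = AuthorTopicModel(corpus=corpus, id2word=dictionary,
                         author2doc=author2doc, num_topics=10)
print(model.get_author_topics("alice"))  # topic distribution for one author
```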

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Resources: "The Author-Topic Model for Authors and Documents" by Rosen-Zvi&al PDF.

Christopher Grainger offered help.

Nested Hierarchical Dirichlet Processes

Background: Paisley, Wang, Blei, and Jordan recently developed a stochastic variational version of the nested HDP. It reportedly performs better than HDP etc. (of course!).

To do: Implement this model (probably extending / replacing the existing online HDP implementation in gensim) and evaluate it. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors), to allow large inputs.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Resources: "Nested Hierarchical Dirichlet Processes" by John Paisley, Chong Wang, David M. Blei and Michael I. Jordan PDF.

Pachinko Allocation Model

Background: Li & McCallum developed a hierarchical LDA-like model for document classification. They report 2-5% accuracy improvements over an LDA model on a test corpus. (http://people.cs.umass.edu/~mccallum/papers/pam-icml06.pdf)

An implementation of this model would give users an additional modelling alternative, which may be helpful in some cases.

An implementation must be heavily unit-tested and production-ready. It would use many of the same classes and methods as the existing LDA implementation, which is a bonus for a first pass at the implementation.

Resources: Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. NIPS.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Diggle, P. J., & Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society B, 46, 193–227.

Li, W., Blei, D., & McCallum, A. (2007). Nonparametric Bayes pachinko allocation.

Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. ICML.

Minka, T. (2000). Estimating a Dirichlet distribution.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. UAI.

Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. ICML.

SparseTools package

A package for working with sparse matrices, built on top of scipy, optimized with Cython, memory-efficient and fast. An improvement on, and replacement for, scipy's recently deprecated sparsetools package.

Should also include faster (Cythonized) transformations between the "gensim streamed corpus" and various formats (scipy.sparse, numpy...). Similar to matutils (https://radimrehurek.com/gensim/matutils.html#gensim.matutils.corpus2csc )
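For reference, the existing pure-Python conversions that would be sped up look like this (assuming a streamed bag-of-words `corpus`):

```python
from gensim import matutils

csc = matutils.corpus2csc(corpus)           # streamed corpus -> scipy.sparse CSC matrix
corpus_again = matutils.Sparse2Corpus(csc)  # scipy.sparse matrix -> streamed corpus
```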

GloVe word-embedding integration

Integrate, or re-write in an optimized way, the GloVe word-embedding code by Maciej Kula (https://github.com/maciejkula/glove-python). A next step would be adding support for the Swivel algorithm.

WordRank

WordRank is a new word embedding algorithm.

Investigate how it compares to word2vec by expanding on the approach in this blog.

Wrapper for BigARTM Topic Modeling package

See BigARTM github page

Add Montemurro and Zanette algorithm

See https://github.com/RaRe-Technologies/gensim/issues/665

LargeVis

A technique for Visualizing Large-scale and High-dimensional Data. Faster than t-SNE!

Code in https://github.com/lferry007/LargeVis

https://arxiv.org/abs/1602.00370

Implement VarEmbed: Use morphology for word-embeddings

Very useful in non-English languages.

The paper mentions some Recurrent Neural Network code using the blocks package.

http://arxiv.org/pdf/1608.01056.pdf

Spectral LDA in Python

Reportedly much better performance than the variational inference currently used to fit LDA.

Either implement in Python or find a way to load the model trained on Spark.

Code and paper references

Intrinsic evaluation of word2vec models with QVec

Shows how good your word2vec model is on specific syntactic and semantic tasks. Write a wrapper around this code: https://github.com/ytsvetko/qvec

Word sense embedding

A sense embedding is able to learn multiple representations per word capturing different word meanings.

Integrate one of the existing word sense embedding methods into gensim. AdaGram is currently the best one.

Low priority, as this rarely appears in production.

Consider:

Adagram

Sensegram

Compare TensorFlow and gensim summarizations

https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html

Implement Wordspace in Python

Translate from R into Python using existing Gensim code. Medium difficulty.

From a gensim issue suggestion: "Hi, it seems that wordspace model is very useful (http://infomap-nlp.sourceforge.net/doc/algorithm.html and https://cran.r-project.org/web/packages/wordspace/index.html). It is similar to the lsa model except that wordspace decomposes a co-occurrence matrix instead of term-document matrix."

Add cuckoo hashing to dictionary to avoid collisions

Change HashDictionary to use cuckoo hashing.

Hat-tip to A. Mueller
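For illustration, a minimal, self-contained toy sketch of cuckoo hashing (not gensim's HashDictionary): every key lives in one of exactly two slots, so lookups take at most two probes and collisions are resolved by displacement rather than silently merging ids.

```python
class CuckooHash(object):
    """Toy cuckoo hash table: every key lives in one of exactly two possible slots."""

    def __init__(self, size=1024):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slot(self, key, which):
        # two independent hash functions, one per table
        return hash((which, key)) % self.size

    def get(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._slot(key, which)]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def put(self, key, value, max_displacements=32):
        which = 0
        for _ in range(max_displacements):
            slot = self._slot(key, which)
            evicted = self.tables[which][slot]
            self.tables[which][slot] = (key, value)
            if evicted is None or evicted[0] == key:
                return
            # the old occupant gets pushed to its alternative slot in the other table
            (key, value), which = evicted, 1 - which
        raise RuntimeError("too many displacements; the table needs rehashing/resizing")
```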

Add word embedding from the "Learning Distributed Word Representations For Bidirectional LSTM Recurrent Neural Network" paper

See paper

Automatic topic labeling

Implement the algorithm from the paper "Automatic Labeling of Multinomial Topic Models" by Qiaozhu Mei et al. Suggestion from Jason Liu.

Joint embedding

A slight modification of word2vec for the purpose of sponsored advertising. See the paper "Joint Embedding of Query and Ad by Leveraging Implicit Feedback".

Implement OHDOCLUS – Online and Hierarchical Document Clustering

See https://github.com/ruiEnca/ohDoclus

Integrate with PhraseMachine - phrase detection based on part-of-speech tagging

See code and paper at https://github.com/slanglab/phrasemachine

Factor analysis algorithms and code

See how it works, compare to existing techniques, maybe get in touch for inclusion / robust reimplementation in gensim.

Code from 2014: https://github.com/cangermueller/vbmfa

Pivoted normalization for tfidf model

See https://github.com/RaRe-Technologies/gensim/issues/220

Word2Vec/Doc2Vec: Implement 'Translation Matrix' of 'Exploiting similarities among languages for machine translation'

Section 4 of Mikolov, Le, & Sutskever's paper on word2vec for machine translation describes a way to map words between two separate vector models, as in the example of word vectors induced for two different natural languages.

Section 2.2 of 'Skip-Thought Vectors' uses a similar technique to bootstrap a larger vocabulary in their model, from a pre-existing larger word2vec model.

The same technique could be valuable for adapting to drifting word representations, when training over large datasets over long timeframes. Specifically: as new information introduces extra words, and newer examples of word usage, older words may (and probably should) relocate for the model to continue to perform optimally on the training task, on more-recent text. (In a sense, words should rearrange to 'make room' for the new words and examples.) As these changes accumulate, older representations (or cached byproducts) may not be directly comparable to the latest representations – unless a translation-matrix-like adjustment is made. (The specifics of the translation may also indicate areas of interest, where usage or meanings are changing rapidly.)

Implementation work by Georgiana Dinu, linked from the word2vec homepage, may be relevant if license-compatible. (Update: In correspondence, Dinu has given approval to re-use that code in gensim, if it's helpful.)

Jason of jxieeducation.com blog has also run an experiment suggesting the usefulness of this approach, in this case using sklearn's Linear Regression to learn the projection.

The Procrustes matrix alignment example code by Ryan Heuser based on HistWords by William Hamilton does something similar and may be of direct use, or use as a model.
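A minimal sketch of the translation-matrix idea: learn a linear map from the source space to the target space on a seed lexicon of anchor word pairs with ordinary least squares, then apply it to new words. Here `source` and `target` are assumed to be separately trained gensim Word2Vec models and `word_pairs` a list of (source_word, target_word) anchors.

```python
import numpy as np

X = np.vstack([source[src] for src, tgt in word_pairs])
Y = np.vstack([target[tgt] for src, tgt in word_pairs])
W = np.linalg.lstsq(X, Y)[0]  # least-squares map so that X.dot(W) ~ Y

def translate(word, topn=5):
    """Project a source-space word into the target space and return its nearest neighbours."""
    return target.similar_by_vector(np.dot(source[word], W), topn=topn)
```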


Word2Vec/Doc2Vec: Add 'Adagrad' Gradient-Descent Option

Some Word2Vec/Doc2Vec papers or projects suggest they've used 'Adagrad' to speed gradient-descent. Having it as an option (for comparative evaluation) and then possibly the default (if it's a clear speed win) would be nice for Word2Vec/Doc2Vec.
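For reference, a minimal sketch of the Adagrad update rule such an option would apply to the word-vector gradients (the helper is illustrative, not existing gensim code): per-parameter learning rates derived from the accumulated squared gradients.

```python
import numpy as np

def adagrad_update(param, grad, cache, lr=0.025, eps=1e-8):
    """One Adagrad step, in place; `cache` accumulates squared gradients across calls."""
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
```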
