Ideas & Feature proposals

A list of ideas for new functionality and projects in Gensim ("topic modelling for humans"), a scientific Python package for efficient, large-scale topic modelling.

Each entry on this page is only a short, initial description of a project. For longer, more academic descriptions of projects, see our Student Projects page. Any of the projects below is nonetheless fit for an Incubator project or for Google Summer of Code; we just haven't had the time yet to expand the descriptions into longer ones.

Gensim's design philosophy builds on data streaming to process very large datasets (larger than RAM; potentially infinite). Data points are processed one at a time, in constant RAM.

This places stringent requirements on the internal algorithms used (online learning, single pass methods) as well as their implementation, to achieve top performance and robustness.
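As a minimal sketch of the streaming idea (not Gensim's own code), a corpus can be any object whose `__iter__` yields one document at a time, so that only the current document is ever held in memory. The names `StreamedCorpus`, `open_source` and `count_tokens` are hypothetical and used for illustration only.

```python
import io


class StreamedCorpus:
    """Yield one tokenized document per line, re-opening the source on every pass."""

    def __init__(self, open_source):
        # open_source: a callable returning a fresh iterator over lines
        # (an assumption of this sketch; a real corpus might open a file).
        self.open_source = open_source

    def __iter__(self):
        for line in self.open_source():
            # Only the current line is held in memory.
            yield line.lower().split()


def count_tokens(corpus):
    """Single pass over the stream, keeping only a running total."""
    total = 0
    for document in corpus:
        total += len(document)
    return total


TEXT = "Human machine interface\nGraph minors survey\n"
corpus = StreamedCorpus(lambda: io.StringIO(TEXT))
print(count_tokens(corpus))
```

Because `__iter__` re-opens the source, the same object can be iterated several times; a plain generator could only be consumed once.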

If you'd like to work on any of the topics below, or have your own ideas, get in touch on the gensim mailing list.

Online NNMF

Background:

Non-negative matrix factorization, NNMF [1], is a popular machine learning algorithm, widely used in collaborative filtering and natural language processing. It can be phrased as an online learning algorithm. [2]

While implementations of NNMF in Python exist [3, 4], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications. You will contribute a scalable implementation of NNMF to the Python data science world. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.

Goals:

Demonstrate understanding of matrix factorization theory and practice, by describing, implementing and evaluating a scalable version of the NNMF algorithm.
Implement streamed NNMF [5] that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory, independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally, also implement a version that can use multiple cores on the same machine.
Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
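To make the mini-batch requirement concrete, here is a rough NumPy sketch of one possible online scheme. It is an assumption of this example, not the algorithm from [5]: each batch is encoded with multiplicative updates while the dictionary `W` is held fixed, and `W` is then refreshed from two accumulated statistics whose size does not depend on the number of documents seen. The function names and all parameters are hypothetical.

```python
import numpy as np


def online_nmf(batches, num_terms, num_topics, inner_iters=50, eps=1e-9, seed=0):
    """Factorize a stream of non-negative (num_terms x batch_size) matrices.

    Memory use is O(num_terms * num_topics), independent of the stream length.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((num_terms, num_topics)) + eps
    A = np.zeros((num_topics, num_topics))  # running sum of H @ H.T
    B = np.zeros((num_terms, num_topics))   # running sum of V @ H.T
    for V in batches:
        # Encode the batch: multiplicative updates for H with W fixed.
        H = rng.random((num_topics, V.shape[1])) + eps
        WtV = W.T @ V
        WtW = W.T @ W
        for _ in range(inner_iters):
            H *= WtV / (WtW @ H + eps)
        # Fold the batch into the sufficient statistics, then update W.
        A += H @ H.T
        B += V @ H.T
        for _ in range(inner_iters):
            W *= B / (W @ A + eps)
    return W


def random_batches(num_batches, num_terms, batch_size, seed=1):
    """Toy stand-in for a document stream."""
    rng = np.random.default_rng(seed)
    for _ in range(num_batches):
        yield rng.random((num_terms, batch_size))


W = online_nmf(random_batches(5, 6, 4), num_terms=6, num_topics=2)
print(W.shape)
```

The multiplicative form keeps every entry of `W` and `H` non-negative without an explicit projection step; a real implementation would also need sparse input support and a convergence criterion.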

Deliverables:

Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights in documentation tips and examples.
Report: timings and accuracy of your NNMF implementation on English Wikipedia and the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your NNMF implementation. You can also evaluate the NNMF factorization quality against other factorization methods, such as SVD and LDA [9] in collaborative filtering settings (optional).

Resources:

[1] NNMF on Wikipedia

[2] Online algorithm

[3] Christian Thurau et al. "Python Matrix Factorisation"

[4] Sklearn NMF code

[5] Online NMF on Wikipedia

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society

[9] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010

[10] Wang, Tan, König, Li. "Efficient Document Clustering via Online Nonnegative Matrix Factorizations." 2011

[11] Topics extraction with Non-Negative Matrix Factorization in sklearn

[12] Gensim github issue #132.

Explicit Semantic Analysis

Background: Explicit Semantic Analysis [1, 2] is a method of unsupervised document analysis using Wikipedia as a resource. It has many applications, for example event classification on Twitter [3].

While implementations of ESA exist in Python [4] and other languages [5], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications.

You will contribute a scalable implementation of ESA to the Python data science world. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.

Goals:

Demonstrate understanding of semantic interpretation theory and practice, by describing, implementing and evaluating a scalable version of the ESA algorithm.
Implement streamed ESA that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory, independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally, implement a version that can use multiple cores on the same machine.
Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
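For orientation, the core idea of ESA can be sketched in a few lines: every concept (a Wikipedia article in the original method) is described by TF-IDF weighted words, and a text is interpreted as a weighted vector over those concepts. The toy below is an in-memory illustration only, with hypothetical function names and a made-up three-concept "Wikipedia"; it is neither streamed nor the weighting scheme of any particular ESA implementation.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(concepts):
    """Map each word to {concept title: TF-IDF weight}.

    concepts: iterable of (title, list of tokens) pairs.
    """
    term_counts = {}
    doc_freq = Counter()
    for title, tokens in concepts:
        counts = Counter(tokens)
        term_counts[title] = counts
        doc_freq.update(counts.keys())
    num_concepts = len(term_counts)
    index = defaultdict(dict)
    for title, counts in term_counts.items():
        for word, tf in counts.items():
            idf = math.log(num_concepts / doc_freq[word])
            if idf > 0:  # words present in every concept carry no signal
                index[word][title] = tf * idf
    return index


def interpret(tokens, index):
    """Return the concept vector of a text as {concept title: score}."""
    scores = defaultdict(float)
    for word, tf in Counter(tokens).items():
        for title, weight in index.get(word, {}).items():
            scores[title] += tf * weight
    return dict(scores)


concepts = [
    ("Cat", ["cat", "purr", "pet"]),
    ("Dog", ["dog", "bark", "pet"]),
    ("Car", ["car", "engine", "wheel"]),
]
index = build_inverted_index(concepts)
scores = interpret(["cat", "pet"], index)
print(max(scores, key=scores.get))
```

A scalable version would have to build and update the inverted index from a stream of articles instead of holding all term counts in a dictionary.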

Deliverables:

Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize your insights in documentation tips and examples.
Report: timings and accuracy of your ESA implementation on the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your ESA implementation. You can also evaluate the ESA against other methods of semantic analysis, such as Latent Semantic Analysis [9, 10] in an event classification task (optional).

Resources:

[1] Evgeniy Gabrilovich and Shaul Markovitch "Wikipedia-based Semantic Interpretation for Natural Language Processing." Journal of Artificial Intelligence Research, 34:443–498, 2009

[2] Explicit Semantic Analysis.

[3] Musaev, A.; De Wang; Shridhar, S.; Chien-An Lai; Pu, C. "Toward a Real-Time Service for Landslide Detection: Augmented Explicit Semantic Analysis and Clustering Composition Approaches." 2015 IEEE International Conference on Web Services (ICWS), pp. 511-518, June 27 - July 2, 2015

[4] Python implementation of ESA

[5] Gabrilovich's page on ESA

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society

[9] "Latent Semantic Analysis" article on Wikipedia

[10] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188

Distributed computing

Background: Gensim contains distributed implementations of several algorithms. The implementations use Pyro4 for network communication and are fairly low-level.

To do: Investigate + integrate one of the higher level frameworks for distributed computation, so that gensim can plug into them without reinventing the wheel. Implement one of the algorithms in gensim in this framework: for example, adapt the distributed online Latent Semantic Analysis or online LDA that are already in gensim.

The solution must support online computation (data coming as a non-repeatable stream of documents); batch processing is not enough.

Integration with a framework that plays well with Python (i.e. avoids Spark's serialisation to disk) would be better, so Ibis is a good candidate.

Resources: Ibis, Celery, Spark, Disco, Storm, Samza. Get in touch on the mailing list/@radimrehurek/@ogrisel.

Sanity checks

Background: Gensim newbies sometimes mistakenly pass a single document where a whole corpus is expected, or pass strings where a list of tokens is expected.

This results in runtime errors, which confuse novices. Even worse, where gensim expects a sequence (~list of tokens) and the user passes a string, there is no error (a string is also an iterable!) but the individual letters of the string are silently treated as whole tokens, leading to unexpected results.

To do: Collect/survey common newbie errors and then implement sanity checks that catch them early. This could be a method/set of methods that accept an object and check whether it's really a streamed corpus, whether it's empty, contains strange ids, ids in right order, strange feature weights... In case there's anything fishy, log a warning.

Resources: utils.is_corpus() checks whether the input is a corpus, without destroying the data, in case the content is streamed and non-repeatable. Can serve as a starting point template.
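A rough sketch of what such checks could look like is given below. The function names `check_document` and `check_bow` are hypothetical and not part of Gensim; the sketch only covers already-materialized documents, not non-repeatable streams.

```python
import logging
import math

logger = logging.getLogger(__name__)


def _report(problems):
    """Log every detected problem as a warning and pass the list through."""
    for problem in problems:
        logger.warning(problem)
    return problems


def check_document(doc):
    """Check that `doc` is a list of string tokens, not a raw string."""
    problems = []
    if isinstance(doc, (str, bytes)):
        # A string is iterable, so it would be silently split into letters.
        problems.append("document is a plain string; expected a list of tokens")
    elif not isinstance(doc, (list, tuple)):
        problems.append("document is a %s; expected a list of tokens" % type(doc).__name__)
    elif len(doc) == 0:
        problems.append("document is empty")
    elif not all(isinstance(token, str) for token in doc):
        problems.append("document contains non-string tokens")
    return _report(problems)


def check_bow(bow):
    """Check a bag-of-words vector given as a list of (feature id, weight) pairs."""
    problems = []
    ids = [feature_id for feature_id, _ in bow]
    if any(feature_id < 0 for feature_id in ids):
        problems.append("negative feature id")
    if ids != sorted(ids):
        problems.append("feature ids are not in ascending order")
    if any(not math.isfinite(weight) for _, weight in bow):
        problems.append("non-finite feature weight")
    return _report(problems)


# The silent failure described above: a string "tokenizes" into letters.
print(list("human"))
print(check_document("human interface"))
```

Logging a warning rather than raising keeps existing user code running, which matches the "log a warning" suggestion above.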

Documentation Tooling

Difficulty: Medium; requires excellent UX skills and native-level English

Background: We already have a large number of models, so we want to pay more attention to model quality, with documentation and model discovery being the main concerns. If we have a great model but users don't know how (or when) to use it, they won't use it! For this reason, we want to significantly improve our documentation.

To do:

[already underway, WIP] Consistent docstrings for all methods and classes in Gensim
An updated "beginner tutorial chain": an API-centric walk through the Gensim terminology, design choices, ways of doing things the Gensim way, best practices, FAQ
Use-case-centric user guides for major models and use-case pipelines (sphinx-gallery), focusing on how to solve a concrete popular task X
A new, slick project website: the current website https://radimrehurek.com/gensim/ is very popular in terms of visitors, but looks embarrassingly dated.
Improved UX: analysis of visitor flow, minimizing clicks for common documentation patterns, a logical structure for all documentation, intuitive navigation, improving information discovery for the different types of visitors (newbies, API docs, use-case docs, power users…)

Resources:

SparseTools package

See https://github.com/scikit-learn/scikit-learn/issues/6186

A package for working with sparse matrices, built on top of scipy, optimized with Cython, memory-efficient and fast. An improvement on, and replacement for, scipy's recently deprecated sparsetools package.

Should also include faster (Cythonized) transformations between the "gensim streamed corpus" and various formats (scipy.sparse, numpy...), similar to matutils (https://radimrehurek.com/gensim/matutils.html#gensim.matutils.corpus2csc).
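To illustrate the kind of transformation meant here, the following pure-Python sketch turns a streamed bag-of-words corpus into the three arrays of the compressed sparse column (CSC) layout in a single pass, with documents as columns. The function name `corpus_to_csc_arrays` is hypothetical, and this is not how matutils implements it; a Cythonized version would fill preallocated arrays instead of Python lists.

```python
def corpus_to_csc_arrays(corpus):
    """Convert an iterable of [(term_id, weight), ...] documents to CSC arrays.

    Returns (data, indices, indptr), where column j of the matrix is stored in
    data[indptr[j]:indptr[j + 1]] with row ids indices[indptr[j]:indptr[j + 1]].
    """
    data, indices, indptr = [], [], [0]
    for document in corpus:
        # CSC expects row indices sorted within each column.
        for term_id, weight in sorted(document):
            indices.append(term_id)
            data.append(float(weight))
        indptr.append(len(indices))
    return data, indices, indptr


# A one-shot iterator stands in for a non-repeatable stream.
stream = iter([[(0, 1.0), (2, 2.0)], [], [(1, 3.0)]])
data, indices, indptr = corpus_to_csc_arrays(stream)
print(data, indices, indptr)
```

If SciPy is available, the three arrays can be handed to `scipy.sparse.csc_matrix((data, indices, indptr), shape=(num_terms, num_docs))`, with the shape supplied by the caller.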

Word sense embedding

A sense embedding is able to learn multiple representations per word capturing different word meanings.

Integrate one of the existing word sense embeddings into gensim. Adagram is the best one currently.

Low priority, as this rarely appears in production.

Consider:

Adagram

Sensegram

Add cuckoo hashing to dictionary to avoid collisions

Change HashDictionary to use cuckoo hashing.

Hat-tip to A. Mueller
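For readers unfamiliar with the technique, here is a small self-contained sketch of cuckoo hashing for string tokens. It is an illustration only: the class name `CuckooTable`, the MD5-based hash functions and the doubling strategy are assumptions of this example and say nothing about how HashDictionary would actually be changed.

```python
import hashlib


class CuckooTable:
    """Two hash tables; every token lives in one of its two candidate slots."""

    def __init__(self, capacity=16, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.slots = [[None] * capacity, [None] * capacity]

    def _index(self, which, token):
        # Two different hash functions, obtained by prefixing the table number.
        digest = hashlib.md5(bytes([which]) + token.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big") % self.capacity

    def lookup(self, token):
        """Return the token's current position, or None if it is absent."""
        for which in (0, 1):
            idx = self._index(which, token)
            if self.slots[which][idx] == token:
                return which * self.capacity + idx
        return None

    def insert(self, token):
        if self.lookup(token) is not None:
            return
        current, which = token, 0
        for _ in range(self.max_kicks):
            idx = self._index(which, current)
            # Place `current`, picking up whatever occupied the slot before.
            current, self.slots[which][idx] = self.slots[which][idx], current
            if current is None:
                return
            which = 1 - which  # the evicted token moves to the other table
        self._grow(current)

    def _grow(self, homeless):
        """Double the capacity and re-insert everything, including `homeless`."""
        tokens = [t for table in self.slots for t in table if t is not None]
        tokens.append(homeless)
        self.capacity *= 2
        self.slots = [[None] * self.capacity, [None] * self.capacity]
        for token in tokens:
            self.insert(token)


table = CuckooTable(capacity=4)
tokens = ["token%d" % i for i in range(100)]
for token in tokens:
    table.insert(token)
positions = [table.lookup(token) for token in tokens]
print(len(set(positions)))
```

Note one design consequence: a token's position can change whenever it is evicted or the table grows, so positions could not be used directly as stable feature ids without an extra level of indirection.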

Add word embedding from "Learning Distributed Word Representations For Bidirectional LSTM Recurrent Neural Network" paper

See paper

Automatic topic labeling

Implement the algorithm from the paper "Automatic Labeling of Multinomial Topic Models" by Qiaozhu Mei et al.

Suggestion from Jason Liu.

Joint embedding

Slight modification of word2vec for the purpose of sponsored advertising. See this paper "Joint Embedding of Query and Ad by Leveraging Implicit Feedback"

Implement OHDOCLUS – Online and Hierarchical Document Clustering

See https://github.com/ruiEnca/ohDoclus

Integrate with PhraseMachine - phrase detection based on part of speech tagging

See code and paper at https://github.com/slanglab/phrasemachine

Factor analysis algorithms and code

See how it works, compare to existing techniques, maybe get in touch for inclusion / robust reimplementation in gensim.

Code from 2014: https://github.com/cangermueller/vbmfa

Pivoted normalization for tfidf model

See https://github.com/RaRe-Technologies/gensim/issues/220

Word2Vec/Doc2Vec: Implement 'Translation Matrix' of 'Exploiting similarities among languages for machine translation'

Section 4 of Mikolov, Le, & Sutskever's paper on word2vec for machine translation describes a way to map words between two separate vector models, as in the example of word vectors induced for two different natural languages.
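The core of the technique is a linear least-squares fit: given vectors for a set of word pairs known to correspond, find the matrix that best maps the source vectors onto the target vectors. A minimal NumPy sketch follows; the function name `fit_translation_matrix` and the synthetic data are hypothetical, and vectors are stored as rows here, so the mapping is applied as `X @ W`.

```python
import numpy as np


def fit_translation_matrix(source_vectors, target_vectors):
    """Least-squares W minimizing ||source_vectors @ W - target_vectors||^2.

    source_vectors: (num_pairs, source_dim), one row per anchor word
    target_vectors: (num_pairs, target_dim), rows aligned with source_vectors
    """
    W, _residuals, _rank, _singular_values = np.linalg.lstsq(
        source_vectors, target_vectors, rcond=None
    )
    return W


# Synthetic stand-in for two embedding spaces related by a linear map.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))       # "source language" vectors
true_W = rng.normal(size=(4, 3))   # hidden mapping used to build the targets
Z = X @ true_W                     # "target language" vectors
W = fit_translation_matrix(X, Z)
print(np.allclose(W, true_W))
```

To translate a new word, its source vector would be multiplied by `W` and the nearest neighbours of the result looked up in the target model.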

Section 2.2 of 'Skip-Thought Vectors' uses a similar technique to bootstrap a larger vocabulary in their model, from a pre-existing larger word2vec model.

The same technique could be valuable for adapting to drifting word representations, when training over large datasets over long timeframes. Specifically: as new information introduces extra words, and newer examples of word usage, older words may (and probably should) relocate for the model to continue to perform optimally on the training task, on more-recent text. (In a sense, words should rearrange to 'make room' for the new words and examples.) As these changes accumulate, older representations (or cached byproducts) may not be directly comparable to the latest representations – unless a translation-matrix-like adjustment is made. (The specifics of the translation may also indicate areas of interest, where usage or meanings are changing rapidly.)

Implementation work by Georgiana Dinu, linked from the word2vec homepage, may be relevant if license-compatible. (Update: In correspondence, Dinu has given approval to re-use that code in gensim, if it's helpful.)

Implementation with normal equations: in a paper by Andrey Kutuzov, this was successfully used with Gensim to 'translate' between Ukrainian and Russian. Code. Can be easily integrated into Gensim.

Jason of jxieeducation.com blog has also run an experiment suggesting the usefulness of this approach, in this case using sklearn's Linear Regression to learn the projection.

The Procrustes matrix alignment example code by Ryan Heuser based on HistWords by William Hamilton does something similar and may be of direct use, or use as a model.

Word2Vec/Doc2Vec: Add 'Adagrad' Gradient-Descent Option

Some Word2Vec/Doc2Vec papers or projects suggest they've used 'Adagrad' to speed up gradient descent. Having it as an option (for comparative evaluation) and then possibly the default (if it's a clear speed win) would be nice for Word2Vec/Doc2Vec.
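For reference, Adagrad divides a global learning rate by the square root of each parameter's accumulated squared gradients, so frequently updated parameters take smaller and smaller steps. The sketch below applies that rule to a toy one-dimensional problem; the function name and hyperparameters are hypothetical, and nothing here reflects how Word2Vec training in gensim is structured.

```python
import math


def adagrad_step(weights, grads, grad_sq_sums, learning_rate=0.5, eps=1e-8):
    """Update `weights` in place using per-parameter adaptive step sizes."""
    for i, grad in enumerate(grads):
        grad_sq_sums[i] += grad * grad
        weights[i] -= learning_rate * grad / (math.sqrt(grad_sq_sums[i]) + eps)


# Toy objective: f(w) = (w - 3)^2, with gradient 2 * (w - 3).
weights = [0.0]
grad_sq_sums = [0.0]
steps = []
for _ in range(2):
    before = weights[0]
    adagrad_step(weights, [2 * (weights[0] - 3)], grad_sq_sums)
    steps.append(weights[0] - before)
print(steps)
```

The per-parameter accumulator is the extra state an implementation would have to store alongside each vector, which is the main memory cost of offering this option.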

Test Online Word2vec better

From a mailing list comment:

The testing that's occurred with this new feature has really only verified that new tokens are available with at-a-glance somewhat-meaningful vectors. The effect on existing tokens, or relations with tokens that don't appear in later training batches, hasn't been evaluated. (I'm also not sure it's doing the best thing with respect to features like frequent-word downsampling.)

Unsupervised Seq2seq

Implement in tensorflow. Reproduce this paper or try another GAN model

Decompose dense embeddings into sparse interpretable components

Similar to this blog on images of faces

Structural Topic Models

See https://github.com/RaRe-Technologies/gensim/issues/1038

LazySVD

Fast SVD algorithm. Requires knowledge of C and algorithmic optimizations.

https://arxiv.org/abs/1607.03463

Unsupervised segmentation into blocks/sentences/words

See discussion in https://github.com/RaRe-Technologies/gensim/issues/1135#issuecomment-277529491

Integration with shorttext supervised learning package

This integration with sklearn and keras should be a part of gensim: https://github.com/stephenhky/PyShortTextCategorization

SentencePiece: Unsupervised language-agnostic tokenization

Google has recently released code for the SentencePiece algorithm.

Implement a gensim wrapper if it benchmarks well against supervised tokenization.

It fits in the same space as the existing Gensim module Phrases. That module is useful to a lot of people even though it is simple and general.

Port ldatuning metrics to gensim

Very useful metrics for selecting the number of topics in LDA http://rpubs.com/siri/ldatuning

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge

Implement a gensim version of this algorithm. Main features: topic seeding through "anchor words" and hierarchical TM.

Paper: Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge

Implementation: /gregversteeg/corex_topic
