Online NMF #2007

Merged
merged 161 commits into from
Jan 17, 2019
Changes from 92 commits
343e46f
Implement first version of the algorithm
anotherbugmaster Mar 29, 2018
3171be3
Fix variable names
anotherbugmaster Mar 30, 2018
bd325bc
Add support for streaming corpora
anotherbugmaster Apr 2, 2018
19b3ba4
Add benchmark
anotherbugmaster Apr 2, 2018
9e52399
Fix bugs, introduce batches, add images to the benchmark notebook
anotherbugmaster Apr 15, 2018
c54fc92
Update notebook
anotherbugmaster Apr 22, 2018
6dc9d3e
Improve model
anotherbugmaster Apr 22, 2018
0554b7b
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Apr 22, 2018
5f4b3d3
Add show topics, change API
anotherbugmaster Apr 23, 2018
52fc956
Add more LDA-like API
anotherbugmaster Apr 23, 2018
ddebcf0
Fix logger name
anotherbugmaster Apr 23, 2018
6d0a1b3
Add more LDA API
anotherbugmaster Apr 23, 2018
cf430fc
Remove redundant method
anotherbugmaster Apr 23, 2018
df5a6e9
Remove commented out lines
anotherbugmaster Apr 23, 2018
25080b4
Fix flakes
anotherbugmaster Apr 23, 2018
83b1a6b
Cythonize
anotherbugmaster May 2, 2018
7f27f52
Dramatically improve performance
anotherbugmaster May 22, 2018
405e12f
Add parameters, improve accuracy and speed
anotherbugmaster Jun 2, 2018
7b45b23
Remove redundant W copying
anotherbugmaster Jun 5, 2018
a154a6e
Fix random seed again
anotherbugmaster Jun 5, 2018
e82628d
Optimize E/M step
anotherbugmaster Jun 12, 2018
1ca33f8
Add an eval_every option, use softmax for normalization
anotherbugmaster Jun 13, 2018
f19e6ce
Fixes
anotherbugmaster Jun 13, 2018
583cb15
Improve notebook examples a bit
anotherbugmaster Jun 13, 2018
fe0ab0a
Fix eval_every
anotherbugmaster Jun 13, 2018
8e647a1
Return outliers
anotherbugmaster Jun 16, 2018
89cc803
Optimizations
anotherbugmaster Jun 16, 2018
bbd3099
Experimenting with loss
anotherbugmaster Jun 16, 2018
f71ad89
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Aug 14, 2018
936e629
Fix PEP8
anotherbugmaster Aug 14, 2018
1c3a064
Return nmf import
anotherbugmaster Aug 14, 2018
ce4b7ee
Revert "Return nmf import"
anotherbugmaster Aug 20, 2018
f8de1d9
Fix
anotherbugmaster Aug 27, 2018
df9b8c7
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Aug 27, 2018
d159779
Fix minimum_probability & info -> debug logs
anotherbugmaster Aug 27, 2018
3dcdedc
Compute metrics
anotherbugmaster Aug 27, 2018
f11f2e2
Count error on-the-fly
anotherbugmaster Aug 28, 2018
8216541
Speed optimizations, changed error functions
anotherbugmaster Aug 28, 2018
ee3a7c7
Beat LDA
anotherbugmaster Aug 28, 2018
a3315f2
Outperform sklearn in speed (WTF)
anotherbugmaster Aug 28, 2018
3a03ff9
Remove redundant arg
anotherbugmaster Aug 28, 2018
70619e1
Add Olivietti faces
anotherbugmaster Aug 28, 2018
8c47ce0
Remove redundant code
anotherbugmaster Aug 28, 2018
e291664
Add Topics
anotherbugmaster Aug 28, 2018
3302b92
Make it pretty
anotherbugmaster Aug 28, 2018
5616bd6
Fix wrapper
anotherbugmaster Aug 28, 2018
ed8f29f
Save corpus & dict, minor fixes
anotherbugmaster Aug 30, 2018
2117c90
Add RandomCorpus
anotherbugmaster Aug 31, 2018
950115d
Dense -> sparse
anotherbugmaster Aug 31, 2018
54993c6
First doc2dense
anotherbugmaster Aug 31, 2018
572dc6c
Fix csc again
anotherbugmaster Aug 31, 2018
d40d89f
Fix len
anotherbugmaster Aug 31, 2018
7a3ef47
Experimenting
anotherbugmaster Sep 12, 2018
f94de09
Revert "Experimenting"
anotherbugmaster Sep 12, 2018
9ed2167
Fix evaluation
anotherbugmaster Sep 12, 2018
ad9443f
Sparse speedup
anotherbugmaster Sep 23, 2018
1a04660
Improve performance
anotherbugmaster Sep 25, 2018
87981bf
Divide A and B again
anotherbugmaster Sep 25, 2018
0b314c7
Fix A and B computation bug
anotherbugmaster Sep 25, 2018
b024dd6
Sparsify W init
anotherbugmaster Sep 25, 2018
35d5406
Experimenting
anotherbugmaster Sep 25, 2018
74acb37
New norm
anotherbugmaster Sep 25, 2018
8b28675
Sparse threshold -> sparse coefficient
anotherbugmaster Sep 25, 2018
588ef6a
Optimize residuals computation
anotherbugmaster Sep 26, 2018
8f84758
Fix residuals bug
anotherbugmaster Sep 26, 2018
8a67c44
W speedup
anotherbugmaster Sep 26, 2018
560f2bf
Experiment
anotherbugmaster Sep 26, 2018
cac2590
Revert changes a bit
anotherbugmaster Sep 26, 2018
060ab28
Fix corpus
anotherbugmaster Sep 26, 2018
cde937f
Fix init error|
anotherbugmaster Sep 26, 2018
66b753f
Merge branch 'online_nmf' of github.com:anotherbugmaster/gensim into …
anotherbugmaster Sep 26, 2018
18dbb6b
Resolve conflict
anotherbugmaster Sep 26, 2018
4b49d26
Fix corpus iteration issue
anotherbugmaster Sep 26, 2018
9c6cbc6
Switch to numpy algos
anotherbugmaster Oct 7, 2018
b23d016
Merge upstream
anotherbugmaster Oct 7, 2018
74ba37d
Train on wikipedia
anotherbugmaster Oct 7, 2018
c943264
Sparse coef -> density. More stable way to sparsify W matrix
anotherbugmaster Oct 9, 2018
a489807
Merge branch 'online_nmf' of github.com:anotherbugmaster/gensim into …
anotherbugmaster Oct 9, 2018
a95e345
Return old sparse algo
anotherbugmaster Oct 9, 2018
0f90484
Max
anotherbugmaster Oct 9, 2018
6ae43e4
Optimizations
anotherbugmaster Oct 10, 2018
335170b
Fix A and B computation
anotherbugmaster Oct 10, 2018
4cc8f1b
Fix A and B normalization
anotherbugmaster Oct 10, 2018
5c6fe60
Add random_state
anotherbugmaster Oct 23, 2018
dd459a2
Infer id2word
anotherbugmaster Oct 23, 2018
5121d85
Fix tests
anotherbugmaster Nov 6, 2018
5f4018a
Document __init__
anotherbugmaster Nov 14, 2018
dbd8474
Document whole nmf
anotherbugmaster Nov 14, 2018
5904f10
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Nov 14, 2018
cd4b9b0
Remove unnecessary comments
anotherbugmaster Nov 14, 2018
53a02a9
Add tutorial notebook
anotherbugmaster Nov 14, 2018
937e340
Document __init__
anotherbugmaster Nov 20, 2018
26a87bd
Fix flake version
anotherbugmaster Nov 28, 2018
261c13a
Fix flake warning
anotherbugmaster Nov 28, 2018
0147afc
Remove comments, reverse parallelization order
anotherbugmaster Nov 28, 2018
1ece3c1
Add NMF's cython extension to setup.py
anotherbugmaster Nov 28, 2018
e6409fa
Fix imports, add solve_r function
anotherbugmaster Nov 28, 2018
0743624
Remove comments
anotherbugmaster Nov 28, 2018
fd8088b
Add docstrings
anotherbugmaster Nov 28, 2018
e4ba0de
Common corpus and common dictionary
anotherbugmaster Nov 28, 2018
8537eef
Remove redundant test
anotherbugmaster Nov 28, 2018
d2e8385
Add signature flag
anotherbugmaster Nov 28, 2018
b72bf39
Add files to manifest
anotherbugmaster Nov 28, 2018
ed080a3
Fix flake8
anotherbugmaster Nov 29, 2018
67f6e75
Fix atol value
anotherbugmaster Nov 29, 2018
ee4373d
Implement top topics
anotherbugmaster Nov 29, 2018
d01c88c
Add rst files
anotherbugmaster Dec 10, 2018
8111080
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Dec 11, 2018
3de3646
Fix appveyor issue
anotherbugmaster Dec 11, 2018
183ea2d
Fix cython error
anotherbugmaster Dec 11, 2018
d2ac199
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Dec 12, 2018
2d664c6
Fix fmax/fmin not being on win-python27
anotherbugmaster Dec 12, 2018
c9a3577
Add word transformation test
anotherbugmaster Dec 12, 2018
fd0de20
Improve readability of residuals computation
anotherbugmaster Dec 21, 2018
fa384f2
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Dec 21, 2018
a811c67
Fix tests
anotherbugmaster Dec 21, 2018
d063a4f
A few fixes
anotherbugmaster Dec 21, 2018
b8f5d79
Blank line at the end of each docstring
anotherbugmaster Dec 21, 2018
361d160
Add blank line
anotherbugmaster Dec 21, 2018
e214582
Add the paper reference
anotherbugmaster Dec 21, 2018
9527f39
Fix long line
anotherbugmaster Dec 21, 2018
e1e1168
Add log_perplexity
anotherbugmaster Dec 30, 2018
3bf5be3
Merge remote-tracking branch 'remotes/upstream/develop' into online_nmf
anotherbugmaster Jan 7, 2019
d1c6e3e
Add NMF and LDA comparison table
anotherbugmaster Jan 9, 2019
7927b6b
Change the sign of log perplexity
anotherbugmaster Jan 9, 2019
1c6517e
Add Sklearn NMF comparison
anotherbugmaster Jan 9, 2019
278fb05
Merge sklearn and tm tables
anotherbugmaster Jan 9, 2019
a330327
Add F1
anotherbugmaster Jan 10, 2019
7ba9b84
Remove _solve_r
anotherbugmaster Jan 10, 2019
a14bfd3
Merge tutorial and benchmark
anotherbugmaster Jan 10, 2019
d28aef3
Identation's back
anotherbugmaster Jan 10, 2019
83ec0f6
Optimize optimizers
anotherbugmaster Jan 10, 2019
d25332f
Remove unnecessary pic
anotherbugmaster Jan 10, 2019
0e711d9
Optimize memory consumption
anotherbugmaster Jan 10, 2019
cc3085c
Add docstring
anotherbugmaster Jan 10, 2019
b090b6b
Optimize get_topic_words
anotherbugmaster Jan 10, 2019
e05a1c6
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Jan 10, 2019
ba8ce1c
Fix tests
anotherbugmaster Jan 10, 2019
6d78f83
Fix flake8
anotherbugmaster Jan 10, 2019
b16c1dd
Add missing test
anotherbugmaster Jan 11, 2019
7c1e240
Code review fixes
anotherbugmaster Jan 11, 2019
667ae99
n_tokens -> num_tokens
anotherbugmaster Jan 11, 2019
251d5f9
[skip ci] Add explicit normalize parameter
anotherbugmaster Jan 11, 2019
7a3f358
[skip ci] Add explicit normalize parameter[2]
anotherbugmaster Jan 11, 2019
c663f33
[skip ci] Update tutorial notebook
anotherbugmaster Jan 11, 2019
8e15cd4
[skip ci] [WIP] Update wikipedia notebook
anotherbugmaster Jan 11, 2019
3c76171
Merge branch 'online_nmf' of github.com:anotherbugmaster/gensim into …
anotherbugmaster Jan 15, 2019
4941745
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Jan 15, 2019
c4d6ebd
Add more description and metrics
anotherbugmaster Jan 15, 2019
3b1195d
[skip ci] Fix log_probabiliy
anotherbugmaster Jan 15, 2019
5edec1b
Multiple format fixes in notebook, outputs cleared til tomorrow
anotherbugmaster Jan 15, 2019
33ce1a3
Merge remote-tracking branch 'upstream/develop' into online_nmf
menshikh-iv Jan 16, 2019
1806bf6
Train on full corpus
anotherbugmaster Jan 16, 2019
3b9b8ea
Merge branch 'online_nmf' of github.com:anotherbugmaster/gensim into …
anotherbugmaster Jan 16, 2019
3f1af1d
[skip ci] Remove disclaimer
anotherbugmaster Jan 16, 2019
38143a9
Add RAM usage stats
anotherbugmaster Jan 16, 2019
72a02db
Native 20-newsgroups and additional text
anotherbugmaster Jan 16, 2019
7cf80e1
Truncate outputs
anotherbugmaster Jan 17, 2019
72178c0
Merge remote-tracking branch 'upstream/develop' into online_nmf
anotherbugmaster Jan 17, 2019
467a2ad
Fix last cell formatting
anotherbugmaster Jan 17, 2019
e34b939
[skip ci] Change model hyperparameters back
anotherbugmaster Jan 17, 2019
3,513 changes: 3,513 additions & 0 deletions docs/notebooks/nmf-wikipedia.ipynb

Large diffs are not rendered by default.

1,159 changes: 1,159 additions & 0 deletions docs/notebooks/nmf_benchmark.ipynb

Large diffs are not rendered by default.

302 changes: 302 additions & 0 deletions docs/notebooks/nmf_tutorial.ipynb
@@ -0,0 +1,302 @@
{

Missing:

  • how to infer a vector for a word / document
  • description of the internal state (the matrices: where and what is stored)
  • benchmark table / plot
  • (?) Keras example?

@menshikh-iv menshikh-iv Jan 11, 2019

Still missing / needs fixing:

  • what NMF provides in addition (normalization for using it as a topic model)
  • the "Training" header is duplicated
  • Can you be a bit more "descriptive" (not only headers, but also text descriptions)?
  • The benchmark requires some conclusions (don't just show a table; there are many metrics and many parameters)
  • Parameter tuning
  • What happens with the gensim NMF faces in the notebook?

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial on Online Non-Negative Matrix Factorization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook explains the basic ideas behind the NMF implementation and walks through training examples and use cases."
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 2
},
"source": [
"**Matrix factorizations** are useful for many things: recommendation systems, bi-clustering, image compression and, in particular, topic modeling.\n",
"\n",
"Why **Non-Negative**? The non-negativity constraint makes the problem stricter and allows us to apply some optimizations.\n",
"\n",
"Why **Online**? Because corpora are large and RAM is limited. Online NMF learns topics iteratively, one chunk of documents at a time.\n",
"\n",
"This particular implementation is based on [this paper](https://arxiv.org/abs/1604.02634)."
]
},
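To make the "non-negative" and "online" ideas in the cell above concrete, here is a minimal NumPy sketch. It is not the code from this PR: `solve_h`, `online_nmf`, the step-size rule and the iteration counts are hypothetical choices made for illustration. The sketch assumes the usual online scheme of keeping two small running statistics, `A` (sum of `H @ H.T`) and `B` (sum of `V @ H.T`), so each chunk of documents can be discarded after it is processed.

```python
import numpy as np


def solve_h(V, W, n_iter=50, eps=1e-10):
    """For a fixed W, find H >= 0 that roughly minimizes ||V - W @ H||_F.

    Uses multiplicative updates, which keep H non-negative by construction.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        H *= WtV / (WtW @ H + eps)
    return H


def online_nmf(chunks, n_terms, n_topics, seed=0):
    """Learn W from column-chunks of V, holding only W, A and B in memory."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_terms, n_topics))
    A = np.zeros((n_topics, n_topics))  # running sum of H @ H.T
    B = np.zeros((n_terms, n_topics))   # running sum of V @ H.T
    for V in chunks:
        H = solve_h(V, W)
        A += H @ H.T
        B += V @ H.T
        # Projected gradient steps on 0.5 * tr(W A W^T) - tr(W B^T);
        # the gradient is W @ A - B, and clipping at 0 keeps W non-negative.
        lr = 1.0 / (np.linalg.norm(A) + 1e-10)
        for _ in range(20):
            W = np.maximum(W - lr * (W @ A - B), 0.0)
    return W
```

A call such as `online_nmf(np.array_split(V, 4, axis=1), n_terms=V.shape[0], n_topics=3)` processes a term-document matrix `V` in four chunks; memory use depends on the vocabulary size and the number of topics, not on the number of documents.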
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"from gensim import matutils\n",
"from gensim.models.nmf import Nmf\n",
"from gensim.models import CoherenceModel\n",
"from gensim.parsing.preprocessing import preprocess_string\n",
"from sklearn.datasets import fetch_20newsgroups"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset preprocessing"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"categories = [\n",
"    'alt.atheism',\n",
"    'comp.graphics',\n",
"    'rec.motorcycles',\n",
"    'talk.politics.mideast',\n",
"    'sci.space'\n",
"]\n",
"\n",
"trainset = fetch_20newsgroups(subset='train', categories=categories, random_state=42)\n",
"testset = fetch_20newsgroups(subset='test', categories=categories, random_state=42)\n",
"\n",
"train_documents = [preprocess_string(doc) for doc in trainset.data]\n",
"test_documents = [preprocess_string(doc) for doc in testset.data]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dictionary compilation"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"from gensim.corpora import Dictionary\n",
"\n",
"dictionary = Dictionary(train_documents)\n",
"\n",
"dictionary.filter_extremes()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Corpora compilation"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"train_corpus = [\n",
"    dictionary.doc2bow(document)\n",
"    for document\n",
"    in train_documents\n",
"]\n",
"\n",
"test_corpus = [\n",
"    dictionary.doc2bow(document)\n",
"    for document\n",
"    in test_documents\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training\n",
"\n",
"The API is similar to that of [gensim.models.LdaModel](https://radimrehurek.com/gensim/models/ldamodel.html).\n",
"\n",
"Specific parameters:\n",
"\n",
"- `use_r` - whether to use residuals. Effectively adds regularization to the model.\n",
"- `kappa` - optimizer step size coefficient.\n",
"- `lambda_` - residuals coefficient. The larger it is, the more regularized the result gets.\n",
"- `sparse_coef` - sparsity coefficient of the internal matrices. The larger it is, the faster and less accurate training is."
]
},
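One way to picture what `use_r` and `lambda_` in the cell above control, assuming the residuals carry an L1 penalty as in robust NMF formulations (the code in this PR may differ in detail): for fixed `W` and `h`, the residual `r` minimizing `0.5 * ||v - Wh - r||^2 + lambda_ * ||r||_1` is a soft-thresholded reconstruction error. The helper name `soft_threshold_residuals` is hypothetical.

```python
import numpy as np


def soft_threshold_residuals(v, Wh, lambda_):
    """Closed-form minimizer of 0.5 * ||v - Wh - r||^2 + lambda_ * ||r||_1 over r."""
    error = v - Wh
    # Shrink every error towards zero by lambda_; errors smaller than lambda_ vanish.
    return np.sign(error) * np.maximum(np.abs(error) - lambda_, 0.0)


v = np.array([5.0, 1.0, 0.0, 2.0])    # observed term weights of one document
Wh = np.array([1.0, 1.2, 0.1, 2.0])   # reconstruction from the topics

# Only the large error on the first term survives a small threshold.
print(soft_threshold_residuals(v, Wh, lambda_=0.5))
# A large lambda_ drives every residual to zero, as if no residuals were used.
print(soft_threshold_residuals(v, Wh, lambda_=10.0))
```

Under this assumption the residuals absorb only errors larger than `lambda_`, so outlying terms are kept out of the topics while small errors are still explained by `W` and `h`.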
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 12.4 s, sys: 1.08 s, total: 13.5 s\n",
"Wall time: 13.7 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"nmf = Nmf(\n",
"    corpus=train_corpus,\n",
"    chunksize=1000,\n",
"    num_topics=5,\n",
"    id2word=dictionary,\n",
"    passes=5,\n",
"    eval_every=10,\n",
"    minimum_probability=0,\n",
"    random_state=42,\n",
"    use_r=True,\n",
"    lambda_=1000,\n",
"    kappa=1,\n",
"    sparse_coef=3\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Topics"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(0,\n",
" '0.035*\"god\" + 0.030*\"atheist\" + 0.021*\"believ\" + 0.020*\"exist\" + 0.019*\"atheism\" + 0.016*\"religion\" + 0.013*\"christian\" + 0.013*\"religi\" + 0.013*\"peopl\" + 0.012*\"argument\"'),\n",
" (1,\n",
" '0.055*\"imag\" + 0.054*\"jpeg\" + 0.033*\"file\" + 0.024*\"gif\" + 0.021*\"color\" + 0.019*\"format\" + 0.015*\"program\" + 0.014*\"version\" + 0.013*\"bit\" + 0.012*\"us\"'),\n",
" (2,\n",
" '0.053*\"space\" + 0.034*\"launch\" + 0.024*\"satellit\" + 0.017*\"nasa\" + 0.016*\"orbit\" + 0.013*\"year\" + 0.012*\"mission\" + 0.011*\"data\" + 0.010*\"commerci\" + 0.010*\"market\"'),\n",
" (3,\n",
" '0.022*\"armenian\" + 0.021*\"peopl\" + 0.020*\"said\" + 0.018*\"know\" + 0.011*\"sai\" + 0.011*\"went\" + 0.010*\"come\" + 0.010*\"like\" + 0.010*\"apart\" + 0.009*\"azerbaijani\"'),\n",
" (4,\n",
" '0.024*\"graphic\" + 0.017*\"pub\" + 0.015*\"mail\" + 0.013*\"data\" + 0.013*\"ftp\" + 0.012*\"send\" + 0.011*\"imag\" + 0.011*\"rai\" + 0.010*\"object\" + 0.010*\"com\"')]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nmf.show_topics()"
]
},
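The topic strings in the output above pair each word with a weight. A rough sketch of how such a string can be built from one column of a topic-word matrix follows; `format_topic` is a hypothetical helper, and normalizing the column so that it sums to 1 is an assumption about how the weights are scaled, not a statement about gensim's internals.

```python
import numpy as np


def format_topic(weights, id2word, topn=3):
    """Render the `topn` heaviest words of one topic as 'weight*"word" + ...'."""
    probs = weights / weights.sum()       # scale the column so it sums to 1
    top_ids = np.argsort(-probs)[:topn]   # ids of the largest weights, descending
    return " + ".join('%.3f*"%s"' % (probs[i], id2word[i]) for i in top_ids)


id2word = {0: "space", 1: "launch", 2: "orbit", 3: "jpeg"}
topic = np.array([5.0, 3.0, 1.5, 0.5])

print(format_topic(topic, id2word))
```

With the toy weights above, the three heaviest words are "space", "launch" and "orbit", in that order.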
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Coherence"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-1.6698708891486376"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CoherenceModel(\n",
" model=nmf,\n",
" corpus=test_corpus,\n",
" coherence='u_mass'\n",
").get_coherence()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Perplexity"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def perplexity(model, corpus):\n",
"    W = model.get_topics().T\n",
"\n",
"    H = np.zeros((W.shape[1], len(corpus)))\n",
"    for bow_id, bow in enumerate(corpus):\n",
"        for topic_id, proba in model[bow]:\n",
"            H[topic_id, bow_id] = proba\n",
"\n",
"    dense_corpus = matutils.corpus2dense(corpus, W.shape[0])\n",
"\n",
"    # Take the log only where W.dot(H) > 0; the remaining entries stay 0 instead of uninitialized.\n",
"    WH = W.dot(H)\n",
"    log_WH = np.log(WH, out=np.zeros_like(WH), where=WH > 0)\n",
"\n",
"    return np.exp(-(log_WH * dense_corpus).sum() / dense_corpus.sum())\n",
"\n",
"perplexity(nmf, test_corpus)"
]
}
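As a sanity check of the perplexity formula in the cell above, here is a toy case that can be verified by hand. The matrices are made up for illustration: every entry of `W @ H` equals 0.25, i.e. each document is modeled as uniform over four terms, so the perplexity should come out at about 4 regardless of the term counts.

```python
import numpy as np

# Toy topic-word matrix W (terms x topics); each column sums to 1.
W = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 0.5],
              [0.0, 0.5]])
# Toy document-topic matrix H (topics x documents); each column sums to 1.
H = np.array([[0.5, 0.5],
              [0.5, 0.5]])
# Term counts of two documents (terms x documents).
V = np.array([[2.0, 1.0],
              [0.0, 1.0],
              [1.0, 3.0],
              [1.0, 0.0]])

WH = W @ H                                # every entry is 0.25
log_likelihood = (V * np.log(WH)).sum()   # count-weighted log-probabilities
perplexity = np.exp(-log_likelihood / V.sum())

print(perplexity)
```

A model that spreads probability evenly over `n` terms has perplexity `n`; lower values mean the model concentrates probability on the terms that actually occur.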
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".py",
"format_name": "percent",
"format_version": "1.1",
"jupytext_version": "0.8.3"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Binary file added docs/notebooks/stars_scaled.jpg