Document accessing model's vocabulary #2661

Merged (2 commits, Nov 1, 2019)
@@ -116,7 +116,7 @@
},
"outputs": [],
"source": [
- "lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation\ncorpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi"
+ "lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation\ncorpus_lsi = lsi_model[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi"
]
},
{
@@ -134,7 +134,7 @@
},
"outputs": [],
"source": [
- "lsi.print_topics(2)"
+ "lsi_model.print_topics(2)"
]
},
{
@@ -170,7 +170,7 @@
},
"outputs": [],
"source": [
- "lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...\nlsi = models.LsiModel.load('/tmp/model.lsi')"
+ "import os\nimport tempfile\n\nwith tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:\n    lsi_model.save(tmp.name)  # same for tfidf, lda, ...\n\nloaded_lsi_model = models.LsiModel.load(tmp.name)\n\nos.unlink(tmp.name)"
]
},
{
@@ -208,7 +208,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.5"
+ "version": "3.6.8"
}
},
"nbformat": 4,
16 changes: 11 additions & 5 deletions docs/src/auto_examples/core/run_topics_and_transformations.py
@@ -126,15 +126,15 @@
#
# Transformations can also be serialized, one on top of another, in a sort of chain:

- lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
- corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
+ lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
+ corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

###############################################################################
# Here we transformed our Tf-Idf corpus via `Latent Semantic Indexing <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
# into a latent 2-D space (2-D because we set ``num_topics=2``). Now you're probably wondering: what do these two latent
# dimensions stand for? Let's inspect with :func:`models.LsiModel.print_topics`:

- lsi.print_topics(2)
+ lsi_model.print_topics(2)

###############################################################################
# (the topics are printed to log -- see the note at the top of this page about activating
@@ -152,9 +152,15 @@

###############################################################################
# Model persistency is achieved with the :func:`save` and :func:`load` functions:
+ import os
+ import tempfile
+
- lsi.save('/tmp/model.lsi')  # same for tfidf, lda, ...
- lsi = models.LsiModel.load('/tmp/model.lsi')
+ with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
+     lsi_model.save(tmp.name)  # same for tfidf, lda, ...
+
+ loaded_lsi_model = models.LsiModel.load(tmp.name)
+
+ os.unlink(tmp.name)
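The hunk above replaces a hard-coded `/tmp/model.lsi` path with a `tempfile`-managed one. A minimal, self-contained sketch of the same save/load round-trip follows — with stdlib `pickle` standing in for gensim's `save()`/`load()`, since the model class itself isn't needed to show the pattern:

```python
import os
import pickle
import tempfile

# A stand-in "model" -- pickle replaces gensim's save()/load() here, purely
# for illustration; the PR itself uses models.LsiModel.save()/load().
model = {"num_topics": 2, "id2word": {0: "human", 1: "computer"}}

# delete=False keeps the file around after the handle closes, which is what
# lets a separate load() call reopen it by name -- and, unlike a fixed /tmp
# path, this also works on platforms without /tmp.
with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
    pickle.dump(model, tmp)

with open(tmp.name, 'rb') as f:
    loaded_model = pickle.load(f)

os.unlink(tmp.name)  # clean up the temporary file

print(loaded_model == model)  # prints: True
```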

###############################################################################
# The next question might be: just how exactly similar are those documents to each other?
@@ -1 +1 @@
- 7f6d3084a74333f89c5c6d06b1cc74fb
+ 844d2cd8ea4d13801165b3af2aecde49
40 changes: 23 additions & 17 deletions docs/src/auto_examples/core/run_topics_and_transformations.rst
@@ -205,8 +205,8 @@ Transformations can also be serialized, one on top of another, in a sort of chai
.. code-block:: default


-     lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
-     corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
+     lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
+     corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi



@@ -222,7 +222,7 @@ dimensions stand for? Let's inspect with :func:`models.LsiModel.print_topics`:
.. code-block:: default


-     lsi.print_topics(2)
+     lsi_model.print_topics(2)



@@ -257,15 +257,15 @@ remaining four documents to the first topic:

.. code-block:: none

- [(0, 0.06600783396090373), (1, -0.5200703306361856)] Human machine interface for lab abc computer applications
- [(0, 0.19667592859142588), (1, -0.7609563167700043)] A survey of user opinion of computer system response time
- [(0, 0.08992639972446417), (1, -0.7241860626752514)] The EPS user interface management system
- [(0, 0.07585847652178135), (1, -0.6320551586003438)] System and human system engineering testing of EPS
- [(0, 0.1015029918498023), (1, -0.573730848300295)] Relation of user perceived response time to error measurement
- [(0, 0.7032108939378311), (1, 0.16115180214025807)] The generation of random binary unordered trees
- [(0, 0.8774787673119832), (1, 0.16758906864659448)] The intersection graph of paths in trees
- [(0, 0.9098624686818579), (1, 0.1408655362871908)] Graph minors IV Widths of trees and well quasi ordering
- [(0, 0.6165825350569284), (1, -0.05392907566389287)] Graph minors A survey
+ [(0, 0.06600783396090627), (1, -0.520070330636184)] Human machine interface for lab abc computer applications
+ [(0, 0.1966759285914279), (1, -0.760956316770005)] A survey of user opinion of computer system response time
+ [(0, 0.08992639972446735), (1, -0.7241860626752503)] The EPS user interface management system
+ [(0, 0.07585847652178428), (1, -0.6320551586003422)] System and human system engineering testing of EPS
+ [(0, 0.10150299184980327), (1, -0.5737308483002963)] Relation of user perceived response time to error measurement
+ [(0, 0.7032108939378309), (1, 0.16115180214026148)] The generation of random binary unordered trees
+ [(0, 0.8774787673119828), (1, 0.16758906864659825)] The intersection graph of paths in trees
+ [(0, 0.9098624686818573), (1, 0.14086553628719417)] Graph minors IV Widths of trees and well quasi ordering
+ [(0, 0.6165825350569281), (1, -0.053929075663891594)] Graph minors A survey



@@ -274,9 +274,15 @@ Model persistency is achieved with the :func:`save` and :func:`load` functions:

.. code-block:: default

+     import os
+     import tempfile
+
-     lsi.save('/tmp/model.lsi')  # same for tfidf, lda, ...
-     lsi = models.LsiModel.load('/tmp/model.lsi')
+     with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
+         lsi_model.save(tmp.name)  # same for tfidf, lda, ...
+
+     loaded_lsi_model = models.LsiModel.load(tmp.name)
+
+     os.unlink(tmp.name)



@@ -429,17 +435,17 @@ References

.. code-block:: none

-     /Volumes/work/workspace/gensim_misha/docs/src/gallery/core/run_topics_and_transformations.py:287: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
+     /home/misha/git/gensim/docs/src/gallery/core/run_topics_and_transformations.py:293: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
plt.show()




.. rst-class:: sphx-glr-timing

- **Total running time of the script:** ( 0 minutes 0.743 seconds)
+ **Total running time of the script:** ( 0 minutes 0.844 seconds)

- **Estimated memory usage:** 7 MB
+ **Estimated memory usage:** 44 MB


.. _sphx_glr_download_auto_examples_core_run_topics_and_transformations.py:
8 changes: 4 additions & 4 deletions docs/src/auto_examples/core/sg_execution_times.rst
@@ -5,9 +5,9 @@

Computation times
=================
- **00:02.671** total execution time for **auto_examples_core** files:
+ **00:00.844** total execution time for **auto_examples_core** files:

- - **00:01.265**: :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` (``run_core_concepts.py``)
- - **00:00.743**: :ref:`sphx_glr_auto_examples_core_run_topics_and_transformations.py` (``run_topics_and_transformations.py``)
- - **00:00.663**: :ref:`sphx_glr_auto_examples_core_run_similarity_queries.py` (``run_similarity_queries.py``)
+ - **00:00.844**: :ref:`sphx_glr_auto_examples_core_run_topics_and_transformations.py` (``run_topics_and_transformations.py``)
+ - **00:00.000**: :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` (``run_core_concepts.py``)
+ - **00:00.000**: :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``)
+ - **00:00.000**: :ref:`sphx_glr_auto_examples_core_run_similarity_queries.py` (``run_similarity_queries.py``)
4 changes: 2 additions & 2 deletions docs/src/auto_examples/index.rst
@@ -452,13 +452,13 @@ Blog posts, tutorial videos, hackathons and other useful Gensim resources, from

.. container:: sphx-glr-download

-         :download:`Download all examples in Python source code: auto_examples_python.zip <//Volumes/work/workspace/gensim_misha/docs/src/auto_examples/auto_examples_python.zip>`
+         :download:`Download all examples in Python source code: auto_examples_python.zip <//home/misha/git/gensim/docs/src/auto_examples/auto_examples_python.zip>`



.. container:: sphx-glr-download

-         :download:`Download all examples in Jupyter notebooks: auto_examples_jupyter.zip <//Volumes/work/workspace/gensim_misha/docs/src/auto_examples/auto_examples_jupyter.zip>`
+         :download:`Download all examples in Jupyter notebooks: auto_examples_jupyter.zip <//home/misha/git/gensim/docs/src/auto_examples/auto_examples_jupyter.zip>`


.. only:: html
40 changes: 38 additions & 2 deletions docs/src/auto_examples/tutorials/run_word2vec.ipynb
@@ -54,6 +54,24 @@
"import gensim.downloader as api\nwv = api.load('word2vec-google-news-300')"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A common operation is to retrieve the vocabulary of a model. That is trivial:\n\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "for i, word in enumerate(wv.vocab):\n    if i == 10:\n        break\n    print(word)"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -87,7 +105,7 @@
},
"outputs": [],
"source": [
- "try:\n    vec_weapon = wv['cameroon']\nexcept KeyError:\n    print(\"The word 'cameroon' does not appear in this model\")"
+ "try:\n    vec_cameroon = wv['cameroon']\nexcept KeyError:\n    print(\"The word 'cameroon' does not appear in this model\")"
]
},
{
@@ -198,6 +216,24 @@
"vec_king = model.wv['king']"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Retrieving the vocabulary works the same way:\n\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "for i, word in enumerate(model.wv.vocab):\n    if i == 10:\n        break\n    print(word)"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -548,7 +584,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.5"
+ "version": "3.6.8"
}
},
"nbformat": 4,
16 changes: 15 additions & 1 deletion docs/src/auto_examples/tutorials/run_word2vec.py
@@ -134,6 +134,13 @@
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

+ ###############################################################################
+ # A common operation is to retrieve the vocabulary of a model. That is trivial:
+ for i, word in enumerate(wv.vocab):
+     if i == 10:
+         break
+     print(word)
+
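The added hunk caps iteration at ten words with a manual `enumerate`/`break`; `itertools.islice` expresses the same idea more directly. In the sketch below a plain dict (a hypothetical vocabulary) stands in for `wv.vocab` — in gensim 3.x, `KeyedVectors.vocab` maps each word to a `Vocab` object, so iterating it yields the words:

```python
from itertools import islice

# Hypothetical stand-in for wv.vocab: a mapping whose keys are the words.
vocab = {w: None for w in ['human', 'interface', 'computer', 'survey',
                           'user', 'system', 'response', 'time', 'eps',
                           'trees', 'graph', 'minors']}

# islice(vocab, 10) yields the first 10 keys, no counter or break needed.
for word in islice(vocab, 10):
    print(word)
```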
###############################################################################
# We can easily obtain vectors for terms the model is familiar with:
#
@@ -145,7 +152,7 @@
# out the FastText model.
#
try:
-     vec_weapon = wv['cameroon']
+     vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

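The rename above (`vec_weapon` → `vec_cameroon`) fixes a copy-paste slip in the out-of-vocabulary example. A self-contained sketch of the same `KeyError` pattern, with a hypothetical two-word dict standing in for the loaded `KeyedVectors`:

```python
# Hypothetical two-word "model" -- real KeyedVectors lookups also raise
# KeyError for out-of-vocabulary words, which is what we catch here.
wv = {'king': [0.1, 0.2], 'queen': [0.3, 0.4]}

def lookup(word):
    """Return the vector for word, or None (with a message) if it's missing."""
    try:
        return wv[word]
    except KeyError:
        print("The word %r does not appear in this model" % word)
        return None

vec_king = lookup('king')          # present: returns the vector
vec_cameroon = lookup('cameroon')  # absent: prints a message, returns None
```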
@@ -220,6 +227,13 @@ def __iter__(self):
#
vec_king = model.wv['king']

+ ###############################################################################
+ # Retrieving the vocabulary works the same way:
+ for i, word in enumerate(model.wv.vocab):
+     if i == 10:
+         break
+     print(word)
+
###############################################################################
# Storing and loading models
# --------------------------
2 changes: 1 addition & 1 deletion docs/src/auto_examples/tutorials/run_word2vec.py.md5
@@ -1 +1 @@
- 776cde9e7148f94e2cbff78b00854edd
+ 0d41144f740af100c7576b2284b03d0a