diff --git a/CHANGELOG.md b/CHANGELOG.md index 56eef755ca..2f54207056 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,7 @@ This release contains a major refactoring. * No more wheels for x32 platforms (if you need x32 binaries, please build them yourself). (__[menshikh-iv](https://github.com/menshikh-iv)__, [#6](https://github.com/RaRe-Technologies/gensim-wheels/pull/6)) * Speed up random number generation in word2vec model (PR [#2864](https://github.com/RaRe-Technologies/gensim/pull/2864), __[@zygm0nt](https://github.com/zygm0nt)__) +* Remove Keras dependency (PR [#2937](https://github.com/RaRe-Technologies/gensim/pull/2937), __[@piskvorky](https://github.com/piskvorky)__) ### :books: Tutorial and doc improvements diff --git a/docs/notebooks/keras_wrapper.ipynb b/docs/notebooks/keras_wrapper.ipynb deleted file mode 100644 index a99ed262b5..0000000000 --- a/docs/notebooks/keras_wrapper.ipynb +++ /dev/null @@ -1,273 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Using wrappers for Gensim models for working with Keras" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This tutorial is about using gensim models as a part of your Keras models." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The wrappers available (as of now) are :\n", - "* Word2Vec (uses the function ```get_keras_embedding``` defined in ```gensim.models.keyedvectors```)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Word2Vec" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "collapsed": true - }, - "source": [ - "#### Integration with Keras : 20NewsGroups Task" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To see how Gensim's Word2Vec model could be integrated with Keras while dealing with a supervised (classification) task, we consider the [20NewsGroups](qwone.com/~jason/20Newsgroups/) task. Here, we take a smaller version of this data by taking a subset of the documents to be classified. \n", - "\n", - "First, we import the necessary modules." - ] - }, - { - "cell_type": "code", - "execution_count": 163, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import sys\n", - "import keras\n", - "import numpy as np\n", - "\n", - "from gensim.models import word2vec\n", - "\n", - "from keras.models import Model\n", - "from keras.preprocessing.text import Tokenizer, text_to_word_sequence\n", - "from keras.preprocessing.sequence import pad_sequences\n", - "from keras.utils.np_utils import to_categorical\n", - "from keras.layers import Input, Dense, Flatten\n", - "from keras.layers import Conv1D, MaxPooling1D\n", - "\n", - "from sklearn.datasets import fetch_20newsgroups" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We first load the training data.\n", - "Then, we format our text samples and labels into tensors that can be fed into a neural network. 
To do this, we rely on Keras utilities `keras.preprocessing.text.Tokenizer`, `keras.preprocessing.sequence.pad_sequences` and `from keras.utils.np_utils import to_categorical`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 164, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'comp.graphics', 'sci.space'])\n", - "\n", - "MAX_SEQUENCE_LENGTH = 1000\n", - "\n", - "# Vectorize the text samples into a 2D integer tensor\n", - "tokenizer = Tokenizer()\n", - "tokenizer.fit_on_texts(dataset.data)\n", - "sequences = tokenizer.texts_to_sequences(dataset.data)\n", - "\n", - "x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n", - "y_train = to_categorical(np.asarray(dataset.target))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we train a Word2Vec model from the documents we have.\n", - "From the word2vec model we construct the embedding layer to be used in our actual Keras model.\n", - "\n", - "The Keras tokenizer object maintains an internal vocabulary (a token to index mapping), which might be different from the vocabulary gensim builds when training the word2vec model. To align the vocabularies we pass the Keras tokenizer vocabulary to the `get_keras_embedding` function" - ] - }, - { - "cell_type": "code", - "execution_count": 165, - "metadata": {}, - "outputs": [], - "source": [ - "keras_w2v = word2vec.Word2Vec([text_to_word_sequence(doc) for doc in dataset.data],min_count=0)\n", - "embedding_layer = keras_w2v.wv.get_keras_embedding(word_index = tokenizer.word_index,train_embeddings=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we create a small 1D convnet to solve our classification problem." 
- ] - }, - { - "cell_type": "code", - "execution_count": 166, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train on 1491 samples, validate on 166 samples\n", - "Epoch 1/3\n", - "1491/1491 [==============================] - 16s 11ms/step - loss: 1.0239 - acc: 0.5017 - val_loss: 0.9306 - val_acc: 0.5663\n", - "Epoch 2/3\n", - "1491/1491 [==============================] - 15s 10ms/step - loss: 0.6941 - acc: 0.7015 - val_loss: 0.6612 - val_acc: 0.7048\n", - "Epoch 3/3\n", - "1491/1491 [==============================] - 15s 10ms/step - loss: 0.4270 - acc: 0.8404 - val_loss: 0.5119 - val_acc: 0.7892\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 166, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')\n", - "embedded_sequences = embedding_layer(sequence_input)\n", - "x = Conv1D(128, 5, activation='relu')(embedded_sequences)\n", - "x = MaxPooling1D(5)(x)\n", - "x = Conv1D(128, 5, activation='relu')(x)\n", - "x = MaxPooling1D(5)(x)\n", - "x = Conv1D(128, 5, activation='relu')(x)\n", - "x = MaxPooling1D(35)(x) # global max pooling\n", - "x = Flatten()(x)\n", - "x = Dense(128, activation='relu')(x)\n", - "preds = Dense(y_train.shape[1], activation='softmax')(x)\n", - "\n", - "model = Model(sequence_input, preds)\n", - "model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])\n", - "\n", - "model.fit(x_train, y_train, epochs=3, validation_split= 0.1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We see that the model learns to reaches a reasonable accuracy, considering the small dataset.\n", - "\n", - "Alternatively, we can use embeddings pretrained on a different larger corpus (Glove), to see if performance impoves" - ] - }, - { - "cell_type": "code", - "execution_count": 167, - "metadata": {}, - "outputs": [], - "source": [ - "import gensim.downloader as api\n", - "\n", - "glove_embeddings = api.load(\"glove-wiki-gigaword-100\")" - ] - }, - { - "cell_type": "code", - "execution_count": 168, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train on 1491 samples, validate on 166 samples\n", - "Epoch 1/3\n", - "1491/1491 [==============================] - 17s 11ms/step - loss: 1.0564 - acc: 0.4514 - val_loss: 0.9083 - val_acc: 0.4578\n", - "Epoch 2/3\n", - "1491/1491 [==============================] - 16s 11ms/step - loss: 0.5122 - acc: 0.7901 - val_loss: 0.3278 - val_acc: 0.8855\n", - "Epoch 3/3\n", - "1491/1491 [==============================] - 16s 10ms/step - loss: 0.0902 - acc: 0.9718 - val_loss: 0.2187 - val_acc: 0.9398\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 168, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "glove_embedding_layer = glove_embeddings.get_keras_embedding(word_index = tokenizer.word_index,train_embeddings=True)\n", - "\n", - "embedded_sequences = glove_embedding_layer(sequence_input)\n", - "x = Conv1D(128, 5, activation='relu')(embedded_sequences)\n", - "x = MaxPooling1D(5)(x)\n", - "x = Conv1D(128, 5, activation='relu')(x)\n", - "x = MaxPooling1D(5)(x)\n", - "x = Conv1D(128, 5, activation='relu')(x)\n", - "x = MaxPooling1D(35)(x) # global max pooling\n", - "x = Flatten()(x)\n", - "x = Dense(128, activation='relu')(x)\n", - "preds = Dense(y_train.shape[1], activation='softmax')(x)\n", - "\n", - "model = Model(sequence_input, 
preds)\n", - "model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])\n", - "\n", - "model.fit(x_train, y_train, epochs=3, validation_split= 0.1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We see that pretrained embeddings result in a faster convergence" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - } - }, - "nbformat": 4, - "nbformat_minor": 1 -} diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py index 520536bd65..e42c46cc7c 100644 --- a/gensim/models/keyedvectors.py +++ b/gensim/models/keyedvectors.py @@ -1588,47 +1588,8 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut self.vectors_lockf[self.get_index(word)] = lockf # lock-factor: 0.0=no changes logger.info("merged %d vectors into %s matrix from %s", overlap_count, self.wv.vectors.shape, fname) - def get_keras_embedding(self, train_embeddings=False): - """Get a Keras 'Embedding' layer with weights set as the Word2Vec model's learned word embeddings. - - Parameters - ---------- - train_embeddings : bool - If False, the weights are frozen and stopped from being updated. - If True, the weights can/will be further trained/updated. - - Returns - ------- - `keras.layers.Embedding` - Embedding layer. - - Raises - ------ - ImportError - If `Keras `_ not installed. - - Warnings - -------- - Current method work only if `Keras `_ installed. - - """ - try: - from keras.layers import Embedding - except ImportError: - raise ImportError("Please install Keras to use this function") - weights = self.vectors - - # set `trainable` as `False` to use the pretrained word embedding - # No extra mem usage here as `Embedding` layer doesn't create any new matrix for weights - layer = Embedding( - input_dim=weights.shape[0], output_dim=weights.shape[1], - weights=[weights], trainable=train_embeddings - ) - return layer - def _upconvert_old_d2vkv(self): """Convert a deserialized older Doc2VecKeyedVectors instance to latest generic KeyedVectors""" - self.vocab = self.doctags self._upconvert_old_vocab() # destroys 'vocab', fills 'key_to_index' & 'extras' for k in self.key_to_index.keys(): @@ -1636,7 +1597,7 @@ def _upconvert_old_d2vkv(self): true_index = old_offset + self.max_rawint + 1 self.key_to_index[k] = true_index del self.expandos['offset'] # no longer needed - if(self.max_rawint > -1): + if self.max_rawint > -1: self.index_to_key = list(range(0, self.max_rawint + 1)) + self.offset2doctag else: self.index_to_key = self.offset2doctag diff --git a/gensim/test/test_keras_integration.py b/gensim/test/test_keras_integration.py deleted file mode 100644 index 6dbe3fa4e6..0000000000 --- a/gensim/test/test_keras_integration.py +++ /dev/null @@ -1,152 +0,0 @@ -import unittest - -import numpy as np - -try: - from sklearn.datasets import fetch_20newsgroups -except ImportError: - raise unittest.SkipTest("Test requires sklearn to be installed, which is not available") - -try: - import keras - from keras.engine import Input - from keras.models import Model - from keras.layers.merge import dot - from keras.preprocessing.text import Tokenizer - from keras.preprocessing.sequence import pad_sequences - from keras.utils.np_utils import 
to_categorical - from keras.layers import Dense, Flatten - from keras.layers import Conv1D, MaxPooling1D -except ImportError: - raise unittest.SkipTest("Test requires Keras to be installed, which is not available") - -from gensim.test.utils import common_texts -from gensim.models import word2vec - - -@unittest.skip("FIXME strange Keras errors in py3.7+") -class TestKerasWord2VecWrapper(unittest.TestCase): - def setUp(self): - self.model_cos_sim = word2vec.Word2Vec(common_texts, vector_size=100, min_count=1, hs=1) - self.model_twenty_ng = word2vec.Word2Vec(min_count=1) - - def testWord2VecTraining(self): - """ - Test word2vec training. - """ - model = self.model_cos_sim - self.assertTrue(model.wv.vectors.shape == (len(model.wv), 100)) - self.assertTrue(model.syn1.shape == (len(model.wv), 100)) - sims = model.wv.most_similar('graph', topn=10) - # self.assertTrue(sims[0][0] == 'trees', sims) # most similar - - # test querying for "most similar" by vector - graph_vector = model.wv.get_vector('graph', norm=True) - sims2 = model.wv.most_similar(positive=[graph_vector], topn=11) - sims2 = [(w, sim) for w, sim in sims2 if w != 'graph'] # ignore 'graph' itself - self.assertEqual(sims, sims2) - - def testEmbeddingLayerCosineSim(self): - """ - Test Keras 'Embedding' layer returned by 'get_embedding_layer' function for a simple word similarity task. - """ - keras_w2v_model = self.model_cos_sim - keras_w2v_model_wv = keras_w2v_model.wv - - embedding_layer = keras_w2v_model_wv.get_keras_embedding() - - input_a = Input(shape=(1,), dtype='int32', name='input_a') - input_b = Input(shape=(1,), dtype='int32', name='input_b') - embedding_a = embedding_layer(input_a) - embedding_b = embedding_layer(input_b) - similarity = dot([embedding_a, embedding_b], axes=2, normalize=True) - - model = Model(inputs=[input_a, input_b], outputs=similarity) - model.compile(optimizer='sgd', loss='mse') - - word_a = 'graph' - word_b = 'trees' - output = model.predict([ - np.asarray([keras_w2v_model.wv.get_index(word_a)]), - np.asarray([keras_w2v_model.wv.get_index(word_b)]) - ]) - # output is the cosine distance between the two words (as a similarity measure) - - self.assertTrue(type(output[0][0][0]) == np.float32) # verify that a float is returned - - def testEmbeddingLayer20NewsGroup(self): - """ - Test Keras 'Embedding' layer returned by 'get_embedding_layer' function - for a smaller version of the 20NewsGroup classification problem. 
- """ - MAX_SEQUENCE_LENGTH = 1000 - - # Prepare text samples and their labels - - # Processing text dataset - texts = [] # list of text samples - texts_w2v = [] # used to train the word embeddings - labels = [] # list of label ids - - data = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'comp.graphics', 'sci.space']) - for index in range(len(data)): - label_id = data.target[index] - file_data = data.data[index] - i = file_data.find('\n\n') # skip header - if i > 0: - file_data = file_data[i:] - try: - curr_str = str(file_data) - sentence_list = curr_str.split('\n') - for sentence in sentence_list: - sentence = (sentence.strip()).lower() - texts.append(sentence) - texts_w2v.append(sentence.split(' ')) - labels.append(label_id) - except Exception: - pass - - # Vectorize the text samples into a 2D integer tensor - tokenizer = Tokenizer() - tokenizer.fit_on_texts(texts) - sequences = tokenizer.texts_to_sequences(texts) - - # word_index = tokenizer.word_index - data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) - labels = to_categorical(np.asarray(labels)) - - x_train = data - y_train = labels - - # prepare the embedding layer using the wrapper - keras_w2v = self.model_twenty_ng - keras_w2v.build_vocab(texts_w2v) - keras_w2v.train(texts, total_examples=keras_w2v.corpus_count, epochs=keras_w2v.epochs) - keras_w2v_wv = keras_w2v.wv - embedding_layer = keras_w2v_wv.get_keras_embedding() - - # create a 1D convnet to solve our classification task - sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') - embedded_sequences = embedding_layer(sequence_input) - x = Conv1D(128, 5, activation='relu')(embedded_sequences) - x = MaxPooling1D(5)(x) - x = Conv1D(128, 5, activation='relu')(x) - x = MaxPooling1D(5)(x) - x = Conv1D(128, 5, activation='relu')(x) - x = MaxPooling1D(35)(x) # global max pooling - x = Flatten()(x) - x = Dense(128, activation='relu')(x) - preds = Dense(y_train.shape[1], activation='softmax')(x) - - model = Model(sequence_input, preds) - model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc']) - fit_ret_val = model.fit(x_train, y_train, epochs=1) - - # verify the type of the object returned after training - # value returned is a `History` instance. - # Its `history` attribute contains all information collected during training. - self.assertTrue(type(fit_ret_val) == keras.callbacks.History) - - -if __name__ == '__main__': - unittest.main() diff --git a/setup.py b/setup.py index 36918726d1..29984959e4 100644 --- a/setup.py +++ b/setup.py @@ -280,13 +280,6 @@ def run(self): # Add additional requirements for testing on Linux that are skipped on Windows. linux_testenv = core_testenv[:] + visdom_req + ['pyemd', ] -if sys.version_info >= (3, 7): - # HACK: Installing tensorflow causes a segfault in Travis on py3.6. Other Pythons work – a mystery. - # See https://github.com/RaRe-Technologies/gensim/pull/2814#issuecomment-621477948 - linux_testenv += [ - 'tensorflow', - 'keras==2.3.1', - ] # Skip problematic/uninstallable packages (& thus related conditional tests) in Windows builds. # We still test them in Linux via Travis, see linux_testenv above.