How to remove a word completely from a Word2Vec model in gensim?












Given a model, e.g.



from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]

texts = [d.lower().split() for d in documents]

w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)


It's possible to remove the word from the w2v vocabulary, e.g.



# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433 0.08862179 0.08601206 0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"


But when we query for the most similar words to some other word after deleting graph, the word graph still shows up, e.g.



>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]


How to remove a word completely from a Word2Vec model in gensim?





Updated



To answer @vumaasha's comment:




could you give some details as to why you want to delete a word





  • Let's say my universe of words is all the words in the corpus, so that the model learns the dense relations between all of them.

  • But when I want to generate similar words, they should only come from a subset of domain-specific words.

  • It's possible to generate more than enough candidates from .most_similar() and then filter them, but if the space of the specific domain is small, I might be looking for a word that's ranked 1000th most similar, which is inefficient.

  • It would be better if the words were totally removed from the word vectors, so that .most_similar() can't return words outside of the specific domain.
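
For reference, the "over-generate, then filter" workaround from the third bullet might look roughly like the sketch below. The helper name filter_similar and the sample candidate list are made up for illustration; in practice the candidates would come from .most_similar() called with a large topn.

```python
def filter_similar(candidates, allowed, topn=10):
    """Keep only in-domain words from an over-generated similarity list."""
    kept = [(word, score) for word, score in candidates if word in allowed]
    return kept[:topn]


# Hypothetical stand-in for the output of w2v_model.most_similar('binary', topn=...).
candidates = [("unordered", 0.87), ("ordering", 0.85), ("graph", 0.68), ("trees", 0.19)]
domain_words = {"graph", "trees"}

result = filter_similar(candidates, domain_words, topn=2)
print(result)
```

The cost is that the full candidate list still has to be ranked first, which is the inefficiency described above.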











  • could you give some details as to why you want to delete a word

    – vumaasha
    Feb 23 '18 at 6:11











  • Sorry, the motivation to delete a word is too long to type as a comment; see the updated question. It shouldn't be hard to just remove a word totally from the embedding matrix, but there seems to be something I'm missing and I'm not sure how it can be removed. Maybe it's not possible to remove because the similarity is already sort of hard-baked into the Huffman tree per word.

    – alvas
    Feb 23 '18 at 6:18













  • do you have a complete list of domain specific keywords that you want to get in similarity results?

    – vumaasha
    Feb 23 '18 at 6:34











  • Yes, I do. But please note that removing them before training would have removed the relations of the words outside of the domain, so that's not desirable. They have to be removed after training. Think of the model as a pre-trained model and it's meant to adapt to a domain but I'm not implying full-blown transfer learning here.

    – alvas
    Feb 23 '18 at 6:40


















python dictionary word2vec gensim del






asked Feb 23 '18 at 5:26, edited Feb 23 '18 at 6:17

alvas













3 Answers
There is no direct way to do what you are looking for, but you are not completely stuck. The most_similar method is implemented in the class WordEmbeddingsKeyedVectors (check the link). You can take a look at this method and modify it to suit your needs.



The lines shown below perform the actual computation of the similar words. Replace the variable limited with the vectors corresponding to the words you are interested in, and you are done.



limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)


Update:



limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]


As this line shows, when restrict_vocab is used it restricts the search to the first restrict_vocab entries of the vocab, which is only meaningful if the vocab is sorted by frequency. If you do not pass restrict_vocab, the whole of self.vectors_norm goes into limited.



The most_similar method calls another method, init_sims, which initializes self.vectors_norm as shown below:



        self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)


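So you can pick the words that you are interested in, compute their normalized vectors, and use those in place of limited. Keep in mind that the indices in best then refer to rows of your subset, so they have to be mapped back to words through your own word list rather than index2word. This should work.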
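
A minimal sketch of that idea, written against plain NumPy arrays rather than gensim itself. The function names unit_rows and most_similar_restricted and the toy data are assumptions made for illustration; with a real model, index2word and vectors would come from the KeyedVectors object.

```python
import numpy as np


def unit_rows(vectors):
    """L2-normalise each row, mirroring the vectors_norm formula above."""
    vectors = np.asarray(vectors, dtype=np.float32)
    return vectors / np.sqrt((vectors ** 2).sum(-1))[..., np.newaxis]


def most_similar_restricted(query, index2word, vectors, allowed, topn=10):
    """Rank only the words in `allowed` by cosine similarity to `query`."""
    word_to_row = {w: i for i, w in enumerate(index2word)}
    normed = unit_rows(vectors)
    # Keep the candidate list explicitly so result indices map back to words.
    candidates = [w for w in index2word if w in allowed and w != query]
    if not candidates:
        return []
    limited = normed[[word_to_row[w] for w in candidates]]
    dists = limited @ normed[word_to_row[query]]
    order = np.argsort(-dists)[:topn]
    return [(candidates[i], float(dists[i])) for i in order]


# Toy data (made up for illustration, not taken from a trained model).
index2word = ["binary", "graph", "trees", "user"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

result = most_similar_restricted("binary", index2word, vectors, {"trees", "user"})
print(result)
```

Here graph is never a candidate even though it is the closest vector to binary, because it is not in the allowed set.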






  • Thanks, but that's not exactly correct; we have to be careful here. limited here depends on restrict_vocab (github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/…), which isn't a list of specified vocabulary but an integer cutoff that limits the most-similar search, see github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/…

    – alvas
    Feb 24 '18 at 3:43






  • check my update in the answer

    – vumaasha
    Feb 24 '18 at 4:03



















Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.



Suppose you only want to keep the top 5000 words in your model.



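import numpy as np

wv = w2v_model.wv
words_to_trim = wv.index2word[5000:]
# In OP's case:
# words_to_trim = ['graph']
ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

# Drop the dictionary entries.
for w in words_to_trim:
    del wv.vocab[w]

# Drop the matching rows and recompute the normalized vectors
# (replace=True overwrites wv.vectors with the normalized ones).
wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
wv.init_sims(replace=True)

# Drop the words from the index list, highest index first.
for i in sorted(ids_to_trim, reverse=True):
    del wv.index2word[i]

# Re-number the survivors so vocab indices match the trimmed matrix
# (needed when the removed words are not at the tail, as in OP's case).
for i, w in enumerate(wv.index2word):
    wv.vocab[w].index = i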


This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.



The advantage of this is that if you write the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
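
To see why deleting only the vocab entry is not enough, here is a toy sketch that uses plain Python and NumPy stand-ins for those attributes. The names, the helper nearest, and the values are made up for illustration and are not gensim objects.

```python
import numpy as np

# Toy stand-ins for the KeyedVectors attributes named above (values made up).
index2word = ["binary", "graph", "trees"]
vocab = {w: i for i, w in enumerate(index2word)}  # word -> row index
vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])


def nearest(query_vec, index2word, vectors):
    """Return the word whose row has the largest dot product with query_vec."""
    return index2word[int(np.argmax(vectors @ query_vec))]


query = np.array([0.6, 0.8])

# Deleting only the dictionary entry leaves the matrix row and word list intact.
del vocab["graph"]
before = nearest(query, index2word, vectors)

# Trimming all three structures together removes the word from the search.
row = index2word.index("graph")
vectors = np.delete(vectors, row, axis=0)
del index2word[row]
vocab = {w: i for i, w in enumerate(index2word)}  # re-index the survivors
after = nearest(query, index2word, vectors)

print(before, after)
```

The search runs over the matrix rows and the index list, so a word only disappears from the results once its row and its index entry are gone too.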






    I wrote a function that removes from a KeyedVectors object all words that are not in a predefined word list.



    import numpy as np

    def restrict_w2v(w2v, restricted_word_set):
        # Make sure vectors_norm is populated before it is read below.
        w2v.init_sims()

        new_vectors = []
        new_vocab = {}
        new_index2entity = []
        new_vectors_norm = []

        for i in range(len(w2v.vocab)):
            word = w2v.index2entity[i]
            vec = w2v.vectors[i]
            vocab = w2v.vocab[word]
            vec_norm = w2v.vectors_norm[i]
            if word in restricted_word_set:
                # Re-number the kept word so its index matches the new matrix row.
                vocab.index = len(new_index2entity)
                new_index2entity.append(word)
                new_vocab[word] = vocab
                new_vectors.append(vec)
                new_vectors_norm.append(vec_norm)

        # Swap in the restricted structures (arrays rather than Python lists).
        w2v.vocab = new_vocab
        w2v.vectors = np.array(new_vectors)
        w2v.index2entity = new_index2entity
        w2v.index2word = new_index2entity
        w2v.vectors_norm = np.array(new_vectors_norm)


    It rewrites all of the word-related attributes of the Word2VecKeyedVectors object.



    Usage:



    from gensim.models import KeyedVectors

    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
    w2v.most_similar("beer")



    [('beers', 0.8409687876701355),

    ('lager', 0.7733745574951172),

    ('Beer', 0.71753990650177),

    ('drinks', 0.668931245803833),

    ('lagers', 0.6570086479187012),

    ('Yuengling_Lager', 0.655455470085144),

    ('microbrew', 0.6534324884414673),

    ('Brooklyn_Lager', 0.6501551866531372),

    ('suds', 0.6497018337249756),

    ('brewed_beer', 0.6490240097045898)]




    restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
    restrict_w2v(w2v, restricted_word_set)
    w2v.most_similar("beer")



    [('lagers', 0.6570085287094116),

    ('wine', 0.6217695474624634),

    ('bash', 0.20583480596542358),

    ('computer', 0.06677375733852386),

    ('python', 0.005948573350906372)]







    zsozso (new contributor)




















      edited Feb 24 '18 at 4:02

























      answered Feb 23 '18 at 14:43









      vumaasha













      • Thanks, but that's not exactly correct; we have to be careful here. limited depends on restrict_vocab (github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/…), which isn't a list of specified vocabulary but an integer that limits how many words the most-similar search covers; see github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/…

        – alvas
        Feb 24 '18 at 3:43






      • Check my update in the answer.

        – vumaasha
        Feb 24 '18 at 4:03
































      0














      Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.



      Suppose you only want to keep the top 5000 words in your model.



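      import numpy as np

      wv = w2v_model.wv
      words_to_trim = wv.index2word[5000:]
      # In OP's case:
      # words_to_trim = ['graph']
      ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

      for w in words_to_trim:
          del wv.vocab[w]

      wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
      # replace=True overwrites wv.vectors with the unit-normalized vectors.
      wv.init_sims(replace=True)

      for i in sorted(ids_to_trim, reverse=True):
          del wv.index2word[i]

      # Re-number the surviving entries so each index matches its new row in wv.vectors.
      # This matters when the trimmed words are not at the tail, as with ['graph'].
      for i, w in enumerate(wv.index2word):
          wv.vocab[w].index = i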


      This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.



      The advantage of this is that if you write the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
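      One detail worth illustrating: when the trimmed words are not at the tail of index2word (for example, just ['graph']), the rows after each deleted row shift up, so the word-to-row mapping has to be rebuilt. The sketch below shows this on toy NumPy data. The function trim and the plain word2index dict are hypothetical stand-ins for the gensim attributes discussed above, not gensim API.

```python
import numpy as np

# Toy stand-ins for wv.index2word and wv.vectors (illustrative only);
# row i of `vectors` belongs to index2word[i].
index2word = ["the", "graph", "trees", "user"]
vectors = np.arange(8, dtype=float).reshape(4, 2)


def trim(index2word, vectors, words_to_trim):
    """Return trimmed copies of the word list, word->row map, and vector matrix."""
    drop = set(words_to_trim)
    keep_rows = [i for i, w in enumerate(index2word) if w not in drop]
    new_index2word = [index2word[i] for i in keep_rows]
    new_vectors = vectors[keep_rows]
    # Rebuild the mapping so every word points at its new row.
    new_word2index = {w: i for i, w in enumerate(new_index2word)}
    return new_index2word, new_word2index, new_vectors


new_index2word, new_word2index, new_vectors = trim(index2word, vectors, ["graph"])
```

      After trimming, "trees" moves from row 2 to row 1; a stale index would silently return the vector of a different word.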














          edited Dec 22 '18 at 23:10

























          answered Dec 22 '18 at 22:52









          Feng Mai























              0














              I wrote a function that removes from a KeyedVectors object every word that is not in a predefined word list.



              import numpy as np

              def restrict_w2v(w2v, restricted_word_set):
                  # Make sure vectors_norm has been computed before indexing into it.
                  w2v.init_sims()

                  new_vectors = []
                  new_vocab = {}
                  new_index2entity = []
                  new_vectors_norm = []

                  for i in range(len(w2v.vocab)):
                      word = w2v.index2entity[i]
                      vec = w2v.vectors[i]
                      vocab = w2v.vocab[word]
                      vec_norm = w2v.vectors_norm[i]
                      if word in restricted_word_set:
                          # Re-number the entry to match its position in the new arrays.
                          vocab.index = len(new_index2entity)
                          new_index2entity.append(word)
                          new_vocab[word] = vocab
                          new_vectors.append(vec)
                          new_vectors_norm.append(vec_norm)

                  w2v.vocab = new_vocab
                  w2v.vectors = np.array(new_vectors)
                  w2v.index2entity = new_index2entity
                  w2v.index2word = new_index2entity
                  w2v.vectors_norm = np.array(new_vectors_norm)


              It rewrites all of the word-related attributes of the Word2VecKeyedVectors object.



              Usage:



              from gensim.models import KeyedVectors

              w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
              w2v.most_similar("beer")



              [('beers', 0.8409687876701355),
               ('lager', 0.7733745574951172),
               ('Beer', 0.71753990650177),
               ('drinks', 0.668931245803833),
               ('lagers', 0.6570086479187012),
               ('Yuengling_Lager', 0.655455470085144),
               ('microbrew', 0.6534324884414673),
               ('Brooklyn_Lager', 0.6501551866531372),
               ('suds', 0.6497018337249756),
               ('brewed_beer', 0.6490240097045898)]




              restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
              restrict_w2v(w2v, restricted_word_set)
              w2v.most_similar("beer")



              [('lagers', 0.6570085287094116),
               ('wine', 0.6217695474624634),
               ('bash', 0.20583480596542358),
               ('computer', 0.06677375733852386),
               ('python', 0.005948573350906372)]
















                  answered Jan 18 at 17:42









                  zsozso





























