Commit 9071cca

[FEATURE] Fix OOV in word2vec (#105)
* [FEATURE] Fix oov in word2vec
* Update CHANGE.txt
1 parent 376d23c

3 files changed: 12 additions, 1 deletion
CHANGE.txt

Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,12 @@
+v0.0.7:
+1. add BERT and pretrained model (luna_bert)
+2. speed up the process in sif
+3. handling OOV in word2vec
+4. add English tutorials
+5. add api docs and prettify tutorials
+6. fix the np.error in gensim_vec.W2V.infer_vector
+7. fix the parameters lost in tokenization
+
 v0.0.6:
 1. dev: add half-pretrained rnn model
 2. important!!!: rename TextTokenizer to PureTextTokenizer, and add a new tokenizer named TextTokenizer (the two have similar but not the same behaviours).

EduNLP/Vector/gensim_vec.py

Lines changed: 2 additions & 1 deletion
@@ -62,7 +62,8 @@ def __call__(self, *words):
             yield self[word]
 
     def __getitem__(self, item):
-        return self.wv[item] if item not in self.constants else np.zeros((self.vector_size,))
+        index = self.key_to_index(item)
+        return self.wv[item] if index not in self.constants.values() else np.zeros((self.vector_size,))
 
     def infer_vector(self, items, agg="mean", *args, **kwargs) -> np.ndarray:
         token_vectors = self.infer_tokens(items, *args, **kwargs)
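
The change above looks up the token's index first and returns a zero vector whenever that index belongs to a reserved constant. Judging from the accompanying test, an out-of-vocabulary word resolves to the same index as "[UNK]" (0), so it now gets zeros rather than reaching `self.wv[item]`. The following is a minimal sketch of that lookup pattern only. `TinyW2V`, the "[PAD]" entry, and the plain dict standing in for the gensim vectors are hypothetical and are not the EduNLP implementation.

```python
import numpy as np


class TinyW2V:
    """Hypothetical stand-in for the wrapper in this diff (not the EduNLP class)."""

    def __init__(self, vectors, vector_size):
        # Assumption: reserved tokens occupy the first indices, matching the
        # test's expectation that "[UNK]" maps to index 0. "[PAD]" is illustrative.
        self.constants = {"[UNK]": 0, "[PAD]": 1}
        self.vector_size = vector_size
        # A plain dict stands in for the trained word vectors.
        self.wv = dict(vectors)

    def key_to_index(self, item):
        if item in self.constants:
            return self.constants[item]
        if item in self.wv:
            # Real words are offset past the reserved indices.
            return list(self.wv).index(item) + len(self.constants)
        # Unknown words fall back to the "[UNK]" index.
        return self.constants["[UNK]"]

    def __getitem__(self, item):
        index = self.key_to_index(item)
        # Reserved tokens and OOV words share reserved indices: return zeros.
        if index in self.constants.values():
            return np.zeros((self.vector_size,))
        return self.wv[item]


w2v = TinyW2V({"apple": np.ones(3)}, vector_size=3)
known = w2v["apple"]    # the stored vector
unknown = w2v["OOV"]    # zero vector instead of a failed lookup
```

Under the earlier membership check (`item not in self.constants`), a word such as "OOV" is not a constant, so the lookup would proceed to the underlying vectors, where the key is missing.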

tests/test_vec/test_vec.py

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ def test_w2v(stem_tokens, tmpdir, method, binary):
     assert w2v.vectors.shape == (len(w2v.wv.vectors) + len(w2v.constants), w2v.vector_size)
     assert w2v.key_to_index("[UNK]") == 0
     assert w2v.key_to_index("OOV") == 0
+    assert np.array_equal(w2v["OOV"], np.zeros((10,)))
 
     t2v = T2V("w2v", filepath=filepath, method=method, binary=binary)
     assert len(t2v(stem_tokens[:1])[0]) == t2v.vector_size
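
The added assertion compares arrays with `np.array_equal` rather than `==`. A short sketch of why that matters, using standalone arrays in place of the fixture's `w2v["OOV"]` (the length 10 mirrors the test and is otherwise arbitrary):

```python
import numpy as np

vec = np.zeros((10,))
expected = np.zeros((10,))

# Element-wise comparison yields an array of 10 booleans, not a single bool.
elementwise = vec == expected

# np.array_equal checks both shape and values and gives one truth value.
same = np.array_equal(vec, expected)

# Using the element-wise result directly in a boolean context is ambiguous
# for arrays with more than one element, so NumPy raises ValueError.
try:
    bool(elementwise)
    ambiguous = False
except ValueError:
    ambiguous = True
```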
