Good morning,
I am running some lexical semantic similarity experiments based on word2vec, but I have a few problems with my code, specifically:
First, I want to train on all of the files in a specified directory (18 million words in total), but my code can currently only take a single file.
Second, my corpus is already word-segmented: words are separated by spaces and sentences by an English full stop. I would like word2vec to train on the corpus according to that format, and hope someone can provide code for this.
Third, I need to train on a corpus of more than 18 million words, but training on a 1-million-word corpus currently takes 30 minutes. I hope someone can suggest how to speed up the code.
Fourth, once training has finished, I want to load the trained model again later instead of having to retrain it every time.
That seems like a lot, but these are all very basic questions. I hope someone can suggest changes to the code. Thank you sincerely. My code is shown below. Best wishes to you!
# -*- coding: utf-8 -*-
from gensim.models import word2vec
from gensim.models import Word2Vec
import logging
import gensim
# main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus(u"C:\\Users\\amgalang\\Desktop\\Ph D\\word vector workbook practice two classes before\\\\7.TXT")  # load the corpus
model = word2vec.Word2Vec(sentences, sg=0, min_count=2, window=5, size=100)  # train the model; sg=0 selects CBOW (sg=1 would select skip-gram); the default window is 5
model.save("text2.model")  # path where the model is saved
# example query on the trained model
y2 = model.most_similar(u"BI", topn=20)
print(u"Words semantically related to the compound AH_A=DEGUU, ranked:")
for item in y2:
    print(item[0], item[1])
print("---------\n")
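
For the first, second, and fourth points, one possible approach is sketched below. It is not a tested solution for this corpus. The names `DirectoryCorpus` and `train_or_load` are hypothetical. It assumes that every regular file in the directory is a UTF-8 text file, that sentences end with the character "." and do not span lines, and that a recent Python 3 is used. The vector dimension is left at the gensim default of 100, because the parameter is called `size` before gensim 4.0 and `vector_size` from 4.0 onward.

```python
import os


class DirectoryCorpus:
    """Restartable iterable: yields one list of tokens per sentence,
    reading every regular file in corpus_dir line by line."""

    def __init__(self, corpus_dir, sentence_delimiter=".", encoding="utf-8"):
        self.corpus_dir = corpus_dir
        # Assumption: this character marks the end of a sentence.
        self.sentence_delimiter = sentence_delimiter
        self.encoding = encoding

    def __iter__(self):
        for name in sorted(os.listdir(self.corpus_dir)):
            path = os.path.join(self.corpus_dir, name)
            if not os.path.isfile(path):
                continue  # skip sub-directories
            with open(path, encoding=self.encoding) as handle:
                for line in handle:
                    # Split the line into sentences, then into space-separated words.
                    for chunk in line.split(self.sentence_delimiter):
                        tokens = chunk.split()
                        if tokens:
                            yield tokens


def train_or_load(corpus_dir, model_path, workers=4):
    """Load the model from model_path if it exists; otherwise train and save it."""
    # Imported here so the corpus reader above can be used without gensim installed.
    from gensim.models import Word2Vec

    if os.path.exists(model_path):
        return Word2Vec.load(model_path)  # reuse the saved model, no retraining
    model = Word2Vec(DirectoryCorpus(corpus_dir), sg=0, min_count=2,
                     window=5, workers=workers)
    model.save(model_path)
    return model
```

Calling `train_or_load(corpus_dir, "text2.model")` a second time would then load the saved file rather than train again. Word2Vec makes several passes over its input, so the corpus is a class with `__iter__` rather than a one-shot generator. On the third point, `workers` only helps when gensim's compiled routines are available; if the training log warns that the C extension was not loaded, that slow fallback is a likely cause of the long training time.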