Problem with training Word2Vec after opening csv


I'm trying to train a Word2Vec model. When I train the model directly from the Series, everything is fine, but when I save the DataFrame to CSV and then reopen it, I have a problem.

import pandas as pd

data = pd.read_csv('test.txt', sep='\r\n', names=['input'], engine="python")
data = data.dropna().drop_duplicates()
data = data['input'].iloc[:1000].apply(lemmatize)
print(data)

So I get this Series:

0                       [two, households, alike, dignity]
1                          [in, fair, verona, lay, scene]
2             [from, ancient, grudge, break, new, mutiny]
3       [where, civil, blood, makes, civil, hands, unc...
4                  [from, forth, fatal, loins, two, foes]
                              ...                        
1152    [and, therefore, thou, mayst, think, havior, l...
1153              [but, trust, gentleman, i, prove, true]
1154                              [than, coying, strange]
1155                       [i, strange, i, must, confess]
1156             [but, thou, overheard, st, ere, i, ware]
Name: input, Length: 911, dtype: object

When I try to train Word2Vec directly from this Series, everything is fine:

w2v_model = Word2Vec(
    min_count=10,
    window=2,
    vector_size=300,
    negative=10,
    alpha=0.03,
    min_alpha=0.0007,
    sample=6e-5,
    sg=1)

w2v_model.build_vocab(data)
w2v_model.train(data, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
print(w2v_model.wv.key_to_index)
{'i': 0, 'thou': 1, 'and': 2, 'love': 3, 'romeo': 4, 'shall': 5, 'what': 6, 'fair': 7, 'thee': 8, 'but': 9, 'to': 10, 'thy': 11, 'tis': 12, 'the': 13, 'come': 14, 'o': 15, 'night': 16, 'a': 17, 'sampson': 18, 'that': 19, 'capulet': 20, 'montague': 21, 'servingman': 22, 'good': 23, 'gregory': 24, 'one': 25, 'lady': 26, 'my': 27, 'go': 28, 'he': 29, 'nurse': 30, 'she': 31, 'you': 32, 'let': 33, 'would': 34, 'well': 35, 'say': 36, 'sir': 37, 'this': 38, 'for': 39, 'enter': 40, 'old': 41, 'take': 42, 'ay': 43, 'hath': 44, 'benvolio': 45, 'of': 46, 'tell': 47, 'art': 48, 'mercutio': 49, 'know': 50, 'by': 51, 'light': 52, 'see': 53, 'eyes': 54, 'juliet': 55, 'er': 56, 'must': 57, 'word': 58, 'name': 59, 'man': 60, 'men': 61, 'wilt': 62, 'nay': 63, 'peace': 64, 'much': 65, 'it': 66, 'yet': 67, 'house': 68, 'upon': 69, 'which': 70, 'as': 71, 'doth': 72, 'think': 73, 'in': 74}

But when I save the DataFrame to CSV, then open it and train Word2Vec, I get this:

import pandas as pd

data = pd.read_csv('test.txt', sep='\r\n', names=['input'], engine="python")
data = data.dropna().drop_duplicates()
data = data['input'].iloc[:1000].apply(lemmatize)
data = data.dropna()
data.to_csv('test.csv', index=False)
data = pd.read_csv('test.csv')['input']
w2v_model = Word2Vec(
    min_count=10,
    window=2,
    vector_size=300,
    negative=10,
    alpha=0.03,
    min_alpha=0.0007,
    sample=6e-5,
    sg=1)
w2v_model.build_vocab(data)
w2v_model.train(data, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
print(w2v_model.wv.key_to_index)
{"'": 0, ',': 1, ' ': 2, 'e': 3, 'a': 4, 't': 5, 'o': 6, 's': 7, 'r': 8, 'i': 9, 'n': 10, 'h': 11, 'l': 12, ']': 13, '[': 14, 'd': 15, 'u': 16, 'm': 17, 'g': 18, 'c': 19, 'w': 20, 'y': 21, 'p': 22, 'f': 23, 'b': 24, 'v': 25, 'k': 26, 'j': 27, 'q': 28, 'x': 29, 'z': 30}

Series after opening:

0              ['two', 'households', 'alike', 'dignity']
1               ['in', 'fair', 'verona', 'lay', 'scene']
2      ['from', 'ancient', 'grudge', 'break', 'new', ...
3      ['where', 'civil', 'blood', 'makes', 'civil', ...
4      ['from', 'forth', 'fatal', 'loins', 'two', 'fo...
                             ...                        
906    ['and', 'therefore', 'thou', 'mayst', 'think',...
907    ['but', 'trust', 'gentleman', 'i', 'prove', 't...
908                        ['than', 'coying', 'strange']
909             ['i', 'strange', 'i', 'must', 'confess']
910    ['but', 'thou', 'overheard', 'st', 'ere', 'i',...
Name: input, Length: 911, dtype: object

What could be the problem?

CodePudding user response:

First, it'd help to give data that's come from a different place a different variable name than the original data, for clarity of reference and comparison.

For example, instead of loading your saved data as...

data = pd.read_csv('test.csv')['input']

...give it a distinctive name instead:

data_from_csv = pd.read_csv('test.csv')['input']

Then, check whether your original Series data and the later data_from_csv actually look the same to Word2Vec – that is, whether each item of the iterable corpus is the exact same type of object in both cases.

For example, to look at the 1st item in each iterable:

print(next(iter(data)))

…and also for comparison…

print(next(iter(data_from_csv)))

If these aren't identical, then your second data_from_csv case isn't showing Word2Vec the same type of corpus. In particular, each individual text item in the Word2Vec training corpus should be a Python list of individual string tokens.
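The difference is easy to reproduce in miniature – a minimal self-contained round-trip (using an in-memory buffer instead of a file) shows that a list cell survives to_csv() only as its string repr:

```python
import io
import pandas as pd

# A one-row stand-in for the lemmatized Series of token lists.
s = pd.Series([['two', 'households', 'alike', 'dignity']], name='input')

buf = io.StringIO()
s.to_csv(buf, index=False)   # each list cell is written as its string repr
buf.seek(0)
loaded = pd.read_csv(buf)['input']

print(type(s.iloc[0]).__name__)       # list
print(type(loaded.iloc[0]).__name__)  # str
print(loaded.iloc[0])
```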

If you instead pass it only strings, it will see each individual item in each string (a single character) as if it were a token, resulting in the problem you've described: all the model's known words are single characters.
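You can see why with plain Python, no Word2Vec involved:

```python
# A plain string is itself an iterable, so iterating over it yields
# single characters - exactly the "tokens" the model ends up learning here.
text = "['two', 'households', 'alike', 'dignity']"
print(list(text)[:6])  # ['[', "'", 't', 'w', 'o', "'"]
```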

(Are you sure you didn't see a WARNING in your logs/console output to that effect on the 2nd run? The .build_vocab() step checks whether the 1st item in the corpus is a plain string, rather than the proper list, and logs a warning when it is.)

Make sure to do one or the other of:

(1) write the CSV as space-delimited texts – not Python list literals – then re-.split() the raw strings into lists after reading the CSV; or

(2) write the CSV as Python list literals – e.g. with brackets and string-quoting – but then also, after loading those literals as raw strings, interpret them as Python objects (as if using an eval()) to turn them back into lists of tokens

Then the corpus you're feeding to Word2Vec will be in the right format to get the intended individual words. (Option (1) above is usually the preferred approach: space-delimited strings are a simpler/safer/faster format to re-parse or re-split later, with no risk that an eval() could run unintended code.)
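Both options can be sketched like this – a minimal example using an in-memory buffer and a tiny made-up corpus in place of your file and data (ast.literal_eval plays the role of the safe "eval" in option (2), since it only accepts Python literals):

```python
import ast
import io
import pandas as pd

# A tiny made-up corpus standing in for the lemmatized Series.
tokenized = pd.Series([['two', 'households', 'alike', 'dignity'],
                       ['in', 'fair', 'verona', 'lay', 'scene']],
                      name='input')

# Option (1): save space-delimited strings, then re-split after reading.
buf1 = io.StringIO()
tokenized.apply(' '.join).to_csv(buf1, index=False)
buf1.seek(0)
restored1 = pd.read_csv(buf1)['input'].apply(str.split)

# Option (2): save the Python list literals as-is, then parse them back
# with ast.literal_eval (which, unlike a bare eval(), can't run code).
buf2 = io.StringIO()
tokenized.to_csv(buf2, index=False)
buf2.seek(0)
restored2 = pd.read_csv(buf2)['input'].apply(ast.literal_eval)

print(restored1.iloc[0])  # ['two', 'households', 'alike', 'dignity']
print(restored2.iloc[1])  # ['in', 'fair', 'verona', 'lay', 'scene']
```

Either restored Series is again a sequence of lists-of-tokens, which is what Word2Vec expects.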

(As a totally separate issue: using odd non-default values like alpha=0.03, min_alpha=0.0007 usually indicates you're following a bad tutorial that changed these for no good reason. Such a change is unlikely to either hurt or help your results much – it's just an odd and unnecessary choice, hinting at random guesswork and unthinking copying rather than true understanding.)
