Doc2Vec results not as expected

I'm evaluating Doc2Vec for a recommender API. I wasn't able to find a decent pre-trained model, so I trained a model on the corpus, which is about 8,000 small documents.

    model = Doc2Vec(vector_size=25,
                    alpha=0.025,
                    min_alpha=0.00025,
                    min_count=1,
                    dm=1)

I then looped through the corpus to find similar documents for each document. Results were not very good (in comparison to TF-IDF). Note this is after testing different epochs and vector sizes.

    inferred_vector = model.infer_vector(row['cleaned'].split())
    sims = model.docvecs.most_similar([inferred_vector], topn=4)

I also tried extracting the trained vectors and using cosine_similarity, but results were oddly even worse.

    cosine_similarities = cosine_similarity(model.docvecs.vectors_docs, model.docvecs.vectors_docs)

Am I doing something wrong or is the smallish corpus the problem?

Edit: Prep and training code

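    import html

    import texthero as hero
    from texthero import preprocessing
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    def unesc(s):
        # Decode HTML entities (e.g. "&amp;" -> "&") in every row of the Series.
        # Series.iteritems() was removed in pandas 2.0; apply() works across versions.
        return s.apply(html.unescape)

    # ds (a DataFrame with a 'body' column) and custom_stopwords are defined elsewhere.
    custom_pipeline = [
        preprocessing.lowercase,
        unesc,
        preprocessing.remove_urls,
        preprocessing.remove_html_tags,
        preprocessing.remove_diacritics,
        preprocessing.remove_digits,
        preprocessing.remove_punctuation,
        lambda s: hero.remove_stopwords(s, stopwords=custom_stopwords),
        preprocessing.remove_whitespace,
        preprocessing.tokenize
    ]

    # After the pipeline, each row of 'cleaned' is a list of tokens.
    ds['cleaned'] = ds['body'].pipe(hero.clean, pipeline=custom_pipeline)
    w2v_total_data = list(ds['cleaned'])
    # One TaggedDocument per row, tagged with its position in the corpus.
    tag_data = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(w2v_total_data)]

    model = Doc2Vec(vector_size=25,
                    alpha=0.025,
                    min_alpha=0.00025,
                    min_count=1,
                    epochs=20,
                    dm=1)

    model.build_vocab(tag_data)
    model.train(tag_data, total_examples=model.corpus_count, epochs=model.epochs)
    model.save("lmdocs_d2v.model")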

CodePudding user response:

Without seeing your training code, it's hard to rule out errors in text prep & training. Many online code examples are bonkers wrong in their Doc2Vec training technique!

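Note that min_count=1 is essentially always a bad idea with this sort of algorithm: words that appear only once or twice don't have enough usage examples to get good vectors, and in aggregate they act as noise that worsens the vectors for everything else. (The gensim default is min_count=5.) Any example suggesting min_count=1 was likely from a misguided author.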
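As a rough way to see what a higher threshold would discard, the sketch below counts word frequencies in a tokenized corpus and splits the vocabulary at a given cutoff. The toy corpus and the helper name vocab_survival are made up for illustration; this only mimics a frequency cutoff and is not gensim's actual vocabulary-building code.

```python
from collections import Counter


def vocab_survival(tokenized_docs, min_count):
    """Return (kept_words, dropped_words) for a frequency cutoff of min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    kept = {word for word, count in counts.items() if count >= min_count}
    dropped = set(counts) - kept
    return kept, dropped


# Toy corpus standing in for the real tokenized documents (hypothetical data).
docs = [
    ["cheap", "flights", "to", "paris"],
    ["cheap", "hotels", "in", "paris"],
    ["flights", "to", "rome"],
]

kept, dropped = vocab_survival(docs, min_count=2)
print("kept:", sorted(kept))
print("dropped:", sorted(dropped))
```

On a real corpus, the share of words that fall below the cutoff is usually large, which gives a feel for how much of the training effort min_count=1 spends on words seen only once.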

Is a mere .split() also the only tokenization applied for training? (The inference list-of-tokens should be prepped the same as the training lists-of-tokens.)
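To illustrate why that matters, here is a minimal sketch in which one tokenizer function is shared by the training and inference paths. The tokenize helper and its regex are assumptions for the example, not the preprocessing used in the question; the point is only that both paths call the same function.

```python
import re


def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


# Same function for the text used in training and for the text used at inference.
train_tokens = tokenize("Cheap flights to Paris!")
query_tokens = tokenize("cheap FLIGHTS, to paris")

# A bare .split() keeps case and punctuation, so it yields different tokens.
naive_tokens = "Cheap flights to Paris!".split()

print(train_tokens)
print(query_tokens)
print(naive_tokens)
```

Tokens like "Cheap" or "Paris!" would not match a vocabulary built from lowercased, punctuation-stripped text, so a model trained that way would have nothing to go on for them at inference time.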

How were "not very good" and "oddly even worse" evaluated? For example, did the results seem arbitrary, or in-the-right-direction-but-just-weak?
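One way to put a number on this is a self-retrieval sanity check: re-infer a vector for each training document and see where that document's own trained vector ranks among the nearest neighbours. The sketch below does this on plain lists of floats so it runs without a model. The names self_ranks and cosine and the toy vectors are illustrative assumptions; in practice the inferred vectors would come from model.infer_vector and the trained ones from the model's stored document vectors.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def self_ranks(inferred, trained):
    """For each document i, return the rank (0 = most similar) of trained[i]
    when all trained vectors are sorted by similarity to inferred[i]."""
    ranks = []
    for i, query in enumerate(inferred):
        sims = [cosine(query, vec) for vec in trained]
        order = sorted(range(len(trained)), key=lambda j: sims[j], reverse=True)
        ranks.append(order.index(i))
    return ranks


# Toy vectors standing in for real model output (hypothetical data).
trained = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
inferred = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]

ranks = self_ranks(inferred, trained)
print(Counter(ranks))  # how many documents landed at each rank
```

If most documents rank themselves at or near 0, training and inference are at least self-consistent and the weakness is more likely in the data or parameters. If the ranks look random, a bug in preprocessing or training is the more likely culprit.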

"8,000 small documents" is a bit on the thin side for a training corpus, but it somewhat depends on "how small" – a few words, a sentence, a few sentences? Moving to smaller vectors, or more training epochs, can sometimes make the best of a smallish training set - but this sort of algorithm works best with lots of data, such that dense 100d-or-more vectors can be trained.