How to speed up computing sentence similarity using spacy in Python?


I have the following code, which takes in two sentences and returns their similarity:

import spacy
import numpy as np

nlp = spacy.load("en_core_web_md/en_core_web_md-3.2.0")

def get_categories_nlp_sim(cat_1, cat_2):

    # NaN never compares equal to itself, so this catches missing values
    if (cat_1 != cat_1) or (cat_2 != cat_2):
        s = np.nan
    else:
        doc1 = nlp(cat_1)
        doc2 = nlp(cat_2)

        s = doc1.similarity(doc2)

    return s

This seems to give reasonable results, but when used in a for loop over ~1M rows it becomes too slow to be practical.

Any ideas on how to speed this up? Or perhaps another NLP library that could do the same thing faster?

Thanks!

CodePudding user response:

If you truly have 1M rows and compare them pairwise, you end up with an astronomical number of comparisons. spaCy's nlp() also does a whole lot more than just the work needed for the similarity.

What spaCy's similarity() does is take the processed document's vector and compute a cosine similarity (the document vector is the average of the word vectors); check out the source code.
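
For reference, here is a minimal sketch of that computation done by hand with numpy, assuming doc1 and doc2 were produced by nlp() as in the question (and ignoring the zero-vector edge case spaCy guards against):

import numpy as np

v1, v2 = doc1.vector, doc2.vector  # document vectors = mean of the token vectors
s = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))  # same value as doc1.similarity(doc2)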

So probably the most efficient way to replicate this similarity for that many pairs would be: get a semantic vector for each unique token in the corpus using something like Gensim's pretrained word2vec model, average the token vectors in each row to get a document vector, and then, once you have those 1M document vectors as numpy arrays, compute the cosine similarities with numpy or scipy, which is drastically faster than pure Python. A sketch of this approach follows.
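
A rough sketch of that approach, assuming Gensim's downloadable "word2vec-google-news-300" vectors (any pretrained KeyedVectors model would do) and hypothetical lists texts_1 and texts_2 holding the paired strings:

import numpy as np
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # pretrained word2vec KeyedVectors

def doc_vector(text):
    # very simple whitespace tokenization; average the vectors of the tokens
    # the model knows, falling back to a zero vector if none match
    vecs = [kv[tok] for tok in text.lower().split() if tok in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

vecs_1 = np.vstack([doc_vector(t) for t in texts_1])
vecs_2 = np.vstack([doc_vector(t) for t in texts_2])

# row-wise cosine similarity over all pairs at once
sims = np.sum(vecs_1 * vecs_2, axis=1) / (
    np.linalg.norm(vecs_1, axis=1) * np.linalg.norm(vecs_2, axis=1) + 1e-9)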

Also check out this thread, which addresses a similar question to yours: Efficient way for Computing the Similarity of Multiple Documents using Spacy

I'm not sure what the main goal of your code is, but I'm fairly sure that computing every pairwise similarity is either not required or at least not the best way to reach it, so please share more about the context in which you need this method.

CodePudding user response:

After going through the answers and this other related thread Efficient way for Computing the Similarity of Multiple Documents using Spacy, I managed to get a significant speed-up.

I am now using the following code:

nlp = spacy.load("en_core_web_md",
                 exclude=["tagger", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"])

# nlp.pipe processes the texts in batches instead of one nlp() call per string
processed_docs_1 = nlp.pipe(texts_1)
processed_docs_2 = nlp.pipe(texts_2)

for _ in range(len(texts_1)):

    doc_1 = next(processed_docs_1)
    doc_2 = next(processed_docs_2)

    s = doc_1.similarity(doc_2)

where texts_1 and texts_2 are lists of equal length containing the pairs to compare (i.e. texts_1[i] is compared with texts_2[i]).

Adding the "exclude" argument to spacy.load resulted in a ~2x speed-up. Using nlp.pipe instead of calling nlp inside the loop resulted in a ~10x speed-up. Combined, I get a ~20x speed-up.
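
If it helps, the same pairing can also be written with zip(), and nlp.pipe's batch_size can be tuned for further gains (the value below is just an illustration, not something I benchmarked):

sims = []
for doc_1, doc_2 in zip(nlp.pipe(texts_1, batch_size=256),
                        nlp.pipe(texts_2, batch_size=256)):
    sims.append(doc_1.similarity(doc_2))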
