I have already done this part, but it is way too slow. How can I improve it for 1405000 words and 25000 strings in the other list? My code is below.
data in positiveReviews = [' I love this movie', 'overall movie was great, love it', ...]
data in listOfWords = ['I', 'how', 'particular', 'love', 'movie', ...]
positiveReviews = reviews[labels == 'positive'].str.lower()
negativeReviews = reviews[labels == 'negative'].str.lower()
countsForPositive = {}
countsForNegative = {}
for word in listOfWords:
    countsForPositive.update({word: positiveReviews.str.contains(word).sum()})
    countsForNegative.update({word: negativeReviews.str.contains(word).sum()})
After the code runs I'm expecting a dict containing every word from the word list and the total number of occurrences of that word across all of the strings, i.e. print(countsForPositive) should look like {'I': 1, 'love': 2, 'movie': 2}.
The code works as it should, but it takes too long for a large number of words and a list of 5000 strings.
CodePudding user response:
Instead of looping over every word in the 1405000-word list for each string, I recommend the steps below:
- Build a word-count dictionary for each string
- Merge these dictionaries into one summary dictionary
- Filter the final summary dictionary against the 1405000-word list
This should be much more efficient, since a typical review string contains around 100 distinct words and the word list is only traversed once, at the filtering step.
A simple, short example is shown below.
wordList = ["word1", "word2", "word3", "word4", "word5"]
reviews = ["word1 word1 word2 word2", "word1 word3 word2 word6"]

# step 1 and 2: count every word across all reviews
summaryDict = {}
for review in reviews:
    for word in review.split():
        if word in summaryDict:
            summaryDict[word] += 1
        else:
            summaryDict[word] = 1

# step 3: keep only the words that appear in the word list
filteredDict = {k: v for k, v in summaryDict.items() if k in wordList}
print(filteredDict)
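This prints {'word1': 3, 'word2': 3, 'word3': 1}. Applied to the data from the question, a minimal sketch could look like the following; it assumes positiveReviews and negativeReviews are the lowercased pandas Series shown above and that listOfWords is the word list, and it uses collections.Counter together with a set for fast membership checks (the helper name countWordsInReviews is just for illustration). Note that splitting on whitespace counts whole tokens, unlike str.contains, which also matches substrings.

from collections import Counter

# Assumption: positiveReviews / negativeReviews are the lowercased pandas Series
# from the question, and listOfWords is the list of words to keep.
wordSet = set(listOfWords)  # set membership is O(1), unlike a list

def countWordsInReviews(reviewSeries, wantedWords):
    # Count every token once across all reviews, then keep only the wanted words.
    counts = Counter()
    for review in reviewSeries:
        counts.update(review.split())
    return {word: counts[word] for word in wantedWords if word in counts}

countsForPositive = countWordsInReviews(positiveReviews, wordSet)
countsForNegative = countWordsInReviews(negativeReviews, wordSet)

This keeps the pass over the reviews independent of the size of the word list, which is where the original loop was spending its time.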