I want to improve the performance of a loop that counts word occurrences in text; it currently takes around 5 minutes for just 5 records.
DataFrame:

```
No  Text
1   I love you forever...*500 other words
2   No , i know that you know xxx *100 words
```
My word list:

```python
wordlist = ['i', 'love', 'David', 'Mary', ......]
```
My code to count the words:

```python
for i in wordlist:
    df[i] = df['Text'].str.count(i)
```
Result:

```
No  Text            I  love  other_words
1   I love you ...  1  1     4
2   No, i know ...  1  0     5
```
CodePudding user response:
Try this algorithm:
https://en.wikipedia.org/wiki/Aho–Corasick_algorithm
You can also look for ready-made implementations, such as
https://github.com/Guangyi-Z/py-aho-corasick
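To show the idea behind the suggestion, here is a minimal pure-Python sketch of Aho–Corasick (a trie with failure links, matched in one pass over the text). The function names `build_automaton` and `count_matches` are just illustrative; a real library like the one linked above will be faster and better tested. Note that this counts substring matches, so for whole-word counts you would run it over tokenized text or add boundary checks:

```python
from collections import Counter, deque

def build_automaton(words):
    # Each node holds: character transitions, a failure link, and the
    # patterns that end at this node.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for w in words:
        node = 0
        for ch in w:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(w)
    # BFS to compute failure links (longest proper suffix also in the trie).
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            queue.append(child)
            f = trie[node]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[child]["fail"] = trie[f]["next"].get(ch, 0)
            # Inherit matches reachable through the failure link.
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
    return trie

def count_matches(trie, text):
    # Single pass over the text; all patterns are matched simultaneously.
    counts, node = Counter(), 0
    for ch in text:
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        counts.update(trie[node]["out"])
    return counts

trie = build_automaton(["he", "she", "his", "hers"])
print(count_matches(trie, "ushers"))  # she, he and hers each matched once
```

The point of the failure links is that the text is scanned once, regardless of how many patterns there are, instead of once per word as in the original loop.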
CodePudding user response:
You can do this by making a `Counter` from the words in each `Text` value, then converting that into columns (using `pd.Series`), summing the columns that don't exist in `wordlist` into `other_words`, and then dropping those columns:
```python
import re
from collections import Counter

wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
other_words = list(set(df.columns) - set(wordlist) - { 'No', 'Text' })
df['other_words'] = df[other_words].sum(axis=1)
df = df.drop(other_words, axis=1)
```
Output (for the sample data in your question):

```
   No  Text                                 i  love  other_words
0  1   I love you forever... other words    1  1     4
1  2   No , i know that you know xxx words  1  0     7
```
Note:
- I've converted all the words to lower-case so you're not counting `I` and `i` separately.
- I've used `re.findall` rather than the more obvious `split()` so that `forever...` gets counted as the word `forever` rather than `forever...`.
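To illustrate those two points in isolation, here is a small standard-library-only sketch of the tokenization step on a sentence adapted from the sample data (the variable names are just for illustration):

```python
import re
from collections import Counter

text = "I love you forever... i said"
# Lower-case first so 'I' and 'i' collapse into one token, then extract
# alphabetic runs so 'forever...' becomes the plain word 'forever'.
tokens = re.findall(r'\b[a-z]+\b', text.lower())
counts = Counter(tokens)
print(counts['i'])        # → 2 ('I' and 'i' counted together)
print(counts['forever'])  # → 1 (trailing punctuation stripped by the regex)
```

Once each row's text is reduced to a `Counter` like this, looking up any word from `wordlist` is a dictionary access instead of a fresh scan of the text per word.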