Python too slow to find text in string in for loop

Time:10-22

I want to speed up a loop that counts word occurrences in text. Right now it takes around 5 minutes for 5 records.

DataFrame

No                  Text   
1     I love you forever...*500 other words
2     No , i know that you know xxx *100 words

My word list

wordlist =['i','love','David','Mary',......]

My code to count the words

for i in wordlist :
    df[i] = df['Text'].str.count(i)

Result :

No   Text                  I    love  other_words
 1    I love you ...       1      1      4
 2    No, i know ...       1      0      5  

CodePudding user response:

Try this algorithm

https://en.wikipedia.org/wiki/Aho–Corasick_algorithm

You can also look for ready-made implementations, such as

https://github.com/Guangyi-Z/py-aho-corasick
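
To give an idea of how the algorithm works, here is a minimal pure-Python sketch of Aho-Corasick. It is not the API of the library linked above; the names build_automaton and count_matches are made up for illustration. Note that it counts substring matches (including overlapping ones) in a single pass over the text, so 'i' would also match inside a word like "think", and the text is assumed to be lower-cased by the caller.

```python
from collections import deque

def build_automaton(words):
    """Build the trie, failure links and output lists for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> patterns that end at this state
    for word in dict.fromkeys(words):  # drop duplicate patterns, keep order
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are computed.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def count_matches(text, automaton):
    """Count every occurrence of every pattern in one pass over text."""
    goto, fail, out = automaton
    counts = {}
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            counts[word] = counts.get(word, 0) + 1
    return counts

automaton = build_automaton(['i', 'love', 'know'])
counts = count_matches('I love you forever'.lower(), automaton)
```

The automaton is built once from the word list and then reused for every row, so the cost per text no longer grows with the number of words in the list.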

CodePudding user response:

You can do this by making a Counter from the words in each Text value and converting that into columns (using pd.Series). Then sum the columns for words that aren't in wordlist into other_words and drop those columns:

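import re
from collections import Counter
import pandas as pd

wordlist = list(map(str.lower, wordlist))
# one Counter per row, built from the lower-cased words in Text
counters = df['Text'].apply(lambda t:Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
# every word column that is not in wordlist gets folded into other_words
other_words = list(set(df.columns) - set(wordlist) - { 'No', 'Text' })
df['other_words'] = df[other_words].sum(axis=1)
df = df.drop(other_words, axis=1)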

Output (for the sample data in your question):

   No                                 Text  i  love  other_words
0   1    I love you forever... other words  1     1            4
1   2  No , i know that you know xxx words  1     0            7

Note:

  • I've converted all the words to lower-case so you're not counting I and i separately.
  • I've used re.findall rather than the more obvious split() so that forever... gets counted as the word forever rather than forever...
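
To make the two notes above concrete, here is a small standard-library sketch of the same counting logic for a single text. The helper name count_words is hypothetical and not part of the answer's code; it assumes plain a-z words, as the regex in the answer does.

```python
import re
from collections import Counter

def count_words(text, wordlist):
    """Count the listed words in text; everything else goes to other_words."""
    wanted = list(dict.fromkeys(w.lower() for w in wordlist))
    # Lower-case first so 'I' and 'i' end up in the same bucket.
    counts = Counter(re.findall(r'\b[a-z]+\b', text.lower()))
    result = {w: counts[w] for w in wanted}
    result['other_words'] = sum(n for w, n in counts.items() if w not in wanted)
    return result

sample = 'I love you forever...'
with_split = sample.lower().split()                       # keeps the trailing dots
with_findall = re.findall(r'\b[a-z]+\b', sample.lower())  # strips them

row = count_words('I love you forever... other words', ['i', 'love', 'David', 'Mary'])
```

Here with_split ends in 'forever...' while with_findall ends in 'forever'. A function like this could presumably be applied per row with df['Text'].apply(...), which makes one pass over each text regardless of how long wordlist is.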