I have a large list of sentences (~5 million) and a much smaller list of key-words (~100 words).
For each key-word, I need to know which sentences contain it. Note that a sentence may contain any number of key-words (including none at all).
Using conventional pythonic commands takes way too long; I need to boost performance significantly. Any recommendations?
My current code:
# df is a dataframe with all of the sentences
context = list()
for w in keywords:
    xdf = df['sentence'].str.contains(w)
    xdf = df[xdf]
    context.append(xdf.values.tolist())
It is better than a double loop, but still too slow.
CodePudding user response:
Approach
Uses the Aho-Corasick algorithm, which matches all keywords in a single pass over each sentence.
- Uses 100 of the most common English words as keywords, taken from deekayen/1-1000.txt
- Generates random Markov-chain sentences with the Essential Generators module
- Aho-Corasick implementation from the pyahocorasick module
Code
import string
import pandas as pd
import ahocorasick as ahc # Word search using Aho-Corasick
from essential_generators import DocumentGenerator # To generate random sentences
# Helper Functions
def make_aho_automaton(keywords):
    '''
    Creates the Aho-Corasick automaton
    '''
    A = ahc.Automaton()              # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key))  # add keys and categories
    A.make_automaton()               # generate automaton
    return A
def find_keywords(line, A):
    '''
    Finds the keywords contained in a line
    '''
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
        found_keywords.append(keyw)
    return found_keywords
def pre_process_sent(s, trans_table=str.maketrans('', '', string.punctuation)):
    '''
    Makes lower case, removes punctuation, and surrounds each
    sentence with whitespace for whole-word Aho-Corasick matching
    (the default translation table is built once, at definition time)
    '''
    return f' {s.translate(trans_table).lower()} '
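For illustration, a hypothetical input and the string it produces:
pre_process_sent('Hello, World!')   # returns ' hello world '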
Testing
# 1. Generate test DataFrame with random sentences
# Document generator
gen = DocumentGenerator()
# Place sentences in DataFrame without punctuation and surrounded by spaces
df = pd.DataFrame({'sentence': [pre_process_sent(gen.sentence()) for _ in range(100000)]})
# 2. Use 100 of the most common English words as keywords (source: https://gist.github.com/deekayen/4148741)
with open('1-1000.txt', 'r') as f:
    # 1,000 most popular English words
    keywords = [line.rstrip() for line in f]
# Use top 100 keywords
keywords = keywords[:100]
# Generate (keyword, category) tuples for Aho-Corasick
# (surround with whitespace so that only whole words match)
keywords_cat = [(f' {w} ', 1) for w in keywords]
# Generate Automaton
A = make_aho_automaton(keywords_cat)
# Find the keywords matched in each sentence of the DataFrame column
df['match'] = df.sentence.apply(lambda x: find_keywords(x, A))
print(df)
Output
Columns
- sentence : rows of sentences
- match : list of keywords that are in each sentence
df
sentence match
0 rare instances foreign affairs the organisati... [ the , of , the ]
1 unangam idiom the emperor of these based on t... [ the , of , these , on , the , from , to ]
2 to dim applied psychology the iaap is conside... [ to , the , is , to , be , a , that ]
3 the females tests take a biopsy or prescribe ... [ the , a , or ]
4 evaporate there linear meters [ there ]
... ... ...
99995 john 1973 discover only []
99996 does it laughter contrary to a series of para... [ it , to , a , of , or ]
99997 contemporary virginia into chiefdoms []
99998 repercussions of site of the acm [ of , of , the ]
99999 history hlabor island followed by a [ by , a ]
100000 rows × 2 columns
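The question asked for the sentences per key-word rather than the key-words per sentence; below is a minimal sketch that inverts the match column into a keyword → row-index mapping (assuming the df built above; keyword_rows is a hypothetical name):
from collections import defaultdict
# Invert per-sentence matches into keyword -> sentence row indices
keyword_rows = defaultdict(list)
for idx, found in df['match'].items():
    for kw in set(found):                    # de-duplicate repeated hits within a sentence
        keyword_rows[kw.strip()].append(idx) # strip the whitespace padding
# e.g. keyword_rows['the'] lists the indices of all sentences containing 'the'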
Performance
Summary
- Aho-Corasick ~25X faster on 100,000 sentences using 100 keywords
Using Aho-Corasick
%timeit df.sentence.apply(lambda x: find_keywords(x, A))
Result: 292 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using Posted Code
%%timeit
context = list()
for w in keywords:
    xdf = df['sentence'].str.contains(w)
    xdf = df[xdf]
    context.append(xdf.values.tolist())
Result: 7.28 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
If everything you do in Python is too slow, maybe (on Linux/macOS) use grep or ag (the Silver Searcher) instead.
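For example, assuming the sentences live in sentences.txt (one per line) and the key-words in keywords.txt (both hypothetical file names), one grep pass per key-word reports the matching line numbers:
# -F: fixed strings, -w: whole words only, -n: print line numbers
while read -r w; do
    grep -Fwn "$w" sentences.txt > "matches_$w.txt"
done < keywords.txt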