I have a large list of sentences (~5 million) and a much smaller list of key-words (~100 words).
For each key-word, I need to know which sentences contain it. Note that a sentence may contain any number of key-words (including none at all).
Using conventional pythonic commands takes way too long; I need to boost performance significantly. Any recommendations?
My current code:
# df is a dataframe with all of the sentences
context = list()
for w in keywords:
    xdf = df['sentence'].str.contains(w)
    xdf = df[xdf]
    context.append(xdf.values.tolist())
It is better than a double loop, but still too slow.
CodePudding user response:
Approach
Uses the Aho-Corasick algorithm, which matches all keywords in a single pass over each sentence.
- Uses 100 of the most common English words as keywords, taken from deekayen/1-1000.txt
- Generates random Markov-chain sentences with the Essential Generators module
- Aho-Corasick implementation from the pyahocorasick module
Code
import string
import pandas as pd
import ahocorasick as ahc # Word search using Aho-Corasick
from essential_generators import DocumentGenerator # To generate random sentences
# Helper Functions
def make_aho_automaton(keywords):
    '''
    Creates the Aho-Corasick automaton
    '''
    A = ahc.Automaton()              # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key))  # add keys and categories
    A.make_automaton()               # generate automaton
    return A
def find_keywords(line, A):
    '''
    Finds the keywords contained in a line
    '''
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
        found_keywords.append(keyw)
    return found_keywords
def pre_process_sent(s, trans_table=str.maketrans('', '', string.punctuation)):
    '''
    Makes lower case, removes punctuation, and surrounds each
    sentence with whitespace for whole-word Aho-Corasick matching
    (the default translation table is built once, at definition time)
    '''
    return f' {s.translate(trans_table).lower()} '
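For illustration, a hypothetical input and the string it produces:
pre_process_sent('Hello, World!')   # returns ' hello world '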
Testing
# 1. Generate test DataFrame with random sentences
# Document generator
gen = DocumentGenerator()
# Place sentences in DataFrame without punctuation and surrounded by spaces
df = pd.DataFrame({'sentence': [pre_process_sent(gen.sentence()) for _ in range(100000)]})
# 2. Use 100 of the most common English words as keywords (source: https://gist.github.com/deekayen/4148741)
with open('1-1000.txt', 'r') as f:
    # 1,000 most popular English words
    keywords = [line.rstrip() for line in f]
# Use top 100 keywords
keywords = keywords[:100]
# Generate (keyword, category) tuples for Aho-Corasick
# (surround with whitespace so that only whole words match)
keywords_cat = [(f' {w} ', 1) for w in keywords]
# Generate Automaton
A = make_aho_automaton(keywords_cat)
# Find the keywords matched in each sentence of the DataFrame column
df['match'] = df.sentence.apply(lambda x: find_keywords(x, A))
print(df)
Output
Columns
- sentence : rows of sentences
- match : list of keywords that are in each sentence
df
sentence match
0 rare instances foreign affairs the organisati... [ the , of , the ]
1 unangam idiom the emperor of these based on t... [ the , of , these , on , the , from , to ]
2 to dim applied psychology the iaap is conside... [ to , the , is , to , be , a , that ]
3 the females tests take a biopsy or prescribe ... [ the , a , or ]
4 evaporate there linear meters [ there ]
... ... ...
99995 john 1973 discover only []
99996 does it laughter contrary to a series of para... [ it , to , a , of , or ]
99997 contemporary virginia into chiefdoms []
99998 repercussions of site of the acm [ of , of , the ]
99999 history hlabor island followed by a [ by , a ]
100000 rows × 2 columns
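The question asked for the sentences per key-word rather than the key-words per sentence; below is a minimal sketch that inverts the match column into a keyword → row-index mapping (assuming the df built above; keyword_rows is a hypothetical name):
from collections import defaultdict
# Invert per-sentence matches into keyword -> sentence row indices
keyword_rows = defaultdict(list)
for idx, found in df['match'].items():
    for kw in set(found):                    # de-duplicate repeated hits within a sentence
        keyword_rows[kw.strip()].append(idx) # strip the whitespace padding
# e.g. keyword_rows['the'] lists the indices of all sentences containing 'the'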
Performance
Summary
- Aho-Corasick ~25X faster on 100,000 sentences using 100 keywords
Using Aho-Corasick
%timeit df.sentence.apply(lambda x: find_keywords(x, A))
Result: 292 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using Posted Code
%%timeit
context = list()
for w in keywords:
    xdf = df['sentence'].str.contains(w)
    xdf = df[xdf]
    context.append(xdf.values.tolist())
Result: 7.28 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
If everything you do in Python is too slow, maybe (on Linux/macOS) use grep or ag (the Silver Searcher) instead.
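For example, assuming the sentences live in sentences.txt (one per line) and the key-words in keywords.txt (both hypothetical file names), one grep pass per key-word reports the matching line numbers:
# -F: fixed strings, -w: whole words only, -n: print line numbers
while read -r w; do
    grep -Fwn "$w" sentences.txt > "matches_$w.txt"
done < keywords.txt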