Make it faster to check if a certain regex is present in the text

I want to check whether a certain regex occurs in a text (the regex will become more complex later). My code snippet runs, but it takes a long time. How could I rewrite the code to make it faster and more efficient?

If an element is present in the text, the code number of that element should be looked up and written into a new column. If no element is present, 999 should be written.

Dataframe

   customerId                text element  code
0           1  Something with Cat     cat     0
1           3  That is a huge dog     dog     1
2           3         Hello agian   mouse     2

Code snippet

import pandas as pd
import re

d = {
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ['cat', 'dog', 'mouse']
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
print(df)

def f(x):
    match = 999
    for element in df['element'].unique():
        check = bool(re.search(element, x['text'], re.IGNORECASE))
        if check:
            match = df['code'].loc[df['element'] == element].iloc[0]
            break
    x['test'] = match
    return x

df['test'] = None
df = df.apply(f, axis=1)

Intended output

   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999

CodePudding user response：

You can use pandas Series.str.contains to build a boolean mask, then numpy.where to fill in df['code'] where the mask is True and 999 everywhere else.

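import numpy as np

# True where any element (joined into one alternation) occurs in the text, ignoring case
mask = df['text'].str.contains('|'.join(df['element']), case=False)
# cat.codes is int8, which cannot hold 999, so widen the codes before mixing them
df['test'] = np.where(mask, df['code'].astype('int64'), 999)
print(df)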
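Note that the mask only records that some element matched; the value filled in is always the code stored in the same row. A small check of that behaviour, as a sketch with the text values swapped purely for illustration (the swapped data and the int64 cast are assumptions of this example):

```python
import numpy as np
import pandas as pd

# Illustrative data: row 0 mentions "dog" while its own element is "cat".
df = pd.DataFrame({
    "text": ["Something with Dog", "That is a huge Cat", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
# cat.codes is int8; cast to int64 so that 999 fits next to the codes.
df["code"] = df["element"].astype("category").cat.codes.astype("int64")

mask = df["text"].str.contains("|".join(df["element"]), case=False)
df["test"] = np.where(mask, df["code"], 999)

# Each matching row gets its own code, not the code of the element found.
print(df["test"].tolist())  # [0, 1, 999]
```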

But if you want the code of the element that was actually found, so that "text": ["Something with Dog", "That is a huge Cat", "Hello agian"] gives [1, 0, 999], you can create a dict that maps each element to its code. If the regex search finds an element, use its code from the dict; otherwise use 999.

import re
# Map each lower-cased element to its code
dct = dict(zip(df['element'].str.lower(), df['code']))
pattern = re.compile("|".join(dct.keys()), re.IGNORECASE)

def lookup(text):
    m = pattern.search(text)  # search only once per row
    return dct[m.group(0).lower()] if m else 999

df['test'] = df['text'].apply(lookup)
print(df)

Output:

   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge Dog     dog     1     1
2           3         Hello agian   mouse     2   999
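
The dict lookup works as long as every element is a literal word, because the matched text is then itself a key. Once an element is a real regex, the matched text no longer equals the pattern. One possible workaround, sketched here under the assumption that the element patterns contain no capturing groups of their own, is to wrap each element in its own group and read the group number from the match. The patterns cats? and d[o0]g and the names elements, codes and find_code are made up for this illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Dog", "That is a huge Cat", "Hello agian"],
    # Hypothetical regex elements; use (?:...) for any grouping inside them.
    "element": ["cats?", "d[o0]g", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

# One capturing group per unique element, in a fixed order.
elements = df.drop_duplicates("element")
codes = elements["code"].tolist()
pattern = re.compile(
    "|".join(f"({e})" for e in elements["element"]), re.IGNORECASE
)

def find_code(text, default=999):
    m = pattern.search(text)
    # m.lastindex is the 1-based number of the group that matched,
    # which identifies the element without comparing matched text.
    return codes[m.lastindex - 1] if m else default

df["test"] = df["text"].map(find_code)
print(df["test"].tolist())  # [1, 0, 999]
```

One difference from the loop in the question: a combined pattern reports the leftmost match in the text, whereas the loop reports the first element in list order that occurs anywhere. The two only disagree when a text contains more than one element.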

CodePudding user response：

You can use pandas apply to iterate over the rows and check whether each row's element occurs in its text. Here is one solution using numpy:

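import numpy as np
import pandas as pd

d = {
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ['cat', 'dog', 'mouse']
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
# Row-wise, case-insensitive substring test of each row's own element against its text
found = df.apply(lambda x: x.element.lower() in x.text.lower(), axis=1)
# cat.codes is int8, which cannot hold 999, so widen the codes before mixing them
df['test'] = np.where(found, df['code'].astype('int64'), 999)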

Output:

df
   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999

You can also do the same thing with a lambda function in df.apply:

df['test'] = df.apply(lambda x: x.code if x.element.lower() in x.text.lower() else 999, axis=1)

This gives us the same result:

df
   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999
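
Since the question is about speed, it may be worth knowing that df.apply with axis=1 builds a Series for every row, which tends to be slow on large frames. A sketch of the same row-wise check written as a list comprehension over the columns, which usually avoids that overhead (no timings were measured here, so treat the speed-up as an expectation rather than a result):

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

# Same case-insensitive substring test as above, looping over the three
# columns in parallel instead of calling df.apply(axis=1).
df["test"] = [
    code if element.lower() in text.lower() else 999
    for text, element, code in zip(df["text"], df["element"], df["code"].tolist())
]
print(df["test"].tolist())  # [0, 1, 999]
```

Like the apply versions, this is a plain substring test of each row's own element, not a regex search across all elements.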

CodePudding user response：

Use pandas' built-in string methods to make it a lot faster.

# ...
for element in df['element'].unique():
    # str.contains searches anywhere in the text; str.match only matches at the start
    matching = df[df['text'].str.contains(element, case=False)]
    # ...

The matching variable contains all the rows whose text matches the given regex (element).
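
One way the elided parts of that loop might be filled in, as a sketch rather than the answerer's actual code: it uses str.contains, which searches anywhere in the text and can ignore case, and it treats 999 as "no match yet", which assumes 999 is never a real code. The name code_of is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

# Element -> code lookup, one entry per unique element, in original order.
code_of = dict(zip(df["element"], df["code"].tolist()))

df["test"] = 999  # default when nothing matches
for element, code in code_of.items():
    # Rows whose text contains the regex and that have no match yet,
    # so the first matching element wins, as in the question's loop.
    mask = df["text"].str.contains(element, case=False) & df["test"].eq(999)
    df.loc[mask, "test"] = code

print(df["test"].tolist())  # [0, 1, 999]
```

This loops once per unique element instead of once per row, so each step is a vectorized pass over the whole text column.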

Also, you can read more about pandas and regex on this excellent site: https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/