I have a problem. I want to check whether a certain regex occurs in a text (the regex will become more complex later). My code snippet runs, but it takes a long time. How could I rewrite the code to make it faster and more efficient?
If the element is present in the text, the code number of the respective element should be found and written into a new column. If it is not present, 999 should be written.
Dataframe
   customerId                text element  code
0           1  Something with Cat     cat     0
1           3  That is a huge dog     dog     1
2           3         Hello agian   mouse     2
Code snippet
import re

import pandas as pd

d = {
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ['cat', 'dog', 'mouse']
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
print(df)

def f(x):
    # Fallback value when no element is found in the text
    match = 999
    for element in df['element'].unique():
        check = bool(re.search(element, x['text'], re.IGNORECASE))
        if check:
            # Take the code of the first element that matches
            match = df['code'].loc[df['element'] == element].iloc[0]
            break
    x['test'] = match
    return x

df['test'] = None
df = df.apply(lambda x: f(x), axis=1)
Intended output
   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999
CodePudding user response:
You can use Series.str.contains to build a boolean mask, then use numpy.where to fill the new column with df['code'] where the mask is true and 999 elsewhere.
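import numpy as np
# True where any of the elements occurs in the text (case-insensitive)
mask = df['text'].str.contains('|'.join(df['element']), case=False)
# Cast the int8 category codes to int so that 999 fits in the result
df['test'] = np.where(mask, df['code'].astype(int), 999)
print(df)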
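Note that the mask only records whether any element occurs in the text, not which one, so each matching row receives the code from its own row. A small check of that behaviour, as a sketch: the swapped example texts are an assumption for illustration, and the cast to int64 is added so that 999 fits next to the int8 category codes.

```python
import numpy as np
import pandas as pd

# Same frame as in the question, but with "Dog" and "Cat" swapped in the
# texts (illustrative data, not from the original post)
df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Dog", "That is a huge Cat", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
# Cast the int8 category codes to int64 so that 999 fits in the result
df["code"] = df["element"].astype("category").cat.codes.astype("int64")

# True where any element occurs in the text, regardless of which one
mask = df["text"].str.contains("|".join(df["element"]), case=False)
df["test"] = np.where(mask, df["code"], 999)

print(df["test"].tolist())  # [0, 1, 999]: the row's own code, not the matched element's
```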
But if you want the output for "text": ["Something with Dog", "That is a huge Cat", "Hello agian"] to be [1, 0, 999], you can create a dict that maps each element to its code. If the regex search finds an element in the text, use that element's code from the dict; otherwise use 999.
import re
dct = dict(zip(df['element'].str.lower(), df['code']))
pattern = re.compile("|".join(dct.keys()), re.IGNORECASE)
df['test'] = df['text'].apply(lambda x: dct[pattern.search(x).group(0).lower()] if pattern.search(x) else 999)
print(df)
Output:
   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge Dog     dog     1     1
2           3         Hello agian   mouse     2   999
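The same dict lookup can probably be done without a Python-level apply by letting pandas extract the matched element. This is only a sketch under one assumption: the elements are treated as literal strings (hence re.escape), so the matched text can be mapped back to a dict key. With genuinely complex regexes the matched text would no longer equal the key and this mapping would not work as written. The swapped example texts are the ones mentioned above.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Dog", "That is a huge Cat", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

# Map each lower-cased element to its code
dct = dict(zip(df["element"].str.lower(), df["code"]))

# One capture group holding all elements as alternatives.
# Assumption: elements are plain strings, so they are escaped.
pattern = "(" + "|".join(re.escape(key) for key in dct) + ")"

# First matched element per row, NaN where nothing matches
found = df["text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Translate the matched element to its code, 999 where nothing was found
df["test"] = found.str.lower().map(dct).fillna(999).astype(int)
print(df)
```

For these texts the test column comes out as 1, 0, 999.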
CodePudding user response:
You can use pandas apply to iterate over all rows and check whether each row's element occurs in its text. Here is one solution using numpy:
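import numpy as np
import pandas as pd

d = {
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ['cat', 'dog', 'mouse']
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
# Row-wise, case-insensitive check whether the row's own element occurs in its text
found = df.apply(lambda x: x.element.lower() in x.text.lower(), axis=1)
# Cast the int8 category codes to int so that 999 fits in the result
df['test'] = np.where(found, df['code'].astype(int), 999)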
Output:
   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999
You can also do the same thing with a lambda function in df.apply:
df['test'] = df.apply(lambda x: x.code if x.element.lower() in x.text.lower() else 999, axis=1)
This gives the same result:
   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999
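Since speed was the concern, it may be worth noting that df.apply with axis=1 builds a Series for every row, which tends to be slow on large frames. A plain list comprehension over the zipped columns performs the same per-row check and is usually faster, although this has not been measured here. A minimal sketch with the data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

# Same check as the lambda above: does the row's own element occur in
# the row's text (case-insensitive)? Otherwise fall back to 999.
df["test"] = [
    int(code) if element.lower() in text.lower() else 999
    for text, element, code in zip(df["text"], df["element"], df["code"])
]
print(df)
```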
CodePudding user response:
Use pandas' built-in string methods to make it a lot faster.
# ...
for element in df['element'].unique():
    # str.contains searches anywhere in the text; str.match only matches at the start
    matching = df[df['text'].str.contains(element, case=False)]
    # ...
The matching variable contains all rows whose text matches the given regex (element).
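The parts marked "# ..." are left open in this answer. One way they might be filled in, purely as a sketch: loop once per distinct element, let pandas find all matching rows at once, and write that element's code into the rows that have not been matched yet. The helper names unmatched, pairs and hit are made up for illustration. This follows the first-match-wins behaviour of the function in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

# Start with the fallback value everywhere
df["test"] = 999
# Rows that no element has matched so far (hypothetical helper)
unmatched = pd.Series(True, index=df.index)

# One (element, code) pair per distinct element, in order of first appearance
pairs = df.drop_duplicates("element")[["element", "code"]]
for element, code in pairs.itertuples(index=False):
    # Case-insensitive regex search over the whole column at once
    hit = df["text"].str.contains(element, case=False) & unmatched
    df.loc[hit, "test"] = int(code)
    # Rows matched now are skipped by later elements
    unmatched &= ~hit

print(df)
```

The loop now runs once per element instead of once per row and element, which should help when there are many rows and comparatively few elements.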
You can also read more about pandas and regex on this excellent site: https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/.