I have a df with 1 column containing millions of rows. I have 70 elements in fruit_list. I want to check if only_col contains an item from fruit_list (partial match). Given str.contains
with regex=True slows down performance, I need the most efficient way. I am unsure if converting to series would make this faster.
fruit_list= ["app","appl", "banana"...,"pear"]
df only_col
apple
banana
cherry
Based on How to make pandas dataframe str.contains search faster, I am considering using a list, but unsure how:
col_list= df.only_col.tolist()
df["str_found"] = ["type" in n for n in set(fruit_list).intersection(set(col_list))]
Expected Output
df only_col inlist str_found
0 apple True app, appl
1 banana True banana
2 cherry False NaN (cherry is not in fruit_list)
CodePudding user response:
Fun question. Here's my solution: the thinking is that since the fruit_list
is tractable, we loop through it while performing fast, vectorised operations on the big Series. I used str.contains(regex=False)
. At the end we just aggregate the matches and wrangle them into one series.
df = pd.DataFrame({'only_col': ['apple', 'banana', 'cherry']})
fruit_list= ["app","appl", "banana","pear"]
is_match_series = np.full(len(df), False)
for fruit in fruit_list:
is_match = df.only_col.str.contains(fruit, regex=False)
is_match_series = is_match_series | is_match # logical or
df[fruit] = np.where(is_match, fruit, np.nan)
df['inlist'] = is_match_series
df['matches'] = (df
.filter(fruit_list)
.replace('nan', '')
.T
.agg(','.join)
.str.lstrip(',')
.str.rstrip(','))
df = df.drop(columns=fruit_list)
Output:
only_col inlist matches
0 apple True app,appl
1 banana True banana
2 cherry False
CodePudding user response:
You can try this:
import numpy as np
import pandas as pd
fruit_list= ["app", "appl", "banana", "pear"]
fruits = ['apple', 'banana', 'cherry']
df = pd.DataFrame({'only_col' : fruits})
print(df)
only_col
0 apple
1 banana
2 cherry
df['str_found'] = df.only_col.map(lambda x: ', '.join([f for f in fruit_list if f in x]))
df.str_found = df.str_found.replace('', np.nan)
df.insert(1, 'inlist', df.str_found.notna())
print(df)
only_col inlist str_found
0 apple True app, appl
1 banana True banana
2 cherry False NaN