Home > other >  Str contain using lists
Str contain using lists

Time:09-06

I have a df with 1 column containing millions of rows. I have 70 elements in fruit_list. I want to check if only_col contains an item from fruit_list (partial match). Given str.contains with regex=True slows down performance, I need the most efficient way. I am unsure if converting to series would make this faster.

fruit_list= ["app","appl", "banana"...,"pear"]
df  only_col
    apple
    banana
    cherry

Based on How to make pandas dataframe str.contains search faster, I am considering using a list, but unsure how:

col_list= df.only_col.tolist()
df["str_found"] = ["type" in n for n in set(fruit_list).intersection(set(col_list))]

Expected Output

df  only_col  inlist      str_found
0    apple     True       app, appl     
1    banana    True       banana
2    cherry    False       NaN           (cherry is not in fruit_list)

CodePudding user response:

Fun question. Here's my solution: the thinking is that since the fruit_list is tractable, we loop through it while performing fast, vectorised operations on the big Series. I used str.contains(regex=False). At the end we just aggregate the matches and wrangle them into one series.

df = pd.DataFrame({'only_col': ['apple', 'banana', 'cherry']})
fruit_list= ["app","appl", "banana","pear"]

is_match_series = np.full(len(df), False)

for fruit in fruit_list:
    is_match = df.only_col.str.contains(fruit, regex=False)
    is_match_series = is_match_series | is_match  # logical or
    df[fruit] = np.where(is_match, fruit, np.nan)

df['inlist'] = is_match_series
df['matches'] = (df
                .filter(fruit_list)
                .replace('nan', '')
                .T
                .agg(','.join)
                .str.lstrip(',')
                .str.rstrip(','))

df = df.drop(columns=fruit_list)

Output:

    only_col    inlist  matches
0   apple       True    app,appl
1   banana      True    banana
2   cherry      False   

CodePudding user response:

You can try this:

import numpy as np
import pandas as pd

fruit_list= ["app", "appl", "banana", "pear"]
fruits = ['apple', 'banana', 'cherry']
df = pd.DataFrame({'only_col' : fruits})

print(df)

  only_col
0    apple
1   banana
2   cherry

df['str_found'] = df.only_col.map(lambda x: ', '.join([f for f in fruit_list if f in x]))
df.str_found = df.str_found.replace('', np.nan)
df.insert(1, 'inlist', df.str_found.notna())

print(df)

  only_col  inlist  str_found
0    apple    True  app, appl
1   banana    True     banana
2   cherry   False        NaN
  • Related