Search column of lists

Time:10-18

I have a DataFrame df whose column col holds a list of strings in each row. I want to create a column, col_I_want, containing only the elements that exactly match an entry in lookfor.

lookfor=["apple", "nectarine"]
      col                              col_I_want
0   ["apple", "banana", "nectarine"]   ["apple", "nectarine"]
1   ["pear", "banana"]                 np.NaN

If I run the following, I get an error saying that a numpy array object has no attribute apply:

df['col'].apply(lambda x: list(set(x).intersection(lookfor)))

If I run the following instead, I get "pd.Series.__iter__() is not implemented", since I'm using pandas on Spark:

ps.series(df['col']).apply(lambda x: list(set(x).intersection(lookfor)))

I could convert the column to a string, but that would make it harder to find exact matches.

CodePudding user response:

One can do that with pandas.Series.apply and a custom lambda function, as follows:

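import pandas as pd
import numpy as np

# Keep the elements that match an entry in lookfor (case-insensitive); NaN if none match.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df['col_I_want'] = df['col'].apply(lambda x: [i for i in x if i.lower() in [j.lower() for j in lookfor]] if len([i for i in x if i.lower() in [j.lower() for j in lookfor]]) > 0 else np.nan)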

[Out]:

                          col          col_I_want
0  [apple, banana, nectarine]  [apple, nectarine]
1              [pear, banana]                 NaN

One can also do it with a list comprehension, as follows:

df['col_I_want'] = [[i for i in x if i.lower() in [j.lower() for j in lookfor]] if len([i for i in x if i.lower() in [j.lower() for j in lookfor]]) > 0 else np.nan for x in df['col']]

[Out]:

                          col          col_I_want
0  [apple, banana, nectarine]  [apple, nectarine]
1              [pear, banana]                 NaN

Notes:

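  • .lower() is applied to both sides of the comparison to make it case-insensitive; drop it if matches must be case-sensitive.

  • The second if (the conditional expression, ... if ... else ...) returns NaN instead of an empty list when a row has no match.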
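
Both one-liners above build the lowercased lookfor list and the filtered list twice per row. A minimal sketch of a helper that does each step once is shown below; the name find_matches and its case_sensitive parameter are hypothetical, not part of pandas, and math.nan stands in for np.nan so the snippet runs without extra packages.

```python
import math

def find_matches(items, lookfor, case_sensitive=False):
    """Return the elements of `items` that exactly match an entry in `lookfor`.

    Order and duplicates from `items` are preserved.
    NaN is returned when nothing matches.
    """
    # Hypothetical helper: normalise strings only when matching case-insensitively.
    norm = (lambda s: s) if case_sensitive else str.lower
    wanted = {norm(w) for w in lookfor}  # built once per call, O(1) lookups
    matches = [i for i in items if norm(i) in wanted]
    return matches if matches else math.nan

lookfor = ["apple", "nectarine"]
rows = [["Apple", "banana", "nectarine"], ["pear", "banana"]]
results = [find_matches(r, lookfor) for r in rows]
```

With a DataFrame, the same function could presumably be passed to apply, for example df['col'].apply(lambda x: find_matches(x, lookfor)).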

CodePudding user response:

Change the solution to a custom function that returns NaN when there is no match:

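import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [['apple', 'banana', 'nectarine'], ['pear', 'banana']]})

lookfor = ["apple", "nectarine"]

def f(x):
    # Exact matches between the row's list and lookfor (a set, so order is not preserved)
    y = list(set(x).intersection(lookfor))
    return np.nan if len(y) == 0 else y

df['col_I_want'] = df['col'].apply(f)
print(df)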
                          col          col_I_want
0  [apple, banana, nectarine]  [nectarine, apple]
1              [pear, banana]                 NaN
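
The output above shows [nectarine, apple] rather than [apple, nectarine] because a set has no defined order, and it also collapses duplicates. A small sketch comparing the two approaches on a made-up row (the sample data is an assumption for illustration):

```python
lookfor = ["apple", "nectarine"]
row = ["nectarine", "banana", "apple", "apple"]  # hypothetical sample row

# Set intersection: element order is arbitrary and duplicates are dropped.
via_set = list(set(row).intersection(lookfor))

# List comprehension against a set: keeps the row's order and its duplicates.
wanted = set(lookfor)
via_list = [i for i in row if i in wanted]
```

If the order of the original list matters, the list comprehension form is the safer choice; if only membership matters, the set intersection is shorter.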