Python: How to filter a Pandas Dataframe by matching any value in a list to Column that contains lis-CodePudding

I have a Pandas Dataframe with a column titled 'Ingredients' which contains lists of ingredients that are relevant to a row.

I have a multi checkbox which creates a list of checked items called 'checked_items'. In this example the three items in the list below have been checked, and all others have not.

I would like to remove all rows that don't have any matches between any of the checked boxes and any of the items in the 'Ingredients' column. Any match between any value between checked_itmes and the list of Ingredients is good enough to keep the row, as in the example that follows:

checked_items=['Carrot', 'Celery', 'Onion']

EXAMPLE:
Col_1  Col_2 Ingredients
"a"    "e"   [Carrot, Ginger, Curry]
"b"    "f"   [Butter, Shallots]
"c"    "g"   [Celery, Onion, Sage, Thyme]

DESIRED RESULT:
EXAMPLE:
Col_1  Col_2 Ingredients
"a"    "e"   [Carrot, Ginger, Curry]
"c"    "g"   [Celery, Onion, Sage, Thyme]

Beware, my attempt below is very amateur. I hacked together a few other Stack Overflow answers to get to this. My mask filter is the correct length, and does filter the dataframe, but doesn't filter the data correctly. Beyond that, there must be a cleaner, nicer way to do this. I assume this method will also become painfully slow as the dataframe grows.


mask=[]
for ingredient_list in df['Ingredients'].to_list():
    
    if not ingredient_list:
        mask.append(False)
        continue

    i=0
    try:
        for ingredient in ingredient_list:
            for checked_item in checked_items:
                if checked_item == ingredient:
                    mask.append(True)
                    raise StopIteration

            i=i 1
            if i==len(categories):
                mask.append(False)

    except StopIteration:
        continue

filtered_df = df[mask]

Thank you so much for your help in advance

CodePudding user response：

You can explode your list of ingredients and check them with isin:

m = df['Ingredients'].explode().isin(checked_items).groupby(level=0).max()
print(df[m])

# Output
  Col_1 Col_2                   Ingredients
0     a     e       [Carrot, Ginger, Curry]
2     c     g  [Celery, Onion, Sage, Thyme]

Step by step:

# Explode each list of ingredients
>>> m = df['Ingredients'].explode()
0      Carrot
0      Ginger
0       Curry
1      Butter
1    Shallots
2      Celery
2       Onion
2        Sage
2       Thyme
Name: Ingredients, dtype: object

# Check ingredients
>>> m = m.isin(checked_items)
0     True
0    False
0    False
1    False
1    False
2     True
2     True
2    False
2    False
Name: Ingredients, dtype: bool

# Group by index and keep the highest value (True > False)
>>> m = m.groupby(level=0).max()
0     True
1    False
2     True
Name: Ingredients, dtype: bool

# Filter out your data
>>> df[m]
  Col_1 Col_2                   Ingredients
0     a     e       [Carrot, Ginger, Curry]
2     c     g  [Celery, Onion, Sage, Thyme]

CodePudding user response：

For improve perfromance use set.isdisjoint and filter in boolean indexing:

df = df[~df['Ingredients'].map(set(checked_items).isdisjoint)]
print (df)
  Col_1 Col_2                   Ingredients
0     a     e       [Carrot, Ginger, Curry]
2     c     g  [Celery, Onion, Sage, Thyme]

Performance: For 3k rows, best test in real data:

#3k rows
df = pd.concat([df] * 1000, ignore_index=True)


In [83]: %timeit df[~df['Ingredients'].map(set(checked_items).isdisjoint)]
839 µs ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#Corralien solution - 5.4 times slowier
In [84]: %timeit df[df['Ingredients'].explode().isin(checked_items).groupby(level=0).max()]
4.58 ms ± 406 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For 300k rows:

df = pd.concat([df] * 100000, ignore_index=True)


In [87]: %timeit df[~df['Ingredients'].map(set(checked_items).isdisjoint)]
50.4 ms ± 720 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [88]: %timeit df[df['Ingredients'].explode().isin(checked_items).groupby(level=0).max()]
398 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response：

assuming that in your dataframe ingredients are written as '[Carrot, Ginger, Curry]' and not as '["Carrot", "Ginger", "Curry"]', you can use list comperhension and string manipulation to create a mask column

df.loc[:, "Mask_Column"] = df["Ingredients"].apply(lambda i:  any([j for j in [i.strip(" ") for i in string.strip("]").strip("[").split(",")] if j in checked_items])

Then get your dataframe by using

result = df[df["Mask_Column"]]

The strip methods are used to remove spaces and brackets, and split method transforms string to list in order to use list comperhension. Finally, use any() to check if cross of ingredients in row and ingredients in checked_items has any values. Alternatively, you could use bool(), since bool([]) is False, and if list has any elements bool() returns True.