I have a DataFrame with a list in one column, and I want to check whether that column's value contains at least one element from each sub-list in a list of lists.
Example:
import pandas as pd
lstOflst = [['a', '5', '3', 'x'], ['1', 'e'], ['g', '7','x']]
data = [
['a', 'b', ['1','2','a'], '2d'],
['d', 'c', ['2','3','e'], '3b'],
['a', 'e', ['x','g','1'], '6a']
]
cols = ['A', 'B', 'C', 'D']
df = pd.DataFrame(data, columns=cols)
I tried:
lst_set = set(lstOflst)
result = df[[len(lst_set.intersection(l.split()))>=1 for l in df['C']]]
This raises an error, and I know the approach is wrong, but I am not sure what to do next. The code was adapted from "Find rows in dataframe that must contain at least 2 elements from a list", which doesn't quite fit my case.
Expected resulting dataframe should be:
['a', 'e', ['x','g','1'], '6a']
This row is returned because its column C value contains at least n = 1 element from each sub-list in lstOflst. n can be any number, not just 1.
CodePudding user response:
You can use:
# transform the list to sets, once, for efficiency
sets = [set(l) for l in lstOflst]
# [{'3', '5', 'a', 'x'}, {'1', 'e'}, {'7', 'g', 'x'}]
# for each list in "C", is the intersection non-empty for
# all the sets in "sets"?
mask = [all(S.intersection(l) for S in sets) for l in df['C']]
# [False, False, True]
df[mask]
output:
   A  B          C   D
2  a  e  [x, g, 1]  6a
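For reference, the attempt in the question appears to fail for two separate reasons, which the short sketch below reproduces: lists are not hashable, so set(lstOflst) raises a TypeError, and the values in column C are already lists, so they have no .split() method. The names first_error and second_error are only illustrative and are not part of the original code.

```python
lstOflst = [['a', '5', '3', 'x'], ['1', 'e'], ['g', '7', 'x']]

# 1) A set can only hold hashable items, and lists are not hashable.
try:
    set(lstOflst)
    first_error = None
except TypeError as exc:
    first_error = exc

# 2) The cells of column C are already lists; .split() is a str method.
try:
    ['1', '2', 'a'].split()
    second_error = None
except AttributeError as exc:
    second_error = exc

print(type(first_error).__name__, "/", type(second_error).__name__)
```

Converting each sub-list to its own set, as done above, avoids the first problem, and passing the list straight to intersection() avoids the second.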
NB. Since you want at least 1 element, there is no need to check the length explicitly: any non-empty intersection is truthy. If you wanted an intersection of at least 2 items with each set, however, you would need:
N = 2
mask = [all(len(S.intersection(l))>=N for S in sets) for l in df['C']]
Variant (see comments): at least N items in common with at least M of the sets:
mask = [sum(len(S.intersection(l))>=N for S in sets)>=M for l in df['C']]
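If this filter is needed in several places, the masks above could be wrapped in one small helper. The following is only a sketch: the function name rows_matching and its parameters col, groups, n and m are made up for illustration, and m=None is assumed to mean "every set must match".

```python
import pandas as pd

def rows_matching(df, col, groups, n=1, m=None):
    """Keep rows whose list in `col` shares at least `n` items
    with at least `m` of the `groups` (default: all of them)."""
    # build the sets once, not once per row
    sets = [set(g) for g in groups]
    if m is None:
        m = len(sets)
    mask = [
        # count how many sets share >= n items with this cell
        sum(len(s.intersection(cell)) >= n for s in sets) >= m
        for cell in df[col]
    ]
    return df[mask]

lstOflst = [['a', '5', '3', 'x'], ['1', 'e'], ['g', '7', 'x']]
data = [
    ['a', 'b', ['1', '2', 'a'], '2d'],
    ['d', 'c', ['2', '3', 'e'], '3b'],
    ['a', 'e', ['x', 'g', '1'], '6a'],
]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# default: at least 1 item in common with every sub-list
result = rows_matching(df, 'C', lstOflst)
print(result)
```

With the defaults this keeps only the row at index 2, as in the first mask. Passing m=2 relaxes the condition to any two of the three sub-lists, which keeps all three rows of the example.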