Check df column containing list of values for the presence of at least n elements from each sublist

Time:06-29

I have a DataFrame with a column whose values are lists, and I want to check whether each value contains at least one element from every sub-list in a list of lists.

Example:

import pandas as pd
lstOflst = [['a', '5', '3', 'x'], ['1', 'e'], ['g', '7','x']]
data = [
    ['a', 'b', ['1','2','a'], '2d'],
    ['d', 'c', ['2','3','e'], '3b'],
    ['a', 'e', ['x','g','1'], '6a']
]
cols = ['A', 'B', 'C', 'D']
df = pd.DataFrame(data, columns=cols)

I tried:

lst_set = set(lstOflst)
result = df[[len(lst_set.intersection(l.split()))>=1 for l in df['C']]]

It did not work: it throws an error, and I know it is wrong, but I am not sure what to do next. The code was adapted from here: Find rows in dataframe that must contain at least 2 elements from a list, but it doesn't quite fit my case.

The expected resulting dataframe contains only the row ['a', 'e', ['x','g','1'], '6a'], because it is the only row whose column C value contains at least n = 1 element from each sub-list in lstOflst. n can be any number, not just 1.

CodePudding user response:

You can use:

# transform the list to sets, once, for efficiency
sets = [set(l) for l in lstOflst]
# [{'3', '5', 'a', 'x'}, {'1', 'e'}, {'7', 'g', 'x'}]

# for each list in "C", is there a non-empty intersection
# with every set in "sets"?
mask = [all(S.intersection(l) for S in sets) for l in df['C']]
# [False, False, True]

df[mask]

output:

   A  B          C   D
2  a  e  [x, g, 1]  6a
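For context, the attempt in the question appears to fail for two separate reasons: a list of lists cannot be turned into a set directly (lists are not hashable), and the values in column C are already lists, so they have no .split() method. The short sketch below reproduces both errors in plain Python; the errors list is only a hypothetical helper used to collect the exception names.

```python
lstOflst = [['a', '5', '3', 'x'], ['1', 'e'], ['g', '7', 'x']]

errors = []  # hypothetical helper: names of the exceptions we hit

# 1) set() needs hashable items, and the inner lists are not hashable
try:
    set(lstOflst)
except TypeError as exc:
    errors.append(type(exc).__name__)

# 2) a value of column C is already a list, so there is nothing to split
try:
    ['x', 'g', '1'].split()
except AttributeError as exc:
    errors.append(type(exc).__name__)

print(errors)  # ['TypeError', 'AttributeError']
```

This is why the approach above builds one set per sub-list and intersects each set with the list as it is.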

NB. As you want at least 1 element, there is no need to explicitly check the length, since any non-empty intersection is truthy. If you wanted an intersection of at least 2 items with each set, however, you would need:

N = 2
mask = [all(len(S.intersection(l))>=N for S in sets) for l in df['C']]
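As a minimal sketch, the same check can be wrapped in a small function so that N becomes a parameter. The function name rows_matching_all and the plain list col_c (standing in for df['C'] with the values from the question) are assumptions made for illustration, not part of the original answer.

```python
def rows_matching_all(rows, groups, n=1):
    """Boolean mask: True where a row shares at least n distinct items with every group."""
    sets = [set(g) for g in groups]  # build the sets once
    return [all(len(s.intersection(row)) >= n for s in sets) for row in rows]

lstOflst = [['a', '5', '3', 'x'], ['1', 'e'], ['g', '7', 'x']]
# same values as df['C'] in the question (stand-in for the real column)
col_c = [['1', '2', 'a'], ['2', '3', 'e'], ['x', 'g', '1']]

mask_n1 = rows_matching_all(col_c, lstOflst, n=1)
mask_n2 = rows_matching_all(col_c, lstOflst, n=2)
print(mask_n1)  # [False, False, True]
print(mask_n2)  # [False, False, False]
```

With the DataFrame from the question, passing df['C'] as rows should give a mask usable as df[mask]. Note that the intersection counts distinct items, so a value repeated inside a row only counts once.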

variant (see comments)

At least N items in common with at least M of the sets (M must be defined first, like N above):

mask = [sum(len(S.intersection(l))>=N for S in sets)>=M for l in df['C']]
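A runnable sketch of this variant, under the same assumptions as the question's data: the function name rows_matching_m_sets and the list col_c (a stand-in for df['C']) are hypothetical. It relies on the fact that summing booleans counts the True values.

```python
def rows_matching_m_sets(rows, groups, n=1, m=1):
    """Boolean mask: True where at least m groups share at least n distinct items with the row."""
    sets = [set(g) for g in groups]
    # sum() over booleans counts how many sets reach the threshold n
    return [sum(len(s.intersection(row)) >= n for s in sets) >= m for row in rows]

lstOflst = [['a', '5', '3', 'x'], ['1', 'e'], ['g', '7', 'x']]
# same values as df['C'] in the question (stand-in for the real column)
col_c = [['1', '2', 'a'], ['2', '3', 'e'], ['x', 'g', '1']]

mask_m2 = rows_matching_m_sets(col_c, lstOflst, n=1, m=2)
mask_m3 = rows_matching_m_sets(col_c, lstOflst, n=1, m=3)
print(mask_m2)  # [True, True, True]
print(mask_m3)  # [False, False, True]
```

Setting m to the number of sub-lists gives the same mask as the all(...) version.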