I have a dataframe like this (note: C is the count of values in A):
A                    B   C
ex, one, two, three  X1  4
ex, one, two         X2  3
one, two, four       X3  3
ex, three            X4  2
four, ex             X5  2
four, one            X6  2
I want to delete all rows in which the value in column A is a subset of another value in column A. So the result should look like
A                    B   C
ex, one, two, three  X1  4
one, two, four       X3  3
four, ex             X5  2
I tried the following approach:
dataframe[np.sum(np.array([[y in x for x in dataframe.A.values] for y in dataframe.A.values]), 1) == 1]
However, `y in x` is a substring test, so this only catches values that appear verbatim and in the same order, e.g. "ex, one, two" inside "ex, one, two, three", but not "ex, three".
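To illustrate, `in` on strings checks for a contiguous substring, which is why out-of-order subsets are missed:

```python
# substring membership is order- and contiguity-sensitive
print('ex, one, two' in 'ex, one, two, three')  # True: contiguous match
print('ex, three' in 'ex, one, two, three')     # False: not contiguous
```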
Looking for help :)
CodePudding user response:
IIUC, you can convert the column to sets and, assuming the rows are sorted by decreasing C, check for each set whether it is a subset of any earlier (larger or equal) set. Finally, use the resulting boolean mask to slice the dataframe:
l = [set(s.split(', ')) for s in df['A']]
# [{'ex', 'one', 'three', 'two'},
# {'ex', 'one', 'two'},
# {'four', 'one', 'two'},
# {'ex', 'three'},
# {'ex', 'four'},
# {'four', 'one'}]
notsub = [not any(s.issubset(x) for x in l[:i]) for i,s in enumerate(l)]
# [True, False, True, False, True, False]
df2 = df[notsub]
output:
                     A   B  C
0  ex, one, two, three  X1  4
2       one, two, four  X3  3
4             four, ex  X5  2
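End to end, here is a minimal sketch that rebuilds the example frame and sorts by decreasing C first (the assumption above), so any subset can only appear after its potential superset:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['ex, one, two, three', 'ex, one, two', 'one, two, four',
          'ex, three', 'four, ex', 'four, one'],
    'B': ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'],
    'C': [4, 3, 3, 2, 2, 2],
})

# sort by decreasing C so each set only needs checking against earlier sets
df = df.sort_values('C', ascending=False)

l = [set(s.split(', ')) for s in df['A']]
notsub = [not any(s.issubset(x) for x in l[:i]) for i, s in enumerate(l)]
df2 = df[notsub]
print(df2)  # keeps rows X1, X3, X5
```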
If duplicated sets are possible

In that case you also need to compare sets of equal size; the easiest way is to use frozenset (which is hashable) together with pandas.Series.duplicated:
l = [frozenset(s.split(', ')) for s in df['A']]
notsub = [not any(s.issubset(x) for x in l[:i]) for i,s in enumerate(l)]
df2 = df[pd.Series(notsub, index=df.index) & ~pd.Series(l, index=df.index).duplicated()]
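As a quick sketch of the duplicate-safe variant, using a hypothetical frame where the same set appears twice under different orderings ('ex, one' vs 'one, ex'); only the first occurrence survives:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['ex, one', 'one, ex', 'ex'],
    'B': ['X1', 'X2', 'X3'],
    'C': [2, 2, 1],
})

# frozensets are hashable, so duplicated() can compare them
l = [frozenset(s.split(', ')) for s in df['A']]
notsub = [not any(s.issubset(x) for x in l[:i]) for i, s in enumerate(l)]
mask = pd.Series(notsub, index=df.index) & ~pd.Series(l, index=df.index).duplicated()
df2 = df[mask]
print(df2)  # keeps only row X1
```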