I have a dataframe like this (note: C is the count of values in A):
A                    B   C
ex, one, two, three  X1  4
ex, one, two         X2  3
one, two, four       X3  3
ex, three            X4  2
four, ex             X5  2
four, one            X6  2
I want to delete all rows in which the value in column A is a subset of another value in column A. So the result should look like
A                    B   C
ex, one, two, three  X1  4
one, two, four       X3  3
four, ex             X5  2
I tried the following approach:
dataframe[np.sum(np.array([[y in x for x in dataframe.A.values] for y in dataframe.A.values]), 1) == 1]
However, `y in x` is a substring test, so this only catches values that appear verbatim and in the same order, e.g. "ex, one, two" inside "ex, one, two, three", but not "ex, three".
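To illustrate, `in` on strings checks for a contiguous substring, which is why out-of-order subsets are missed:

```python
# substring membership is order- and contiguity-sensitive
print('ex, one, two' in 'ex, one, two, three')  # True: contiguous match
print('ex, three' in 'ex, one, two, three')     # False: not contiguous
```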
Looking for help :)
CodePudding user response:
IIUC, you can convert the column to sets and, assuming the rows are sorted by decreasing C, check for each set whether it is a subset of any earlier (larger or equal) set. Finally, use the resulting boolean mask to slice the dataframe:
l = [set(s.split(', ')) for s in df['A']]
# [{'ex', 'one', 'three', 'two'},
# {'ex', 'one', 'two'},
# {'four', 'one', 'two'},
# {'ex', 'three'},
# {'ex', 'four'},
# {'four', 'one'}]
notsub = [not any(s.issubset(x) for x in l[:i]) for i,s in enumerate(l)]
# [True, False, True, False, True, False]
df2 = df[notsub]
output:
                     A   B  C
0  ex, one, two, three  X1  4
2       one, two, four  X3  3
4             four, ex  X5  2
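End to end, here is a minimal sketch that rebuilds the example frame and sorts by decreasing C first (the assumption above), so any subset can only appear after its potential superset:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['ex, one, two, three', 'ex, one, two', 'one, two, four',
          'ex, three', 'four, ex', 'four, one'],
    'B': ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'],
    'C': [4, 3, 3, 2, 2, 2],
})

# sort by decreasing C so each set only needs checking against earlier sets
df = df.sort_values('C', ascending=False)

l = [set(s.split(', ')) for s in df['A']]
notsub = [not any(s.issubset(x) for x in l[:i]) for i, s in enumerate(l)]
df2 = df[notsub]
print(df2)  # keeps rows X1, X3, X5
```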
If duplicated sets are possible

In that case you also need to compare sets of equal size; the easiest way is to use frozenset (which is hashable) together with pandas.Series.duplicated:
l = [frozenset(s.split(', ')) for s in df['A']]
notsub = [not any(s.issubset(x) for x in l[:i]) for i,s in enumerate(l)]
df2 = df[pd.Series(notsub, index=df.index) & ~pd.Series(l, index=df.index).duplicated()]
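As a quick sketch of the duplicate-safe variant, using a hypothetical frame where the same set appears twice under different orderings ('ex, one' vs 'one, ex'); only the first occurrence survives:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['ex, one', 'one, ex', 'ex'],
    'B': ['X1', 'X2', 'X3'],
    'C': [2, 2, 1],
})

# frozensets are hashable, so duplicated() can compare them
l = [frozenset(s.split(', ')) for s in df['A']]
notsub = [not any(s.issubset(x) for x in l[:i]) for i, s in enumerate(l)]
mask = pd.Series(notsub, index=df.index) & ~pd.Series(l, index=df.index).duplicated()
df2 = df[mask]
print(df2)  # keeps only row X1
```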