Remove rows when a column value is already present as an element of a list in another column

I would like to remove a row entirely when its value in a specific column, such as user, is already present as an element of a list in another column. How can I best accomplish this? Thank you.


    user          friend
0   jack         [mary, jane, alex]
1   mary         [kate, andrew, jensen]
2   alice        [marina, catherine, howard]
3   andrew       [syp, yuslina, john ] 
4   catherine    [yute, kelvin]
5   john         [beyond, holand]

Expected Output

    user          friend
0   jack         [mary, jane, alex]
1   alice        [marina, catherine, howard]
2   andrew       [syp, yuslina, john ]

Explanation =>

Row with mary as the user is removed because mary is present in the list prior.
Row with catherine as the user is removed because catherine is present in the list prior.
Row with john as the user is removed because john is present in the list prior.

CodePudding user response：

You can flatten the friend column into a single list with no nesting by using itertools.chain.from_iterable, then filter the users with pandas.Series.isin.

(andrew appears in [kate, andrew, jensen], so this solution drops that row too.)

import itertools
# flatten all friend lists into one list, then drop the users found in it
friends = list(itertools.chain.from_iterable(df['friend']))
df = df[~df['user'].isin(friends)]

Output:

    user                       friend
0   jack           [mary, jane, alex]
2  alice  [marina, catherine, howard]
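
As an alternative to itertools (a sketch, not the method used in the answer above), Series.explode, available since pandas 0.25, also flattens the lists by putting every element on its own row. The frame below rebuilds the question's data; the names all_friends and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
    'friend': [['mary', 'jane', 'alex'],
               ['kate', 'andrew', 'jensen'],
               ['marina', 'catherine', 'howard'],
               ['syp', 'yuslina', 'john'],
               ['yute', 'kelvin'],
               ['beyond', 'holand']],
})

# explode() yields one row per list element: a flat Series of every friend
all_friends = df['friend'].explode()

# keep only the users that never appear in any friend list
result = df[~df['user'].isin(all_friends)]
print(result['user'].tolist())  # ['jack', 'alice']
```

The result is the same two rows as above; which form to use is mostly a matter of taste.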

CodePudding user response：

Your example seems inconsistent: either john should be kept (if the blacklist is only the previous row's list of friends), or andrew should be removed (if the blacklist is made of all previous friends).

Here are different options.

Remove the row if the user is present in:

any set of friends

S = set().union(*df['friend'])

mask = ~df['user'].isin(S)
# [True, False, True, False, False, False]

df[mask]

output:

    user                       friend
0   jack           [mary, jane, alex]
2  alice  [marina, catherine, howard]

all previous sets of friends

You can first compute an expanding set of friends, then check whether each user is in the set:

S = set()
# the line below needs Python ≥ 3.8 (walrus operator); on older versions use a classic loop
sets = [(S:=S.union(set(x))) for x in df['friend']]

mask = [u not in s for u,s in zip(df['user'], sets)]
# [True, False, True, False, False, False]
out = df[mask]

output:

    user                       friend
0   jack           [mary, jane, alex]
2  alice  [marina, catherine, howard]
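
For Python versions older than 3.8, the classic loop mentioned in the comment could look like the sketch below. It assumes the same input as the rest of this answer; the names seen, user and friends are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
    'friend': [['mary', 'jane', 'alex'],
               ['kate', 'andrew', 'jensen'],
               ['marina', 'catherine', 'howard'],
               ['syp', 'yuslina', 'john'],
               ['yute', 'kelvin'],
               ['beyond', 'holand']],
})

seen = set()  # every friend met so far, including the current row's list
mask = []
for user, friends in zip(df['user'], df['friend']):
    seen.update(friends)            # grow the running set first
    mask.append(user not in seen)   # then test the current user against it

out = df[mask]
print(out['user'].tolist())  # ['jack', 'alice']
```

Like the comprehension, this adds the current row's friends before testing the user. To compare against strictly earlier rows only, swap the two lines inside the loop; for this data the result is the same.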

only previous set of friends

# apply(set) turns each list into a set; shift() moves the sets down one row
mask = [u not in s for u,s in zip(df['user'], df['friend'].apply(set).shift(fill_value={}))]
# [True, False, True, True, True, True]

out = df[mask]

output:

        user                       friend
0       jack           [mary, jane, alex]
2      alice  [marina, catherine, howard]
3     andrew         [syp, yuslina, john]
4  catherine               [yute, kelvin]
5       john             [beyond, holand]
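
The expected output in the question numbers its rows 0, 1, 2, whereas boolean indexing keeps the original labels. If the renumbering matters, reset_index(drop=True) can be chained onto any of the results. The sketch below also builds the "previous row only" mask with plain Python lists instead of shift; that is a swapped-in variant, and the names friend_sets and previous are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
    'friend': [['mary', 'jane', 'alex'],
               ['kate', 'andrew', 'jensen'],
               ['marina', 'catherine', 'howard'],
               ['syp', 'yuslina', 'john'],
               ['yute', 'kelvin'],
               ['beyond', 'holand']],
})

# one set per row, then the same sets moved down by one row
friend_sets = [set(f) for f in df['friend']]
previous = [set()] + friend_sets[:-1]   # the first row has no previous list

mask = [u not in p for u, p in zip(df['user'], previous)]

# drop=True discards the old labels instead of keeping them as a column
out = df[mask].reset_index(drop=True)
print(out.index.tolist())  # [0, 1, 2, 3, 4]
```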

used input:

import pandas as pd

d = {'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
     'friend': [['mary', 'jane', 'alex'],
                ['kate', 'andrew', 'jensen'],
                ['marina', 'catherine', 'howard'],
                ['syp', 'yuslina', 'john'],
                ['yute', 'kelvin'],
                ['beyond', 'holand']]}
df = pd.DataFrame(d)