I would like to remove rows entirely when the value in a specific column, such as user,
is already present as an element of a list in another column. How can I best accomplish this? Thank you.
user friend
0 jack [mary, jane, alex]
1 mary [kate, andrew, jensen]
2 alice [marina, catherine, howard]
3 andrew [syp, yuslina, john]
4 catherine [yute, kelvin]
5 john [beyond, holand]
Expected Output
user friend
0 jack [mary, jane, alex]
1 alice [marina, catherine, howard]
2 andrew [syp, yuslina, john]
Explanation:
The row with mary as the user is removed because mary is present in a prior list.
The row with catherine as the user is removed because catherine is present in a prior list.
The row with john as the user is removed because john is present in a prior list.
CodePudding user response:
You can flatten the desired column into a single list without any nested lists. For this purpose you can use itertools.chain.from_iterable, and then filter with pandas.Series.isin.
(andrew appears in [kate, andrew, jensen], so this solution drops that row too.)
import itertools
df = df[~df['user'].isin(list(itertools.chain.from_iterable(df['friend'])))]
Output:
user friend
0 jack [mary, jane, alex]
2 alice [marina, catherine, howard]
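The same flattening can also be sketched with Series.explode instead of itertools. This is a swapped-in alternative, not the method used above, and it assumes a pandas version that provides explode (0.25 or later). The frame is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
    'friend': [['mary', 'jane', 'alex'],
               ['kate', 'andrew', 'jensen'],
               ['marina', 'catherine', 'howard'],
               ['syp', 'yuslina', 'john'],
               ['yute', 'kelvin'],
               ['beyond', 'holand']],
})

# explode() puts every list element on its own row,
# giving one flat Series of friend names
all_friends = df['friend'].explode()

# keep only the rows whose user never appears in any friend list
out = df[~df['user'].isin(all_friends)]
print(out)
```

As with the itertools version, only the jack and alice rows should remain, because every other user appears in some friend list.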
CodePudding user response:
Your example seems incorrect, as either john should be kept (blacklist is made of all previous friends), or andrew should be removed (blacklist is only the previous list of friends).
Here are different options.
Remove the row if the user is present in:
any set of friends
S = set().union(*df['friend'])
mask = ~df['user'].isin(S)
# [True, False, True, False, False, False]
df[mask]
output:
user friend
0 jack [mary, jane, alex]
2 alice [marina, catherine, howard]
all previous sets of friends
You can first compute an expanding set of friends, then check whether each user is in the set:
S = set()
# line below uses python ≥ 3.8, if older version use a classical loop
sets = [(S:=S.union(set(x))) for x in df['friend']]
mask = [u not in s for u,s in zip(df['user'], sets)]
# [True, False, True, False, False, False]
out = df[mask]
output:
user friend
0 jack [mary, jane, alex]
2 alice [marina, catherine, howard]
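For Python older than 3.8, where the := operator is not available, the expanding set can be built with the classical loop mentioned in the comment. A minimal sketch, with seen as an illustrative name and the frame rebuilt so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
    'friend': [['mary', 'jane', 'alex'],
               ['kate', 'andrew', 'jensen'],
               ['marina', 'catherine', 'howard'],
               ['syp', 'yuslina', 'john'],
               ['yute', 'kelvin'],
               ['beyond', 'holand']],
})

seen = set()   # all friends seen so far, including the current row's friends
mask = []
for user, friends in zip(df['user'], df['friend']):
    seen.update(friends)            # grow the set first, as the one-liner does
    mask.append(user not in seen)   # True means the row is kept

out = df[mask]
print(out)
```

Like the one-liner, this adds the current row's friends before testing the user, so a user who lists themselves as a friend would also be dropped.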
only previous set of friends
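# .apply(set) turns each friend list into a set; shift() pairs each user with the previous row's set
mask = [u not in s for u,s in zip(df['user'], df['friend'].apply(set).shift(fill_value={}))]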
# [True, False, True, True, True, True]
out = df[mask]
output:
user friend
0 jack [mary, jane, alex]
2 alice [marina, catherine, howard]
3 andrew [syp, yuslina, john]
4 catherine [yute, kelvin]
5 john [beyond, holand]
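One further reading does seem to reproduce the expected output from the question: only the friend lists of rows that are kept feed the blacklist, so a dropped row's friends are ignored. This is a guess at the intent, not something the question states, and the names blacklist and keep are illustrative. A minimal sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
    'friend': [['mary', 'jane', 'alex'],
               ['kate', 'andrew', 'jensen'],
               ['marina', 'catherine', 'howard'],
               ['syp', 'yuslina', 'john'],
               ['yute', 'kelvin'],
               ['beyond', 'holand']],
})

blacklist = set()   # friends of the rows kept so far (assumed rule)
keep = []
for user, friends in zip(df['user'], df['friend']):
    if user in blacklist:
        keep.append(False)          # row dropped; its friends are not recorded
    else:
        keep.append(True)
        blacklist.update(friends)   # only kept rows extend the blacklist

out = df[keep].reset_index(drop=True)
print(out)
```

Under this rule mary is dropped, so andrew never enters the blacklist and is kept, while catherine and john are dropped because alice and andrew were kept. The result has the users jack, alice, andrew.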
used input:
import pandas as pd

d = {'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
     'friend': [['mary', 'jane', 'alex'],
                ['kate', 'andrew', 'jensen'],
                ['marina', 'catherine', 'howard'],
                ['syp', 'yuslina', 'john'],
                ['yute', 'kelvin'],
                ['beyond', 'holand']]}
df = pd.DataFrame(d)