My dataframe has a column of lists and looks like this.
   id                     source
0   3            [nan, nan, nan]
1   5  [nan, foo, foo, nan, foo]
2   7       [ham, nan, ham, nan]
3   9                 [foo, foo]
I need to remove duplicates from each list, so I am looking for something like the output below.
   id      source
0   3       [nan]
1   5  [nan, foo]
2   7  [ham, nan]
3   9       [foo]
I tried the following code, which didn't work. What do you recommend?
df['source'] = list(set(df['source']))
CodePudding user response:
You can .explode the source column, .drop_duplicates, and .groupby back:
df = (
    df.explode("source")
    .drop_duplicates(["id", "source"])
    .groupby("id", as_index=False)
    .agg(list)
)
print(df)
Prints:
   id      source
0   3       [nan]
1   5  [nan, foo]
2   7  [ham, nan]
3   9       [foo]
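One subtlety worth noting: nan never compares equal to itself, which is why the set() attempt in the question can't collapse repeated NaNs even after fixing the row-wise application, but drop_duplicates hashes values and treats all NaNs as the same, so [nan, nan, nan] correctly reduces to [nan]. A minimal, self-contained reproduction of the pipeline above (the sample frame is reconstructed here with np.nan, which is an assumption about the original data):

```python
import numpy as np
import pandas as pd

# Reconstruct the sample frame from the question (assumes the nan
# entries are np.nan floats)
df = pd.DataFrame({
    "id": [3, 5, 7, 9],
    "source": [
        [np.nan, np.nan, np.nan],
        [np.nan, "foo", "foo", np.nan, "foo"],
        ["ham", np.nan, "ham", np.nan],
        ["foo", "foo"],
    ],
})

# One row per list element, drop repeats within each id, rebuild the lists
out = (
    df.explode("source")
    .drop_duplicates(["id", "source"])
    .groupby("id", as_index=False)
    .agg(list)
)
print(out)
```

drop_duplicates keeps the first occurrence, and groupby/agg preserves row order, so the order of first appearance within each list survives the round trip.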
Or convert each list to a pd.Series, drop duplicates, and convert back to a list:
df["source"] = df["source"].apply(lambda x: [*pd.Series(x).drop_duplicates()])
print(df)
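A further alternative, not in the original answer, is pd.unique, which also preserves first-seen order and collapses repeated NaNs, without building a Series per row (a sketch; the sample frame is reconstructed with np.nan, which is an assumption about the original data):

```python
import numpy as np
import pandas as pd

# Reconstruct the sample frame from the question (assumes np.nan entries)
df = pd.DataFrame({
    "id": [3, 5, 7, 9],
    "source": [
        [np.nan, np.nan, np.nan],
        [np.nan, "foo", "foo", np.nan, "foo"],
        ["ham", np.nan, "ham", np.nan],
        ["foo", "foo"],
    ],
})

# pd.unique keeps first-occurrence order and treats all NaNs as one value
df["source"] = df["source"].apply(lambda x: list(pd.unique(x)))
print(df)
```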