I have the following pandas dataframe:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2], 'col': ['a', 'b', 'c', 'c', 'a', 'b', 'd']})
   id col
0   1   a
1   1   b
2   1   c
3   1   c
4   2   a
5   2   b
6   2   d
I would like to add a new column containing the list of unique values of col for each id.
The end dataframe would look like this:
pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2], 'col': ['a', 'b', 'c', 'c', 'a', 'b', 'd'],
              'col2': [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c'],
                       ['a', 'b', 'd'], ['a', 'b', 'd'], ['a', 'b', 'd']]})
   id col       col2
0   1   a  [a, b, c]
1   1   b  [a, b, c]
2   1   c  [a, b, c]
3   1   c  [a, b, c]
4   2   a  [a, b, d]
5   2   b  [a, b, d]
6   2   d  [a, b, d]
How could I do that?
CodePudding user response:
A possible solution with GroupBy.agg and reindex:
df["col2"] = (df.groupby("id")["col"].agg(lambda x: x.unique().tolist())
              .reindex(df["id"]).values)
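To spell out why this works: the agg step produces one list per id, reindex repeats each list once per row of df, and .values drops the resulting id index so the assignment is positional rather than index-aligned. A minimal self-contained sketch (the DataFrame is rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'col': ['a', 'b', 'c', 'c', 'a', 'b', 'd']})

# One list of unique values per id; the result is indexed by id.
per_id = df.groupby("id")["col"].agg(lambda x: x.unique().tolist())

# reindex repeats each list once per original row; .values strips the
# id index so the assignment back to df is positional.
df["col2"] = per_id.reindex(df["id"]).values
print(df["col2"].tolist())
```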
Output:
print(df)
   id col       col2
0   1   a  [a, b, c]
1   1   b  [a, b, c]
2   1   c  [a, b, c]
3   1   c  [a, b, c]
4   2   a  [a, b, d]
5   2   b  [a, b, d]
6   2   d  [a, b, d]
CodePudding user response:
To build lists with duplicates removed while preserving order, use the dict.fromkeys trick in a lambda function, then map the per-id result back to a new column with Series.map:
df['col2'] = df['id'].map(df.groupby('id')['col'].agg(lambda x: list(dict.fromkeys(x))))
Or remove duplicates first with DataFrame.drop_duplicates and then aggregate to lists:
df['col2'] = df['id'].map(df.drop_duplicates(['id','col']).groupby('id')['col'].agg(list))
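Both variants keep the order of first appearance. The dict.fromkeys trick relies on dict keys being unique and keeping insertion order (guaranteed since Python 3.7), so it acts as an ordered de-duplicator; a small illustration:

```python
vals = ['a', 'b', 'c', 'c', 'a']

# dict keys are unique and preserve insertion order, so converting
# back to a list de-duplicates while keeping first-appearance order.
print(list(dict.fromkeys(vals)))  # ['a', 'b', 'c']
```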
If ndarrays are acceptable instead of lists, use SeriesGroupBy.unique:
df['col2'] = df['id'].map(df.groupby('id')['col'].unique())
print(df)
   id col       col2
0   1   a  [a, b, c]
1   1   b  [a, b, c]
2   1   c  [a, b, c]
3   1   c  [a, b, c]
4   2   a  [a, b, d]
5   2   b  [a, b, d]
6   2   d  [a, b, d]
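Note that with the SeriesGroupBy.unique variant each cell holds a NumPy ndarray rather than a Python list; it prints the same but the stored type differs, which matters if later code expects list methods. A quick check:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'col': ['a', 'b', 'c', 'c', 'a', 'b', 'd']})

# groupby(...)['col'].unique() returns a Series of ndarrays indexed by id;
# map looks each row's id up in that Series.
df['col2'] = df['id'].map(df.groupby('id')['col'].unique())

# Each cell is an ndarray, not a list.
print(type(df.loc[0, 'col2']).__name__)  # ndarray
```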