let's say I have a dataframe A:
a0 a1 a2
1 1980 Y04
1 1983 Y08
1 1990 Y08
1 1980 Y08
1 1998 Y10
2 2003 Y02
3 1970 Y04
3 2009 Y10
3 1995 Y34
What I would like to do (in dask, but any suggestion in pandas is helpful as well) is a groupby on a0 that retains all the unique values contained in a1 and a2. In other words, I would like to obtain the following:
a0 newa_1 new_a2
1 [1980, 1983, 1990, 1998] [Y04, Y08, Y10]
2 [2003] [Y02]
3 [1970, 2009, 1995] [Y04, Y10, Y34]
Thank you
CodePudding user response:
Use agg with the 'unique' aggregation after groupby:
df.groupby('a0').agg(newa_1=('a1','unique'), newa_2=('a2','unique'))
Output:
newa_1 newa_2
a0
1 [1980, 1983, 1990, 1998] [Y04, Y08, Y10]
2 [2003] [Y02]
3 [1970, 2009, 1995] [Y04, Y10, Y34]
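Note that 'unique' returns numpy arrays in unspecified order. If plain, sorted Python lists are preferred, a lambda over each group works too; a sketch (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "a0": [1, 1, 1, 1, 1, 2, 3, 3, 3],
    "a1": [1980, 1983, 1990, 1980, 1998, 2003, 1970, 2009, 1995],
    "a2": ["Y04", "Y08", "Y08", "Y08", "Y10", "Y02", "Y04", "Y10", "Y34"],
})

# sorted(s.unique()) deduplicates each group's values and
# returns them as an ordinary sorted Python list
out = df.groupby("a0").agg(
    newa_1=("a1", lambda s: sorted(s.unique())),
    new_a2=("a2", lambda s: sorted(s.unique())),
).reset_index()
```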
CodePudding user response:
Use the 'unique' aggregation and reset the index to get a flat dataframe:
>>> df.groupby("a0").agg(newa_1=('a1',"unique"), new_a2=("a2","unique")).reset_index()
a0 newa_1 new_a2
0 1 [1980, 1983, 1990, 1998] [Y04, Y08, Y10]
1 2 [2003] [Y02]
2 3 [1970, 2009, 1995] [Y04, Y10, Y34]