Groupby keeping all the information of a database-CodePudding

let's say I have a dataframe A:

a0 a1    a2
1  1980  Y04
1  1983  Y08
1  1990  Y08
1  1980  Y08
1  1998  Y10
2  2003  Y02
3  1970  Y04
3  2009  Y10
3  1995  Y34

what I would like to do (in dask, but any suggestion, in Pandas as well, is helpful) is to make a groupby a0 retaining all the unique information contained in a1 and a2. In other words I would like to obtain the following:

a0  newa_1                        new_a2
1  [1980, 1983, 1990, 1998]  [Y04, Y08, Y10]
2  [2003]                          [Y02]
3  [1970,2009, 1995]               [Y04]

Thank you

CodePudding user response：

Use agg(col_name, unique) after groupby:

df.groupby('a0').agg(newa_1=('a1','unique'), newa_2=('a2','unique'))

Output:

       newa_1                           newa_2
a0      
1   [1980, 1983, 1990, 1998]    [Y04, Y08, Y10]
2   [2003]                            [Y02]
3   [1970, 2009, 1995]           [Y04, Y10, Y34]

CodePudding user response：

Using the unique function and reset the index to get a dataframe.

>>> df.groupby("a0").agg(newa_1=('a1',"unique"), new_a2=("a2","unique")).reset_index()

    a0  newa_1                      new_a2
0   1   [1980, 1983, 1990, 1998]    [Y04, Y08, Y10]
1   2   [2003]                      [Y02]
2   3   [1970, 2009, 1995]          [Y04, Y10, Y34]