Home > Software engineering >  Group by month, sum rows based in column, and keep the other columns
Group by month, sum rows based in column, and keep the other columns

Time:03-30

I have a DataFrame df as follows:

|size    | date        | name | type     | revenue |
|10      | 13/12/2021  | A    | Standard | 0,2     |
|248743  | 15/12/2021  | A    | Standard | 0,2     |
|234     | 03/12/2022  | A    | Basic    | 0,1     |
|8734684 | 31/03/2022  | B    | Basic    | 0,1     |
|3589749 | 01/04/2021  | C    | Basic    | 0,4     |
|3356943 | 02/04/2021  | A    | Basic    | 0,1     |
|6908746 | 21/04/2021  | A    | Basic    | 0,1     |
|2375940 | 21/02/2022  | D    | Premium  | 0,7     |
|9387295 | 21/02/2022  | D    | Premium  | 0,7     |
|286432  | 21/02/2022  | D    | Premium  | 0,7     |
|192     | 31/03/2022  | D    | Premium  | 0,7     |
|486     | 18/02/2022  | E    | Standard | 0,9     |
|23847   | 24/10/2021  | F    | Basic    | 0,3     |
|82346   | 12/11/2021  | B    | Premium  | 0,5     |
|28352   | 03/01/2022  | A    | Basic    | 0,1     |

I need to group by month with the size sum for rows which name and type are the same:

|size    | date | name | type     | revenue |
|28352   | Jan  | A    | Basic    | 0,1     |
|486     | Feb  | E    | Standard | 0,9     |
|12049667| Feb  | D    | Premium  | 0,7     |
|192     | Mar  | D    | Premium  | 0,7     |
|8734684 | Mar  | B    | Basic    | 0,1     |
|3589749 | Apr  | C    | Basic    | 0,4     |
|10265689| Apr  | A    | Basic    | 0,1     |
|23847   | Oct  | F    | Basic    | 0,3     |
|82346   | Nov  | B    | Premium  | 0,5     |
|248753  | Dec  | A    | Standard | 0,2     |
|234     | Dec  | A    | Basic    | 0,1     |

I tried this code but it did not work:

df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.strftime('%B'))['size'].sum()
df2 = df1.groupby(['date', 'name', 'type', 'revenue'],as_index=False).sum()

How can I do it?

CodePudding user response:

IIUC, you need a single groupby. You need to rework your "revenue" column as numeric.

df['date'] = pd.to_datetime(df['date'], dayfirst=True)

group = df['date'].dt.strftime('%b')

(df.assign(revenue=pd.to_numeric(df['revenue'].str.replace(',', '.')))
   .groupby([group, 'name', 'type'])
   .agg('sum')
   .reset_index()
 )

Output:

   date name      type      size  revenue
0   Apr    A     Basic   6908746      0.1
1   Dec    A  Standard    248753      0.4
2   Dec    B   Premium     82346      0.5
3   Feb    A     Basic   3356943      0.1
4   Feb    D   Premium  12049667      2.1
5   Feb    E  Standard       486      0.9
6   Jan    C     Basic   3589749      0.4
7   Mar    A     Basic     28586      0.2
8   Mar    B     Basic   8734684      0.1
9   Mar    D   Premium       192      0.7
10  Oct    F     Basic     23847      0.3

Note that the above is aggregating months of different years into the same group. If you want to keep years separate, use a period:

group = df['date'].dt.to_period('M')

Output:

       date name      type      size  revenue
0   2021-01    C     Basic   3589749      0.4
1   2021-02    A     Basic   3356943      0.1
2   2021-04    A     Basic   6908746      0.1
3   2021-10    F     Basic     23847      0.3
4   2021-12    A  Standard    248753      0.4
5   2021-12    B   Premium     82346      0.5
6   2022-02    D   Premium  12049667      2.1
7   2022-02    E  Standard       486      0.9
8   2022-03    A     Basic     28586      0.2
9   2022-03    B     Basic   8734684      0.1
10  2022-03    D   Premium       192      0.7
  • Related