Home > Software engineering >  Pandas df groupby dates period multiple
Pandas df groupby dates period multiple

Time:03-11

I have a dataframe where I have columns date, createdAt, amount, date and created are dates and amount is float. Example:

1    2020-01, 2020-01,  35.50
2    2020-02, 2020-01,  35.50
3    2020-03, 2020-01,  35.50
4    2020-04, 2020-01,  35.50
5    2020-05, 2020-01,  35.50
6    2020-01, 2020-01,  35.50
7    2020-02, 2020-01,  35.50
8    2020-03, 2020-01,  35.50
9    2020-04, 2020-01,  35.50
10    2020-05, 2020-01,  35.50
11    2020-01, 2020-01,  35.50
12    2020-02, 2020-01,  35.50
.
.

My expected result is to group it so I would get something like:

1    2020-01, 2020-01,  426
2    2020-02, 2020-01,  426
3    2020-03, 2020-01,  426
4    2020-04, 2020-01,  426
5    2020-05, 2020-01,  426
6    2020-01, 2020-02,  426
7    2020-02, 2020-02,  426
8    2020-03, 2020-02,  426
9    2020-04, 2020-02,  426
10    2020-05, 2020-02,  426
11   2020-01, 2020-03,  426
12    2020-02, 2020-03,  426
13    2020-03, 2020-03,  426
14    2020-04, 2020-03,  426
15    2020-05, 2020-03,  426
.
.
.
 and other more data with more variation to amount, but two dates would always meet with each other at some point.

Basically my solution was to groupby date and createdAt and aggregate amount with sum.

So something like:

firststep = df.groupby(['date', "createdAt", ])
second_df = firststep.agg({'amount': 'sum'})
reset_df = second_df.reset_index()

But what I get is something like:

1    2020-01, 2020-01,  177.5
2    2020-01, 2020-02,  177.5
3    2020-01, 2020-03,  177.5
4    2020-01, 2020-04,  177.5
5    2020-01, 2020-05,  177.5
6    2020-02, 2020-02,  142
7    2020-02 2020-03,  142
8    2020-02, 2020-04,  142
9    2020-02, 2020-05,  142
10    2020-03, 2020-03,  106.5
11    2020-03, 2020-04,  106.5
12    2020-03, 2020-05,  106.5
.
.

My values was supposed to meet up with each other at some point but some groupings are missing, and its starting after the previous date. Like after 2020-01, 2020-05,the next row is 2020-02, 2020-02and not 2020-02, 2020-01

Im figuring out how to group it by the two columns and not losing some groupings. How do I get my desired output in a dataframe?

CodePudding user response:

If I understand you correctly and based upon your input and your desired output you should not group the dataframe based on both date and createdAt columns, rather you should group it just based on the createdAt:

df.groupby("createdAt").sum()

Output

createdAt amount
2020-01 426

CodePudding user response:

Try:

df["amount"] = df.groupby("createdAt")["amount"].transform("sum")

>>> df
       date createdAt  amount
0   2020-01   2020-01   426.0
1   2020-02   2020-01   426.0
2   2020-03   2020-01   426.0
3   2020-04   2020-01   426.0
4   2020-05   2020-01   426.0
5   2020-01   2020-01   426.0
6   2020-02   2020-01   426.0
7   2020-03   2020-01   426.0
8   2020-04   2020-01   426.0
9   2020-05   2020-01   426.0
10  2020-01   2020-01   426.0
11  2020-02   2020-01   426.0
  • Related