I have a dataframe where I have columns date
, createdAt
, amount
, date and created are dates and amount is float. Example:
1 2020-01, 2020-01, 35.50
2 2020-02, 2020-01, 35.50
3 2020-03, 2020-01, 35.50
4 2020-04, 2020-01, 35.50
5 2020-05, 2020-01, 35.50
6 2020-01, 2020-01, 35.50
7 2020-02, 2020-01, 35.50
8 2020-03, 2020-01, 35.50
9 2020-04, 2020-01, 35.50
10 2020-05, 2020-01, 35.50
11 2020-01, 2020-01, 35.50
12 2020-02, 2020-01, 35.50
.
.
My expected result is to group it so I would get something like:
1 2020-01, 2020-01, 426
2 2020-02, 2020-01, 426
3 2020-03, 2020-01, 426
4 2020-04, 2020-01, 426
5 2020-05, 2020-01, 426
6 2020-01, 2020-02, 426
7 2020-02, 2020-02, 426
8 2020-03, 2020-02, 426
9 2020-04, 2020-02, 426
10 2020-05, 2020-02, 426
11 2020-01, 2020-03, 426
12 2020-02, 2020-03, 426
13 2020-03, 2020-03, 426
14 2020-04, 2020-03, 426
15 2020-05, 2020-03, 426
.
.
.
and other more data with more variation to amount, but two dates would always meet with each other at some point.
Basically my solution was to groupby date
and createdAt
and aggregate amount
with sum.
So something like:
firststep = df.groupby(['date', "createdAt", ])
second_df = firststep.agg({'amount': 'sum'})
reset_df = second_df.reset_index()
But what I get is something like:
1 2020-01, 2020-01, 177.5
2 2020-01, 2020-02, 177.5
3 2020-01, 2020-03, 177.5
4 2020-01, 2020-04, 177.5
5 2020-01, 2020-05, 177.5
6 2020-02, 2020-02, 142
7 2020-02 2020-03, 142
8 2020-02, 2020-04, 142
9 2020-02, 2020-05, 142
10 2020-03, 2020-03, 106.5
11 2020-03, 2020-04, 106.5
12 2020-03, 2020-05, 106.5
.
.
My values was supposed to meet up with each other at some point but some groupings are missing, and its starting after the previous date. Like after 2020-01, 2020-05
,the next row is 2020-02, 2020-02
and not 2020-02, 2020-01
Im figuring out how to group it by the two columns and not losing some groupings. How do I get my desired output in a dataframe?
CodePudding user response:
If I understand you correctly and based upon your input and your desired output you should not group the dataframe based on both date
and createdAt
columns, rather you should group it just based on the createdAt
:
df.groupby("createdAt").sum()
Output
createdAt | amount |
---|---|
2020-01 | 426 |
CodePudding user response:
Try:
df["amount"] = df.groupby("createdAt")["amount"].transform("sum")
>>> df
date createdAt amount
0 2020-01 2020-01 426.0
1 2020-02 2020-01 426.0
2 2020-03 2020-01 426.0
3 2020-04 2020-01 426.0
4 2020-05 2020-01 426.0
5 2020-01 2020-01 426.0
6 2020-02 2020-01 426.0
7 2020-03 2020-01 426.0
8 2020-04 2020-01 426.0
9 2020-05 2020-01 426.0
10 2020-01 2020-01 426.0
11 2020-02 2020-01 426.0