I have a dataset like:
ymd | revenue |
---|---|
2022-07-17 | 2 |
2022-07-17 | 3 |
2022-07-18 | 4 |
2022-07-18 | 5 |
I want to convert into this format:
ymd | rev_sum |
---|---|
2022-07-17 | 5 |
2022-07-18 | 9 |
I've tried
data["rev_sum"] = sum(
data[data["ymd"] == "2022-07-17"]["revenue"]
)
data["rev_sum"] = sum(
data[data["ymd"] == "2022-07-18"]["revenue"]
)
But this one updates columns with the last date.
I've also tried :
data.loc[data["ymd"] == "2022-07-17"]["rev_sum"] = data.loc[data["ymd"] == "2022-07-17"]["revenue"].sum()
This time I got 'Try using .loc[row_indexer,col_indexer] = value instead' error message.
What should I do? Thank you for your help!
CodePudding user response:
You can use pd.pivot_table
with aggfunc='sum'
on column='revenue'
.
df = pd.pivot_table(df, index=['ymd'], values='revenue',
aggfunc='sum').reset_index()
print(df)
ymd revenue
0 2022-07-17 5
1 2022-07-18 9
CodePudding user response:
import pandas as pd
df = pd.DataFrame({'ymd': {0: '2022-07-17',
1: '2022-07-17',
2: '2022-07-18',
3: '2022-07-18'},
'revenue': {0: 2, 1: 3, 2: 4, 3: 5}})
#to sum values by date
sum_rev = df.groupby('ymd').agg({'revenue':'sum'})
print(sum_rev)
revenue
ymd
2022-07-17 5
2022-07-18 9
#if you want to create a column in existing df, use:
df['sum_rev'] = df.groupby('ymd')['revenue'].transform(sum)
print(df)
ymd revenue rev_sum
0 2022-07-17 2 5
1 2022-07-17 3 5
2 2022-07-18 4 9
3 2022-07-18 5 9