I have a dataframe which looks like this:
import pandas as pd
data1 = [['2020-10-01', '07-08', 3.0 ], ['2020-10-01', '08-09', 2.0], ['2020-10-01', '07-08', 3.0], ['2020-10-01', '07-08', 3.0],['2020-10-02', '07-08', 3.0 ], ['2020-10-02', '08-09', 3.0], ['2020-10-02', '07-08', 3.0], ['2020-10-02', '08-09', 3.0], ['2020-10-03', '09-10', 9.0], ['2020-10-03', '09-10', 9.0]]
df1 = pd.DataFrame(data1, columns = ['Date', 'TimeCategory', 'Value_TimeCategory_total'])
Date | TimeCategory | Value_TimeCategory_total |
---|---|---|
2020-10-01 | 07-08 | 3.0 |
2020-10-01 | 08-09 | 2.0 |
2020-10-01 | 07-08 | 3.0 |
2020-10-01 | 07-08 | 3.0 |
2020-10-02 | 07-08 | 3.0 |
2020-10-02 | 08-09 | 3.0 |
2020-10-02 | 07-08 | 3.0 |
2020-10-02 | 08-09 | 3.0 |
2020-10-03 | 09-10 | 9.0 |
2020-10-03 | 09-10 | 9.0 |
The dataframe contains the total value for each TimeCategory on each day.
Now I would like to add a column which shows the average value per row for each TimeCategory on each day.
For example, if there are 3 rows with the date 2020-10-01 and the TimeCategory 07-08, and the total value is 3.0, the average value should be 1.0.
The result should look like this:
data2 = [['2020-10-01', '07-08', 3.0 , 1.0], ['2020-10-01', '08-09', 2.0, 2.0], ['2020-10-01', '07-08', 3.0, 1.0], ['2020-10-01', '07-08', 3.0, 1.0],['2020-10-02', '07-08', 3.0, 1.5 ], ['2020-10-02', '08-09', 3.0, 1.5], ['2020-10-02', '07-08', 3.0, 1.5], ['2020-10-02', '08-09', 3.0, 1.5], ['2020-10-03', '09-10', 9.0, 4.5], ['2020-10-03', '09-10', 9.0, 4.5]]
df2 = pd.DataFrame(data2, columns = ['Date', 'TimeCategory', 'Value_TimeCategory_total' , 'Value_TimeCategory_Row_Average'])
df2
Date | TimeCategory | Value_TimeCategory_total | Value_TimeCategory_Row_Average |
---|---|---|---|
2020-10-01 | 07-08 | 3.0 | 1.0 |
2020-10-01 | 08-09 | 2.0 | 2.0 |
2020-10-01 | 07-08 | 3.0 | 1.0 |
2020-10-01 | 07-08 | 3.0 | 1.0 |
2020-10-02 | 07-08 | 3.0 | 1.5 |
2020-10-02 | 08-09 | 3.0 | 1.5 |
2020-10-02 | 07-08 | 3.0 | 1.5 |
2020-10-02 | 08-09 | 3.0 | 1.5 |
2020-10-03 | 09-10 | 9.0 | 4.5 |
2020-10-03 | 09-10 | 9.0 | 4.5 |
I do not want to use group by, because I need all rows (including duplicates) of my dataframe.
Many thanks in advance for your help.
CodePudding user response:
The idea is to divide the column Value_TimeCategory_total by the number of rows per group, computed with GroupBy.transform, which returns a Series of the same size as the original:
df1['Value_TimeCategory_Row_Average'] = (df1['Value_TimeCategory_total']
.div(df1.groupby(['Date','TimeCategory'])['Value_TimeCategory_total']
.transform('size')))
print (df1)
Date TimeCategory Value_TimeCategory_total \
0 2020-10-01 07-08 3.0
1 2020-10-01 08-09 2.0
2 2020-10-01 07-08 3.0
3 2020-10-01 07-08 3.0
4 2020-10-02 07-08 3.0
5 2020-10-02 08-09 3.0
6 2020-10-02 07-08 3.0
7 2020-10-02 08-09 3.0
8 2020-10-03 09-10 9.0
9 2020-10-03 09-10 9.0
Value_TimeCategory_Row_Average
0 1.0
1 2.0
2 1.0
3 1.0
4 1.5
5 1.5
6 1.5
7 1.5
8 4.5
9 4.5
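To see why no rows are lost, it can help to look at the intermediate counts on their own. This is a minimal sketch, assuming the df1 from the question (rebuilt here so the snippet runs standalone); the names keys, per_group and counts are only illustrative:

```python
import pandas as pd

data1 = [['2020-10-01', '07-08', 3.0], ['2020-10-01', '08-09', 2.0],
         ['2020-10-01', '07-08', 3.0], ['2020-10-01', '07-08', 3.0],
         ['2020-10-02', '07-08', 3.0], ['2020-10-02', '08-09', 3.0],
         ['2020-10-02', '07-08', 3.0], ['2020-10-02', '08-09', 3.0],
         ['2020-10-03', '09-10', 9.0], ['2020-10-03', '09-10', 9.0]]
df1 = pd.DataFrame(data1, columns=['Date', 'TimeCategory', 'Value_TimeCategory_total'])

keys = ['Date', 'TimeCategory']

# size() aggregates: one entry per (Date, TimeCategory) group
per_group = df1.groupby(keys).size()

# transform('size') broadcasts each group's size back onto every original row
counts = df1.groupby(keys)['Value_TimeCategory_total'].transform('size')

print(len(per_group))   # 5
print(len(counts))      # 10
print(counts.tolist())  # [3, 1, 3, 3, 2, 2, 2, 2, 2, 2]
```

Because counts keeps the original index, dividing the total column by it lines up row by row, duplicates included.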
Alternative solution:
df1['Value_TimeCategory_Row_Average'] = (df1.groupby(['Date','TimeCategory'])['Value_TimeCategory_total']
.transform(lambda x: x / len(x)))
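A quick way to convince yourself that both variants give the same column is to compare them against the expected values from the question. This is a sketch under the assumption that df1 is the question's dataframe (rebuilt here); via_size, via_lambda and expected are illustrative names:

```python
import pandas as pd

data1 = [['2020-10-01', '07-08', 3.0], ['2020-10-01', '08-09', 2.0],
         ['2020-10-01', '07-08', 3.0], ['2020-10-01', '07-08', 3.0],
         ['2020-10-02', '07-08', 3.0], ['2020-10-02', '08-09', 3.0],
         ['2020-10-02', '07-08', 3.0], ['2020-10-02', '08-09', 3.0],
         ['2020-10-03', '09-10', 9.0], ['2020-10-03', '09-10', 9.0]]
df1 = pd.DataFrame(data1, columns=['Date', 'TimeCategory', 'Value_TimeCategory_total'])

grouped = df1.groupby(['Date', 'TimeCategory'])['Value_TimeCategory_total']

# Variant 1: divide the column by the broadcast group sizes
via_size = df1['Value_TimeCategory_total'].div(grouped.transform('size'))

# Variant 2: let transform do the division group by group
via_lambda = grouped.transform(lambda x: x / len(x))

# Expected column taken from the question's df2
expected = [1.0, 2.0, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5, 4.5, 4.5]

print(via_size.tolist() == expected)    # True
print(via_lambda.tolist() == expected)  # True
```

The lambda variant calls a Python function once per group, so on data with many groups the first variant is likely to be faster.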
CodePudding user response:
So, grouped by Date and TimeCategory, the other cells have the same value within each group.
I think groupby is still helpful for what you need; you just need to combine it with assign:
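# Move the grouping keys into the index so the per-group result can be aligned with the rows
df2.set_index(["Date", "TimeCategory"], inplace=True)
# Per group: mean of the total divided by the number of rows, assigned back by index
df2 = df2.assign(Value_TimeCategory_Row_Average = df2.groupby(["Date", "TimeCategory"]).apply(lambda x:x["Value_TimeCategory_total"].mean() / len(x["Value_TimeCategory_total"])))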
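Note that this approach leaves Date and TimeCategory in the index. Below is a hedged sketch of the same idea applied to the question's df1 (an assumption, since the snippet above works on df2), with a reset_index at the end to get the original column layout back; per_group and result are illustrative names:

```python
import pandas as pd

data1 = [['2020-10-01', '07-08', 3.0], ['2020-10-01', '08-09', 2.0],
         ['2020-10-01', '07-08', 3.0], ['2020-10-01', '07-08', 3.0],
         ['2020-10-02', '07-08', 3.0], ['2020-10-02', '08-09', 3.0],
         ['2020-10-02', '07-08', 3.0], ['2020-10-02', '08-09', 3.0],
         ['2020-10-03', '09-10', 9.0], ['2020-10-03', '09-10', 9.0]]
df1 = pd.DataFrame(data1, columns=['Date', 'TimeCategory', 'Value_TimeCategory_total'])

keys = ['Date', 'TimeCategory']

# Work on an indexed copy so df1 itself stays untouched
indexed = df1.set_index(keys)

# One value per group: mean of the total divided by the group's row count
per_group = (indexed.groupby(keys)['Value_TimeCategory_total']
                    .apply(lambda x: x.mean() / len(x)))

# assign aligns per_group on the (Date, TimeCategory) index, so every
# duplicate row receives its group's value; reset_index restores the columns
result = indexed.assign(Value_TimeCategory_Row_Average=per_group).reset_index()

print(result)
```

The alignment relies on per_group having a unique index; the row order of df1 is preserved because set_index does not sort.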