I have a DataFrame
df1
which contains time-series data with Date
, Location
, and Value
. I try to create df2
with a new column Relative_Value
, which is the value of the specific row devided by the sum of the values for a location on a specific date (groupby date and location).
For example:
- Date=20220101, Location=FE, Value=4: 4/(4 2 6 4 1) = 0.235
df1:
Date Location Value
0 20220101 FE 4
1 20220101 FE 2
2 20220101 FE 6
3 20220101 FE 4
4 20220101 FE 1
5 20220101 RP 4
6 20220101 RP 6
7 20220101 RP 4
8 20220102 FE 1
9 20220102 FE 3
10 20220102 FE 4
11 20220102 FE 1
12 20220102 RP 8
13 20220102 RP 4
14 20220102 RP 5
15 20220102 RP 9
df2:
Date Location Value Relativ_Value
0 20220101 FE 4 0.235 <-- example
1 20220101 FE 2 0.117
2 20220101 FE 6 0.352
3 20220101 FE 4 0.235
4 20220101 FE 1 0.058
5 20220101 RP 4 0.285
6 20220101 RP 6 0.428
7 20220101 RP 4 0.285
8 20220102 FE 1 0.111
9 20220102 FE 3 0.333
10 20220102 FE 4 0.444
11 20220102 FE 1 0.111
12 20220102 RP 8 0.307
13 20220102 RP 4 0.153
14 20220102 RP 5 0.192
15 20220102 RP 9 0.346
I tried it by first groupby the two columns and the apply function:
df2 = df1.groupby(['Date', 'Location']).apply(lambda x: x/sum(x))
What would be an efficient approach to create df2
?
For reproducability:
import pandas as pd
df1 = pd.DataFrame({
'Date':[20220101, 20220101, 20220101, 20220101, 20220101, 20220101, 20220101, 20220101, 20220102, 20220102, 20220102, 20220102, 20220102, 20220102, 20220102, 20220102],
'Location':['FE', 'FE', 'FE', 'FE', 'FE', 'RP', 'RP', 'RP', 'FE', 'FE', 'FE', 'FE', 'RP', 'RP', 'RP', 'RP'],
'Value':[4, 2, 6, 4, 1, 4, 6, 4, 1, 3, 4, 1, 8, 4, 5, 9]})
Many thanks!
CodePudding user response:
Use GroupBy.transform
for improve performance:
df2 = df.assign(Relativ_Value=df['Value']
.div(df1.groupby(['Date', 'Location'])['Value'].transform('sum'))