Scenario: I am trying to multiple rows within a dataframe based on a set of conditions.
Dataframe example (this is import with Pandas from a xlsx file):
Model Scenario Region Variable Unit Year1 Year2 ... Year50
1 Base 1 GDP M USD 10 15 20
1 Base 2 GDP M USD 30 35 50
1 Base 3 GDP M USD 20 75 80
1 Stress 1 1 GDP % diff 0.48 0.11 0.31
1 Stress 1 2 GDP % diff 0.12 0.33 0.89
1 Stress 1 3 GDP % diff 0.76 0.54 0.08
1 Stress 2 1 GDP % diff 0.37 0.94 0.13
1 Stress 2 2 GDP % diff 0.73 0.76 0.35
1 Stress 2 3 GDP % diff 0.15 0.45 0.37
1 Stress 3 1 GDP % diff 0.49 0.14 0.37
1 Stress 3 2 GDP % diff 0.14 0.73 0.94
1 Stress 3 3 GDP % diff 0.96 0.26 0.85
Obs: Each of the stress scenarios are % changes over the base scenario (for the same region and variable). That means they are the base value * 1 the stress value.
Obs2: The original dataframe has more models, scenarios, regions and variables, but they are always the same (all models have the same set of scenarios, all scenarios have the same set of regions and so on).
Objective: Is to have the values for each row to be in the same unit as baseline. For that I need to perform the multiplication mentioned above.
The formula would be:
Model Scenario ... Year1 Year2 ... Year50
1 Stress 1 10*(1 0.48) 15*(1 0.11) 20*(1 0.31)
The output would be:
Model Scenario ... Year1 Year2 ... Year50
1 Stress 1 14.8 16.65 26.2
What I already tried: I am trying to use the df.loc function to find the matching values and use them to do the calculation:
test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Stress1') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))] = test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Base') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))] * (1 test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Stress1') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))])
Obs3: I can see 2 issues with this approach: First that I am not properly able to control the values on the "Year" columns. Second is that I am unsure how to perform this operation at once for the whole dataframe without having to create one line for each possible combination of model/scenario/region/variable available.
Question: Is there a way to perform this operation? If so, what would be the best way?
CodePudding user response:
First filter Base
DataFrame - in sample data is possible use only basic condition test_df['Scenario'] == 'Base'
and convert columns used for correct align anothr DataFrame - here 'Model','Region','Variable'
, Scenario, Unit
is different, so omitted and also are filtered columns with list:
years = [Year1,Year2,Year50]
df1 = (test_df[(test_df['Scenario'] == 'Base')]
.set_index(['Model','Region','Variable'])[years])
print (df1)
Year1 Year2 Year50
Model Region Variable
1 1 GDP 10.0 15.0 20.0
2 GDP 30.0 35.0 50.0
3 GDP 20.0 75.0 80.0
Similar way is used for df2
:
df2 = (test_df[(test_df['Unit'] == '% diff')]
.set_index(['Model','Scenario','Region','Variable','Unit'])[years])
print (df2)
Year1 Year2 Year50
Model Scenario Region Variable Unit
1 Stress 1 1 GDP % diff 0.48 0.11 0.31
2 GDP % diff 0.12 0.33 0.89
3 GDP % diff 0.76 0.54 0.08
Stress 2 1 GDP % diff 0.37 0.94 0.13
2 GDP % diff 0.73 0.76 0.35
3 GDP % diff 0.15 0.45 0.37
Stress 3 1 GDP % diff 0.49 0.14 0.37
2 GDP % diff 0.14 0.73 0.94
3 GDP % diff 0.96 0.26 0.85
Because match some levels in df1.index
with df2.index
and also are unique index in df1
is possible after add 1
to df1
multiple by df2
:
df = df2.add(1).mul(df1).reset_index()
print (df)
Model Region Variable Scenario Unit Year1 Year2 Year50
0 1 1 GDP Stress 1 % diff 14.8 16.65 26.2
1 1 1 GDP Stress 2 % diff 13.7 29.10 22.6
2 1 1 GDP Stress 3 % diff 14.9 17.10 27.4
3 1 2 GDP Stress 1 % diff 33.6 46.55 94.5
4 1 2 GDP Stress 2 % diff 51.9 61.60 67.5
5 1 2 GDP Stress 3 % diff 34.2 60.55 97.0
6 1 3 GDP Stress 1 % diff 35.2 115.50 86.4
7 1 3 GDP Stress 2 % diff 23.0 108.75 109.6
8 1 3 GDP Stress 3 % diff 39.2 94.50 148.0