Scenario: This is a follow up to Internally Multiply all pd.dataframe rows based on multiple conditions For this, there is a difference in the dataset that prevents the indexing on the answer to the previous question to be used.
Dataframe example (this is imported with Pandas from a xlsx file):
Model Scenario Region Variable Unit Year1 Year2 ... Year50
1 Base 1 GDP M USD 10 15 20
1 Base 2 GDP M USD 30 35 50
1 Base 3 GDP M USD 20 75 80
1 Stress 1 1 GDP 1 % diff 0.48 0.11 0.31
1 Stress 1 2 GDP 2 % diff 0.12 0.33 0.89
1 Stress 1 3 GDP Comb % diff 0.76 0.54 0.08
1 Stress 2 1 GDP 1 % diff 0.37 0.94 0.13
1 Stress 2 2 GDP 2 % diff 0.73 0.76 0.35
1 Stress 2 3 GDP Comb % diff 0.15 0.45 0.37
1 Stress 3 1 GDP 1 % diff 0.49 0.14 0.37
1 Stress 3 2 GDP 2 % diff 0.14 0.73 0.94
1 Stress 3 3 GDP Comb % diff 0.96 0.26 0.85
Obs: Each of the stress scenarios are % changes over the base scenario (for the same region). Differently from the linked post, the names of the variables are not an exact match between the stress and base scenarios (the number of rows is also different between a given stress scenario and the base scenario). The formula is base value * (1 the stress value).
Obs2: The original dataframe has more models, scenarios, regions and variables, but they are always the same (all models have the same set of scenarios, all scenarios have the same set of regions and so on).
Objective: Is to have the values for each row to be in the same unit as baseline. For that I need to perform the multiplication mentioned above.
The formula would be:
Model Scenario ... Year1 Year2 ... Year50
1 Stress 1 10*(1 0.48) 15*(1 0.11) 20*(1 0.31)
The output would be:
Model Scenario ... Year1 Year2 ... Year50
1 Stress 1 14.8 16.65 26.2
What I already tried: I am trying to use the df.loc function to find the matching values and use them to do the calculation:
test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Stress1') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))] = test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Base') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))] * (1 test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Stress1') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))])
Obs3: Apparently this kind of multi conditional operation is not ideal, and an indexing based multiplication is better. This operation was the answer to the linked question.
Current solution: From the linked post, the solution is:
years = [Year1,Year2,Year50]
df1 = (test_df[(test_df['Scenario'] == 'Base')]
.set_index(['Model','Region','Variable'])[years])
df2 = (test_df[(test_df['Unit'] == '% diff')]
.set_index(['Model','Scenario','Region','Variable','Unit'])[years])
df = df2.add(1).mul(df1).reset_index()
Question: How can this snippet be improved to account for the mismatch in the "Variable" column between the scenarios?
CodePudding user response:
Problem is if create MultiIndex in df1
also by Variable
it not match with Variable
values in df2
and ouput is NaN
:
years = ['Year1','Year2','Year50']
df1 = (test_df[(test_df['Scenario'] == 'Base')]
.set_index(['Model','Region','Variable'])[years])
print (df1)
Year1 Year2 Year50
Model Region Variable
1 1 GDP 10.0 15.0 20.0
2 GDP 30.0 35.0 50.0
3 GDP 20.0 75.0 80.0
df2 = (test_df[(test_df['Unit'] == '% diff')]
.set_index(['Model','Scenario','Region','Variable','Unit'])[years])
print (df2)
Year1 Year2 Year50
Model Scenario Region Variable Unit
1 Stress 1 1 GDP 1 % diff 0.48 0.11 0.31
2 GDP 2 % diff 0.12 0.33 0.89
3 GDP Comb % diff 0.76 0.54 0.08
Stress 2 1 GDP 1 % diff 0.37 0.94 0.13
2 GDP 2 % diff 0.73 0.76 0.35
3 GDP Comb % diff 0.15 0.45 0.37
Stress 3 1 GDP 1 % diff 0.49 0.14 0.37
2 GDP 2 % diff 0.14 0.73 0.94
3 GDP Comb % diff 0.96 0.26 0.85
Solution is remove Variable
column, so Model
and Region
match, so solution working well:
df1 = (test_df[(test_df['Scenario'] == 'Base')]
.set_index(['Model','Region'])[years])
print (df1)
Year1 Year2 Year50
Model Region
1 1 10.0 15.0 20.0
2 30.0 35.0 50.0
3 20.0 75.0 80.0
df = df2.add(1).mul(df1).reset_index()
print (df)
Model Region Scenario Variable Unit Year1 Year2 Year50
0 1 1 Stress 1 GDP 1 % diff 14.8 16.65 26.2
1 1 1 Stress 2 GDP 1 % diff 13.7 29.10 22.6
2 1 1 Stress 3 GDP 1 % diff 14.9 17.10 27.4
3 1 2 Stress 1 GDP 2 % diff 33.6 46.55 94.5
4 1 2 Stress 2 GDP 2 % diff 51.9 61.60 67.5
5 1 2 Stress 3 GDP 2 % diff 34.2 60.55 97.0
6 1 3 Stress 1 GDP Comb % diff 35.2 115.50 86.4
7 1 3 Stress 2 GDP Comb % diff 23.0 108.75 109.6
8 1 3 Stress 3 GDP Comb % diff 39.2 94.50 148.0
Another idea is possible is create same valuse in Variable
in both DataFrames:
years = ['Year1','Year2','Year50']
#get first 3 letters
test_df['Variable'] = test_df['Variable'].str[:3]
print (test_df)
Model Scenario Region Variable Unit Year1 Year2 Year50
0 1 Base 1 GDP M USD 10.00 15.00 20.00
1 1 Base 2 GDP M USD 30.00 35.00 50.00
2 1 Base 3 GDP M USD 20.00 75.00 80.00
3 1 Stress 1 1 GDP % diff 0.48 0.11 0.31
4 1 Stress 1 2 GDP % diff 0.12 0.33 0.89
5 1 Stress 1 3 GDP % diff 0.76 0.54 0.08
6 1 Stress 2 1 GDP % diff 0.37 0.94 0.13
7 1 Stress 2 2 GDP % diff 0.73 0.76 0.35
8 1 Stress 2 3 GDP % diff 0.15 0.45 0.37
9 1 Stress 3 1 GDP % diff 0.49 0.14 0.37
10 1 Stress 3 2 GDP % diff 0.14 0.73 0.94
11 1 Stress 3 3 GDP % diff 0.96 0.26 0.85
df1 = (test_df[(test_df['Scenario'] == 'Base')]
.set_index(['Model','Region','Variable'])[years])
print (df1)
Year1 Year2 Year50
Model Region Variable
1 1 GDP 10.0 15.0 20.0
2 GDP 30.0 35.0 50.0
3 GDP 20.0 75.0 80.0
df2 = (test_df[(test_df['Unit'] == '% diff')]
.set_index(['Model','Scenario','Region','Variable','Unit'])[years])
print (df2)
Year1 Year2 Year50
Model Scenario Region Variable Unit
1 Stress 1 1 GDP % diff 0.48 0.11 0.31
2 GDP % diff 0.12 0.33 0.89
3 GDP % diff 0.76 0.54 0.08
Stress 2 1 GDP % diff 0.37 0.94 0.13
2 GDP % diff 0.73 0.76 0.35
3 GDP % diff 0.15 0.45 0.37
Stress 3 1 GDP % diff 0.49 0.14 0.37
2 GDP % diff 0.14 0.73 0.94
3 GDP % diff 0.96 0.26 0.85
df = df2.add(1).mul(df1).reset_index()
print (df)
Model Region Variable Scenario Unit Year1 Year2 Year50
0 1 1 GDP Stress 1 % diff 14.8 16.65 26.2
1 1 1 GDP Stress 2 % diff 13.7 29.10 22.6
2 1 1 GDP Stress 3 % diff 14.9 17.10 27.4
3 1 2 GDP Stress 1 % diff 33.6 46.55 94.5
4 1 2 GDP Stress 2 % diff 51.9 61.60 67.5
5 1 2 GDP Stress 3 % diff 34.2 60.55 97.0
6 1 3 GDP Stress 1 % diff 35.2 115.50 86.4
7 1 3 GDP Stress 2 % diff 23.0 108.75 109.6
8 1 3 GDP Stress 3 % diff 39.2 94.50 148.0
CodePudding user response:
Here's a way to do get the result your question describes:
years = [col for col in test_df.columns if col.startswith('Year')]
regionAndYears = ['Region'] years
df_base = test_df.loc[df.Scenario=='Base', regionAndYears].set_index('Region')
scenarioMask = test_df.Scenario != 'Base'
test_df.loc[scenarioMask, years] = ( test_df.loc[scenarioMask, regionAndYears]
.apply(lambda row: (1 row.reindex(years)) * df_base.loc[row.Region,:],
axis=1) )
Output:
Model Scenario Region Variable Unit Year1 Year2 Year50
0 1 Base 1 GDP M USD 10.0 15.00 20.0
1 1 Base 2 GDP M USD 30.0 35.00 50.0
2 1 Base 3 GDP M USD 20.0 75.00 80.0
3 1 Stress 1 1 GDP 1 % diff 14.8 16.65 26.2
4 1 Stress 1 2 GDP 2 % diff 33.6 46.55 94.5
5 1 Stress 1 3 GDP Comb % diff 35.2 115.50 86.4
6 1 Stress 2 1 GDP 1 % diff 13.7 29.10 22.6
7 1 Stress 2 2 GDP 2 % diff 51.9 61.60 67.5
8 1 Stress 2 3 GDP Comb % diff 23.0 108.75 109.6
9 1 Stress 3 1 GDP 1 % diff 14.9 17.10 27.4
10 1 Stress 3 2 GDP 2 % diff 34.2 60.55 97.0
11 1 Stress 3 3 GDP Comb % diff 39.2 94.50 148.0
Explanation:
- create a new dataframe
df_base
containing only the data in the columnsYear1 .. Year50
and indexed byRegion
- for the non-Base
Scenario
rows, useapply()
to perform the desired calculation on the values in the Base row indf_base
with matchingRegion
.
UPDATE:
As a generalization of the original answer above, in response to a comment by OP asking "would this work if there are more models in the dataset", the following code will handle multiple models:
baseKey = ['Model', 'Region']
years = [col for col in df.columns if col.startswith('Year')]
baseKeyAndYears = baseKey years
df_base = df.loc[df.Scenario=='Base', baseKeyAndYears].set_index(baseKey)
scenarioMask = df.Scenario != 'Base'
df.loc[scenarioMask, years] = ( df.loc[scenarioMask, baseKeyAndYears]
.apply(lambda row: (1 row.reindex(years)) * df_base.reindex(index=pd.MultiIndex.from_frame(row[baseKey].astype(int).to_frame().T)).squeeze(),
axis=1) )
Sample input:
Model Scenario Region Variable Unit Year1 Year2 Year50
0 1 Base 1 GDP M USD 10.00 15.00 20.00
1 1 Base 2 GDP M USD 30.00 35.00 50.00
2 1 Base 3 GDP M USD 20.00 75.00 80.00
3 1 Stress 1 1 GDP 1 % diff 0.48 0.11 0.31
4 1 Stress 1 2 GDP 2 % diff 0.12 0.33 0.89
5 1 Stress 1 3 GDP Comb % diff 0.76 0.54 0.08
6 1 Stress 2 1 GDP 1 % diff 0.37 0.94 0.13
7 1 Stress 2 2 GDP 2 % diff 0.73 0.76 0.35
8 1 Stress 2 3 GDP Comb % diff 0.15 0.45 0.37
9 1 Stress 3 1 GDP 1 % diff 0.49 0.14 0.37
10 1 Stress 3 2 GDP 2 % diff 0.14 0.73 0.94
11 1 Stress 3 3 GDP Comb % diff 0.96 0.26 0.85
12 2 Base 1 GDP M USD 100.00 150.00 200.00
13 2 Base 2 GDP M USD 300.00 350.00 500.00
14 2 Base 3 GDP M USD 200.00 750.00 800.00
15 2 Stress 1 1 GDP 1 % diff 0.48 0.11 0.31
16 2 Stress 1 2 GDP 2 % diff 0.12 0.33 0.89
17 2 Stress 1 3 GDP Comb % diff 0.76 0.54 0.08
18 2 Stress 2 1 GDP 1 % diff 0.37 0.94 0.13
19 2 Stress 2 2 GDP 2 % diff 0.73 0.76 0.35
20 2 Stress 2 3 GDP Comb % diff 0.15 0.45 0.37
21 2 Stress 3 1 GDP 1 % diff 0.49 0.14 0.37
22 2 Stress 3 2 GDP 2 % diff 0.14 0.73 0.94
23 2 Stress 3 3 GDP Comb % diff 0.96 0.26 0.85
Sample output:
Model Scenario Region Variable Unit Year1 Year2 Year50
0 1 Base 1 GDP M USD 10.0 15.00 20.0
1 1 Base 2 GDP M USD 30.0 35.00 50.0
2 1 Base 3 GDP M USD 20.0 75.00 80.0
3 1 Stress 1 1 GDP 1 % diff 14.8 16.65 26.2
4 1 Stress 1 2 GDP 2 % diff 33.6 46.55 94.5
5 1 Stress 1 3 GDP Comb % diff 35.2 115.50 86.4
6 1 Stress 2 1 GDP 1 % diff 13.7 29.10 22.6
7 1 Stress 2 2 GDP 2 % diff 51.9 61.60 67.5
8 1 Stress 2 3 GDP Comb % diff 23.0 108.75 109.6
9 1 Stress 3 1 GDP 1 % diff 14.9 17.10 27.4
10 1 Stress 3 2 GDP 2 % diff 34.2 60.55 97.0
11 1 Stress 3 3 GDP Comb % diff 39.2 94.50 148.0
12 2 Base 1 GDP M USD 100.0 150.00 200.0
13 2 Base 2 GDP M USD 300.0 350.00 500.0
14 2 Base 3 GDP M USD 200.0 750.00 800.0
15 2 Stress 1 1 GDP 1 % diff 148.0 166.50 262.0
16 2 Stress 1 2 GDP 2 % diff 336.0 465.50 945.0
17 2 Stress 1 3 GDP Comb % diff 352.0 1155.00 864.0
18 2 Stress 2 1 GDP 1 % diff 137.0 291.00 226.0
19 2 Stress 2 2 GDP 2 % diff 519.0 616.00 675.0
20 2 Stress 2 3 GDP Comb % diff 230.0 1087.50 1096.0
21 2 Stress 3 1 GDP 1 % diff 149.0 171.00 274.0
22 2 Stress 3 2 GDP 2 % diff 342.0 605.50 970.0
23 2 Stress 3 3 GDP Comb % diff 392.0 945.00 1480.0