Home > Software design >  Internally Multiply all pd.dataframe rows based on multiple conditions pt.2
Internally Multiply all pd.dataframe rows based on multiple conditions pt.2

Time:01-13

Scenario: This is a follow up to Internally Multiply all pd.dataframe rows based on multiple conditions For this, there is a difference in the dataset that prevents the indexing on the answer to the previous question to be used.

Dataframe example (this is imported with Pandas from a xlsx file):

Model    Scenario    Region  Variable     Unit    Year1    Year2    ...    Year50
  1        Base        1        GDP      M USD     10       15               20  
  1        Base        2        GDP      M USD     30       35               50  
  1        Base        3        GDP      M USD     20       75               80  
  1        Stress 1    1       GDP 1     % diff    0.48    0.11             0.31  
  1        Stress 1    2       GDP 2     % diff    0.12    0.33             0.89  
  1        Stress 1    3     GDP Comb    % diff    0.76    0.54             0.08  
  1        Stress 2    1       GDP 1     % diff    0.37    0.94             0.13  
  1        Stress 2    2       GDP 2     % diff    0.73    0.76             0.35  
  1        Stress 2    3     GDP Comb    % diff    0.15    0.45             0.37  
  1        Stress 3    1       GDP 1     % diff    0.49    0.14             0.37  
  1        Stress 3    2       GDP 2     % diff    0.14    0.73             0.94  
  1        Stress 3    3     GDP Comb    % diff    0.96    0.26             0.85  

Obs: Each of the stress scenarios are % changes over the base scenario (for the same region). Differently from the linked post, the names of the variables are not an exact match between the stress and base scenarios (the number of rows is also different between a given stress scenario and the base scenario). The formula is base value * (1 the stress value).

Obs2: The original dataframe has more models, scenarios, regions and variables, but they are always the same (all models have the same set of scenarios, all scenarios have the same set of regions and so on).

Objective: Is to have the values for each row to be in the same unit as baseline. For that I need to perform the multiplication mentioned above.

The formula would be:

Model    Scenario    ...    Year1          Year2    ...     Year50
  1      Stress 1        10*(1 0.48)    15*(1 0.11)        20*(1 0.31) 

The output would be:

Model    Scenario    ...    Year1          Year2    ...     Year50
  1      Stress 1           14.8           16.65             26.2 

What I already tried: I am trying to use the df.loc function to find the matching values and use them to do the calculation:

test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Stress1') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))] = test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Base') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))] * (1   test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Stress1') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))])

Obs3: Apparently this kind of multi conditional operation is not ideal, and an indexing based multiplication is better. This operation was the answer to the linked question.

Current solution: From the linked post, the solution is:

years = [Year1,Year2,Year50]
df1 = (test_df[(test_df['Scenario'] == 'Base')]
             .set_index(['Model','Region','Variable'])[years])
df2 = (test_df[(test_df['Unit'] == '% diff')]
             .set_index(['Model','Scenario','Region','Variable','Unit'])[years])
df = df2.add(1).mul(df1).reset_index()

Question: How can this snippet be improved to account for the mismatch in the "Variable" column between the scenarios?

CodePudding user response:

Problem is if create MultiIndex in df1 also by Variable it not match with Variable values in df2 and ouput is NaN:

years = ['Year1','Year2','Year50']

df1 = (test_df[(test_df['Scenario'] == 'Base')]
             .set_index(['Model','Region','Variable'])[years])
print (df1)
                       Year1  Year2  Year50
Model Region Variable                      
1     1      GDP        10.0   15.0    20.0
      2      GDP        30.0   35.0    50.0
      3      GDP        20.0   75.0    80.0
      
df2 = (test_df[(test_df['Unit'] == '% diff')]
             .set_index(['Model','Scenario','Region','Variable','Unit'])[years])
print (df2)
                                       Year1  Year2  Year50
Model Scenario Region Variable Unit                        
1     Stress 1 1      GDP 1    % diff   0.48   0.11    0.31
               2      GDP 2    % diff   0.12   0.33    0.89
               3      GDP Comb % diff   0.76   0.54    0.08
      Stress 2 1      GDP 1    % diff   0.37   0.94    0.13
               2      GDP 2    % diff   0.73   0.76    0.35
               3      GDP Comb % diff   0.15   0.45    0.37
      Stress 3 1      GDP 1    % diff   0.49   0.14    0.37
               2      GDP 2    % diff   0.14   0.73    0.94
               3      GDP Comb % diff   0.96   0.26    0.85

Solution is remove Variable column, so Model and Region match, so solution working well:

df1 = (test_df[(test_df['Scenario'] == 'Base')]
             .set_index(['Model','Region'])[years])
print (df1)
              Year1  Year2  Year50
Model Region                      
1     1        10.0   15.0    20.0
      2        30.0   35.0    50.0
      3        20.0   75.0    80.0

               
df = df2.add(1).mul(df1).reset_index()
print (df)
   Model  Region  Scenario  Variable    Unit  Year1   Year2  Year50
0      1       1  Stress 1     GDP 1  % diff   14.8   16.65    26.2
1      1       1  Stress 2     GDP 1  % diff   13.7   29.10    22.6
2      1       1  Stress 3     GDP 1  % diff   14.9   17.10    27.4
3      1       2  Stress 1     GDP 2  % diff   33.6   46.55    94.5
4      1       2  Stress 2     GDP 2  % diff   51.9   61.60    67.5
5      1       2  Stress 3     GDP 2  % diff   34.2   60.55    97.0
6      1       3  Stress 1  GDP Comb  % diff   35.2  115.50    86.4
7      1       3  Stress 2  GDP Comb  % diff   23.0  108.75   109.6
8      1       3  Stress 3  GDP Comb  % diff   39.2   94.50   148.0

Another idea is possible is create same valuse in Variable in both DataFrames:

years = ['Year1','Year2','Year50']

#get first 3 letters
test_df['Variable'] = test_df['Variable'].str[:3]
print (test_df)
    Model  Scenario  Region Variable    Unit  Year1  Year2  Year50
0       1      Base       1      GDP   M USD  10.00  15.00   20.00
1       1      Base       2      GDP   M USD  30.00  35.00   50.00
2       1      Base       3      GDP   M USD  20.00  75.00   80.00
3       1  Stress 1       1      GDP  % diff   0.48   0.11    0.31
4       1  Stress 1       2      GDP  % diff   0.12   0.33    0.89
5       1  Stress 1       3      GDP  % diff   0.76   0.54    0.08
6       1  Stress 2       1      GDP  % diff   0.37   0.94    0.13
7       1  Stress 2       2      GDP  % diff   0.73   0.76    0.35
8       1  Stress 2       3      GDP  % diff   0.15   0.45    0.37
9       1  Stress 3       1      GDP  % diff   0.49   0.14    0.37
10      1  Stress 3       2      GDP  % diff   0.14   0.73    0.94
11      1  Stress 3       3      GDP  % diff   0.96   0.26    0.85

df1 = (test_df[(test_df['Scenario'] == 'Base')]
             .set_index(['Model','Region','Variable'])[years])
print (df1)
                       Year1  Year2  Year50
Model Region Variable                      
1     1      GDP        10.0   15.0    20.0
      2      GDP        30.0   35.0    50.0
      3      GDP        20.0   75.0    80.0

df2 = (test_df[(test_df['Unit'] == '% diff')]
             .set_index(['Model','Scenario','Region','Variable','Unit'])[years])
print (df2)
                                       Year1  Year2  Year50
Model Scenario Region Variable Unit                        
1     Stress 1 1      GDP      % diff   0.48   0.11    0.31
               2      GDP      % diff   0.12   0.33    0.89
               3      GDP      % diff   0.76   0.54    0.08
      Stress 2 1      GDP      % diff   0.37   0.94    0.13
               2      GDP      % diff   0.73   0.76    0.35
               3      GDP      % diff   0.15   0.45    0.37
      Stress 3 1      GDP      % diff   0.49   0.14    0.37
               2      GDP      % diff   0.14   0.73    0.94
               3      GDP      % diff   0.96   0.26    0.85

df = df2.add(1).mul(df1).reset_index()
print (df)
   Model  Region Variable  Scenario    Unit  Year1   Year2  Year50
0      1       1      GDP  Stress 1  % diff   14.8   16.65    26.2
1      1       1      GDP  Stress 2  % diff   13.7   29.10    22.6
2      1       1      GDP  Stress 3  % diff   14.9   17.10    27.4
3      1       2      GDP  Stress 1  % diff   33.6   46.55    94.5
4      1       2      GDP  Stress 2  % diff   51.9   61.60    67.5
5      1       2      GDP  Stress 3  % diff   34.2   60.55    97.0
6      1       3      GDP  Stress 1  % diff   35.2  115.50    86.4
7      1       3      GDP  Stress 2  % diff   23.0  108.75   109.6
8      1       3      GDP  Stress 3  % diff   39.2   94.50   148.0

CodePudding user response:

Here's a way to do get the result your question describes:

years = [col for col in test_df.columns if col.startswith('Year')]
regionAndYears = ['Region']   years
df_base = test_df.loc[df.Scenario=='Base', regionAndYears].set_index('Region')
scenarioMask = test_df.Scenario != 'Base'
test_df.loc[scenarioMask, years] = ( test_df.loc[scenarioMask, regionAndYears]
    .apply(lambda row: (1   row.reindex(years)) * df_base.loc[row.Region,:], 
    axis=1) )

Output:

    Model  Scenario  Region  Variable    Unit  Year1   Year2  Year50
0       1      Base       1       GDP   M USD   10.0   15.00    20.0
1       1      Base       2       GDP   M USD   30.0   35.00    50.0
2       1      Base       3       GDP   M USD   20.0   75.00    80.0
3       1  Stress 1       1     GDP 1  % diff   14.8   16.65    26.2
4       1  Stress 1       2     GDP 2  % diff   33.6   46.55    94.5
5       1  Stress 1       3  GDP Comb  % diff   35.2  115.50    86.4
6       1  Stress 2       1     GDP 1  % diff   13.7   29.10    22.6
7       1  Stress 2       2     GDP 2  % diff   51.9   61.60    67.5
8       1  Stress 2       3  GDP Comb  % diff   23.0  108.75   109.6
9       1  Stress 3       1     GDP 1  % diff   14.9   17.10    27.4
10      1  Stress 3       2     GDP 2  % diff   34.2   60.55    97.0
11      1  Stress 3       3  GDP Comb  % diff   39.2   94.50   148.0

Explanation:

  • create a new dataframe df_base containing only the data in the columns Year1 .. Year50 and indexed by Region
  • for the non-Base Scenario rows, use apply() to perform the desired calculation on the values in the Base row in df_base with matching Region.

UPDATE:

As a generalization of the original answer above, in response to a comment by OP asking "would this work if there are more models in the dataset", the following code will handle multiple models:

baseKey = ['Model', 'Region']
years = [col for col in df.columns if col.startswith('Year')]
baseKeyAndYears = baseKey   years
df_base = df.loc[df.Scenario=='Base', baseKeyAndYears].set_index(baseKey)
scenarioMask = df.Scenario != 'Base'
df.loc[scenarioMask, years] = ( df.loc[scenarioMask, baseKeyAndYears]
    .apply(lambda row: (1   row.reindex(years)) * df_base.reindex(index=pd.MultiIndex.from_frame(row[baseKey].astype(int).to_frame().T)).squeeze(),
    axis=1) )

Sample input:

    Model  Scenario  Region  Variable    Unit   Year1   Year2  Year50
0       1      Base       1       GDP   M USD   10.00   15.00   20.00
1       1      Base       2       GDP   M USD   30.00   35.00   50.00
2       1      Base       3       GDP   M USD   20.00   75.00   80.00
3       1  Stress 1       1     GDP 1  % diff    0.48    0.11    0.31
4       1  Stress 1       2     GDP 2  % diff    0.12    0.33    0.89
5       1  Stress 1       3  GDP Comb  % diff    0.76    0.54    0.08
6       1  Stress 2       1     GDP 1  % diff    0.37    0.94    0.13
7       1  Stress 2       2     GDP 2  % diff    0.73    0.76    0.35
8       1  Stress 2       3  GDP Comb  % diff    0.15    0.45    0.37
9       1  Stress 3       1     GDP 1  % diff    0.49    0.14    0.37
10      1  Stress 3       2     GDP 2  % diff    0.14    0.73    0.94
11      1  Stress 3       3  GDP Comb  % diff    0.96    0.26    0.85
12      2      Base       1       GDP   M USD  100.00  150.00  200.00
13      2      Base       2       GDP   M USD  300.00  350.00  500.00
14      2      Base       3       GDP   M USD  200.00  750.00  800.00
15      2  Stress 1       1     GDP 1  % diff    0.48    0.11    0.31
16      2  Stress 1       2     GDP 2  % diff    0.12    0.33    0.89
17      2  Stress 1       3  GDP Comb  % diff    0.76    0.54    0.08
18      2  Stress 2       1     GDP 1  % diff    0.37    0.94    0.13
19      2  Stress 2       2     GDP 2  % diff    0.73    0.76    0.35
20      2  Stress 2       3  GDP Comb  % diff    0.15    0.45    0.37
21      2  Stress 3       1     GDP 1  % diff    0.49    0.14    0.37
22      2  Stress 3       2     GDP 2  % diff    0.14    0.73    0.94
23      2  Stress 3       3  GDP Comb  % diff    0.96    0.26    0.85

Sample output:

    Model  Scenario  Region  Variable    Unit  Year1    Year2  Year50
0       1      Base       1       GDP   M USD   10.0    15.00    20.0
1       1      Base       2       GDP   M USD   30.0    35.00    50.0
2       1      Base       3       GDP   M USD   20.0    75.00    80.0
3       1  Stress 1       1     GDP 1  % diff   14.8    16.65    26.2
4       1  Stress 1       2     GDP 2  % diff   33.6    46.55    94.5
5       1  Stress 1       3  GDP Comb  % diff   35.2   115.50    86.4
6       1  Stress 2       1     GDP 1  % diff   13.7    29.10    22.6
7       1  Stress 2       2     GDP 2  % diff   51.9    61.60    67.5
8       1  Stress 2       3  GDP Comb  % diff   23.0   108.75   109.6
9       1  Stress 3       1     GDP 1  % diff   14.9    17.10    27.4
10      1  Stress 3       2     GDP 2  % diff   34.2    60.55    97.0
11      1  Stress 3       3  GDP Comb  % diff   39.2    94.50   148.0
12      2      Base       1       GDP   M USD  100.0   150.00   200.0
13      2      Base       2       GDP   M USD  300.0   350.00   500.0
14      2      Base       3       GDP   M USD  200.0   750.00   800.0
15      2  Stress 1       1     GDP 1  % diff  148.0   166.50   262.0
16      2  Stress 1       2     GDP 2  % diff  336.0   465.50   945.0
17      2  Stress 1       3  GDP Comb  % diff  352.0  1155.00   864.0
18      2  Stress 2       1     GDP 1  % diff  137.0   291.00   226.0
19      2  Stress 2       2     GDP 2  % diff  519.0   616.00   675.0
20      2  Stress 2       3  GDP Comb  % diff  230.0  1087.50  1096.0
21      2  Stress 3       1     GDP 1  % diff  149.0   171.00   274.0
22      2  Stress 3       2     GDP 2  % diff  342.0   605.50   970.0
23      2  Stress 3       3  GDP Comb  % diff  392.0   945.00  1480.0
  • Related