Internally multiply all pd.DataFrame rows based on multiple conditions

Scenario: I am trying to multiply rows within a dataframe based on a set of conditions.

Dataframe example (imported with pandas from an xlsx file):

Model    Scenario    Region    Variable    Unit    Year1    Year2    ...    Year50
  1        Base        1         GDP      M USD     10       15               20  
  1        Base        2         GDP      M USD     30       35               50  
  1        Base        3         GDP      M USD     20       75               80  
  1        Stress 1    1         GDP      % diff    0.48    0.11             0.31  
  1        Stress 1    2         GDP      % diff    0.12    0.33             0.89  
  1        Stress 1    3         GDP      % diff    0.76    0.54             0.08  
  1        Stress 2    1         GDP      % diff    0.37    0.94             0.13  
  1        Stress 2    2         GDP      % diff    0.73    0.76             0.35  
  1        Stress 2    3         GDP      % diff    0.15    0.45             0.37  
  1        Stress 3    1         GDP      % diff    0.49    0.14             0.37  
  1        Stress 3    2         GDP      % diff    0.14    0.73             0.94  
  1        Stress 3    3         GDP      % diff    0.96    0.26             0.85

Obs: Each stress scenario is a % change over the base scenario (for the same region and variable). That means the stressed value is the base value * (1 + the stress value).

Obs2: The original dataframe has more models, scenarios, regions and variables, but they are always the same (all models have the same set of scenarios, all scenarios have the same set of regions and so on).

Objective: Have the values in every row expressed in the same unit as the baseline. For that I need to perform the multiplication mentioned above.

The formula would be:

Model    Scenario    ...    Year1          Year2    ...     Year50
  1      Stress 1        10*(1 + 0.48)  15*(1 + 0.11)      20*(1 + 0.31)

The output would be:

Model    Scenario    ...    Year1          Year2    ...     Year50
  1      Stress 1           14.8           16.65             26.2
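The arithmetic for that single row can be checked with a minimal sketch. The numbers are copied from the sample rows for Model 1, Region 1, GDP; the list names are only illustrative, and the stress values are assumed to be fractions (0.48 meaning 48%), as the expected output implies.

```python
# Values copied from the sample rows for Model 1, Region 1, Variable GDP.
base = [10, 15, 20]            # Base scenario, in M USD
stress = [0.48, 0.11, 0.31]    # Stress 1, as a fractional change over Base

# Convert each stress value to the base unit: base * (1 + stress).
# round() is used only to hide floating-point noise in the printout.
converted = [round(b * (1 + s), 2) for b, s in zip(base, stress)]
print(converted)
```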

What I already tried: I am trying to use the df.loc function to find the matching values and use them to do the calculation:

test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Stress1') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))] = test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Base') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))] * (1 + test_df.loc[((test_df['Model'] == '1') & (test_df['Scenario'] == 'Stress1') & (test_df['Region'] == "1") & (test_df['Variable'] == 'GDP'))])

Obs3: I can see 2 issues with this approach. First, I am not able to properly restrict the operation to the "Year" columns. Second, I am unsure how to perform this operation at once for the whole dataframe without writing one line for each possible combination of model/scenario/region/variable.

Question: Is there a way to perform this operation? If so, what would be the best way?

CodePudding user response：

First filter the Base rows. In the sample data the simple condition test_df['Scenario'] == 'Base' is enough. Then move the columns needed to align with the other DataFrame into the index, here 'Model', 'Region' and 'Variable'. Scenario and Unit differ between the base and stress rows, so they are left out. Finally, keep only the year columns by selecting them with a list:

years = ['Year1','Year2','Year50']
df1 = (test_df[(test_df['Scenario'] == 'Base')]
             .set_index(['Model','Region','Variable'])[years])
print (df1)
                       Year1  Year2  Year50
Model Region Variable                      
1     1      GDP        10.0   15.0    20.0
      2      GDP        30.0   35.0    50.0
      3      GDP        20.0   75.0    80.0
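With fifty year columns, typing the list by hand is tedious. One possible way to build it, assuming every value column starts with "Year" and no other column does (an assumption based on the sample layout, not something the question confirms), is sketched below on a hypothetical two-row miniature of test_df:

```python
import pandas as pd

# Hypothetical miniature of test_df with the same column layout as the sample.
test_df = pd.DataFrame({
    "Model": [1, 1],
    "Scenario": ["Base", "Stress 1"],
    "Region": [1, 1],
    "Variable": ["GDP", "GDP"],
    "Unit": ["M USD", "% diff"],
    "Year1": [10, 0.48],
    "Year2": [15, 0.11],
    "Year50": [20, 0.31],
})

# Assumption: value columns are exactly those whose name starts with "Year".
years = [col for col in test_df.columns if col.startswith("Year")]
print(years)
```

If the real file uses plain years such as 2020 as headers, the filter would need to be adapted accordingly.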

df2 is built in a similar way, this time keeping Scenario and Unit in the index as well:

df2 = (test_df[(test_df['Unit'] == '% diff')]
             .set_index(['Model','Scenario','Region','Variable','Unit'])[years])
print (df2)
                                       Year1  Year2  Year50
Model Scenario Region Variable Unit                        
1     Stress 1 1      GDP      % diff   0.48   0.11    0.31
               2      GDP      % diff   0.12   0.33    0.89
               3      GDP      % diff   0.76   0.54    0.08
      Stress 2 1      GDP      % diff   0.37   0.94    0.13
               2      GDP      % diff   0.73   0.76    0.35
               3      GDP      % diff   0.15   0.45    0.37
      Stress 3 1      GDP      % diff   0.49   0.14    0.37
               2      GDP      % diff   0.14   0.73    0.94
               3      GDP      % diff   0.96   0.26    0.85

Because every index level of df1 also appears in df2.index, and the index of df1 is unique, pandas can align the two frames on the shared levels. So it is possible to add 1 to df2 and then multiply by df1:

df = df2.add(1).mul(df1).reset_index()
print (df)
   Model  Region Variable  Scenario    Unit  Year1   Year2  Year50
0      1       1      GDP  Stress 1  % diff   14.8   16.65    26.2
1      1       1      GDP  Stress 2  % diff   13.7   29.10    22.6
2      1       1      GDP  Stress 3  % diff   14.9   17.10    27.4
3      1       2      GDP  Stress 1  % diff   33.6   46.55    94.5
4      1       2      GDP  Stress 2  % diff   51.9   61.60    67.5
5      1       2      GDP  Stress 3  % diff   34.2   60.55    97.0
6      1       3      GDP  Stress 1  % diff   35.2  115.50    86.4
7      1       3      GDP  Stress 2  % diff   23.0  108.75   109.6
8      1       3      GDP  Stress 3  % diff   39.2   94.50   148.0
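For comparison, the same result can also be reached with an explicit merge instead of index alignment. This is an alternative sketch, not the method used above. It runs on a hypothetical subset of the sample data (two regions, two stress scenarios, two year columns), and the names keys, base_cols, merged and result are illustrative.

```python
import pandas as pd

# Hypothetical subset of the sample data from the question.
test_df = pd.DataFrame({
    "Model": [1] * 6,
    "Scenario": ["Base", "Base", "Stress 1", "Stress 1", "Stress 2", "Stress 2"],
    "Region": [1, 2, 1, 2, 1, 2],
    "Variable": ["GDP"] * 6,
    "Unit": ["M USD", "M USD", "% diff", "% diff", "% diff", "% diff"],
    "Year1": [10, 30, 0.48, 0.12, 0.37, 0.73],
    "Year2": [15, 35, 0.11, 0.33, 0.94, 0.76],
})

keys = ["Model", "Region", "Variable"]      # columns that identify the matching Base row
years = ["Year1", "Year2"]                  # value columns to convert
base_cols = [f"{y}_base" for y in years]    # names the Base values get after the merge

base = test_df.loc[test_df["Scenario"] == "Base", keys + years]
stress = test_df[test_df["Unit"] == "% diff"]

# Attach the Base values to every stress row.
# validate="many_to_one" raises an error if a key combination has more than one Base row.
merged = stress.merge(base, on=keys, how="left",
                      suffixes=("", "_base"), validate="many_to_one")

# base * (1 + stress), computed on plain arrays so the differing column names do not matter.
merged[years] = merged[base_cols].to_numpy() * (1 + merged[years].to_numpy())
result = merged.drop(columns=base_cols)
print(result)
```

The merge is more verbose than aligning on the index, but it keeps the original column order of the stress rows and makes the uniqueness requirement on the Base rows explicit. As in the output above, the Unit column still reads "% diff" and would need relabeling if that matters.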