Home > front end >  How to replicate index/match (with multiple criteria) in in python with multiple dataframes?
How to replicate index/match (with multiple criteria) in in python with multiple dataframes?

Time:10-29

I am trying to replicate a excel model I have in python to automate it as I scale it up but I am stuck on how to translate the complex formula's into python.

I have information in three dataframes:

DF1:

ID type 1 ID type 2 Unit
a 1_a 400
b 1_b 26
c 1_c 23
d 1_b 45
e 1_d 24
f 1_b 85
g 1_a 98

DF2:

ID type 1 ID type 2 Tech
a 1_a wind
b 1_b solar
c 1_c gas
d 1_b coal
e 1_d wind
f 1_b gas
g 1_a coal

And DF 3, the main DF:

Date Time ID type 1 ID type 2 Period output Unit * Tech *
03/01/2022 02:30:00 a 1_a 1 254
03/01/2022 02:30:00 b 1_b 1 456
03/01/2022 02:30:00 c 1_c 2 3325
03/01/2022 02:30:00 d 1_b 2 1254
05/01/2022 02:30:00 e 1_d 3 489
05/01/2022 02:30:00 a 1_a 3 452
05/01/2022 02:30:00 b 1_b 4 12
05/01/2022 02:30:00 c 1_c 4 1
05/01/2022 03:00:00 d 1_b 35 54
05/01/2022 03:00:00 e 1_d 35 48
05/01/2022 03:00:00 a 1_a 48 56

I wish to get the information from each ID type in DF 3 for "unit" and "Tech" from DF 1 & 2 into DF 3. The conditional statements I have in excel atm are based on INDEX and MATCH and INFA, as some of the ID types in DF will be from either ID type 1 or ID type 2 so the function checks both columns and based on a positve match yields the required result.

For more context, DF1 and DF2 do not change but DF3 changes and I need a function for that which I will explain later.

The excel function I use to fill in Unit* from DF1 is (note I have replaced the excel sheet name to DF1 to help conceptualize the problem:

=IFNA(INDEX('DF1'!$K$3:$K$1011,MATCH(N2,'DF1'!$E$3:$E$1011,0)),INDEX('DF1'!$K$3:$K$1011,MATCH(M2,'DF1'!$D$3:$D$1011,0)))

The excel function I use to fill in Tech * is a bit more straight forward:

=IFNA(INDEX('DF2'$L:$L,MATCH(O3,'DF2'$K:$K,0)),INDEX('DF2'$L:$L,MATCH(N3,'DF2'$J:$J,0)))

That is the main stumbling block at the moment, but after this is achieved I need a function that for each day produces the following DF:

ID type 1 Tech Period 1 Period 2 Period 3 Period 4 Period 5 Period 6 Period 7
a wind Sum of output for this ID Type 1 and Period 1
b solar
c gas
d coal
e wind
a gas

The idea here is that I can use conditional function again to sum the "output" column of DF3 under the condition of date, ID type and period number.

EDIT: Output based on possible solution:

           time settlementDate BM Unit ID 1 BM Unit ID 2  settlementPeriod  \
0      00:00:00     03/01/2022      RCBKO-1    T_RCBKO-1                 1   
1      00:00:00     03/01/2022      LARYO-3    T_LARYW-3                 1   
2      00:00:00     03/01/2022       LAGA-1     T_LAGA-1                 1   
3      00:00:00     03/01/2022      CRMLW-1    T_CRMLW-1                 1   
4      00:00:00     03/01/2022      GRIFW-1    T_GRIFW-1                 1   
...         ...            ...          ...          ...               ...   
52533  23:30:00     08/01/2022      CRMLW-1    T_CRMLW-1                48   
52534  23:30:00     08/01/2022      LARYO-4    T_LARYW-4                48   
52535  23:30:00     08/01/2022      HOWBO-3    T_HOWBO-3                48   
52536  23:30:00     08/01/2022      BETHW-1    E_BETHW-1                48   
52537  23:30:00     08/01/2022      HMGTO-1    T_HMGTO-1                48   

       quantity  Capacity_x  Technology  Technology_x  \
0       278.658         NaN          NaN         WIND   
1       162.940         NaN          NaN         WIND   
2       262.200         NaN          NaN         CCGT   
3         3.002         NaN          NaN         WIND   
4         9.972         NaN          NaN         WIND   
...         ...         ...          ...          ...   
52533     8.506         NaN          NaN         WIND   
52534   159.740         NaN          NaN         WIND   
52535    32.554         NaN          NaN          NaN   
52536     5.010         NaN          NaN         WIND   
52537    92.094         NaN          NaN         WIND   

      Registered Resource Name_x  Capacity_y Technology_y  \
0                            NaN         NaN         WIND   
1                            NaN         NaN         WIND   
2                            NaN         NaN         CCGT   
3                            NaN         NaN         WIND   
4                            NaN         NaN         WIND   
...                          ...         ...          ...   
52533                        NaN         NaN         WIND   
52534                        NaN         NaN         WIND   
52535                        NaN         NaN          NaN   
52536                        NaN         NaN         WIND   
52537                        NaN         NaN         WIND   

      Registered Resource Name_y  Capacity  
0                            NaN       NaN  
1                            NaN       NaN  
2                            NaN       NaN  
3                            NaN       NaN  
4                            NaN       NaN  
...                          ...       ...  
52533                        NaN       NaN  
52534                        NaN       NaN  
52535                        NaN       NaN  
52536                        NaN       NaN  
52537                        NaN       NaN  

[52538 rows x 14 columns]

CodePudding user response:

For the first part of your question you want to do a left merge on those 2 columns twice like this:

df3 = (
    df3
    .merge(df1, on=['ID type 1', 'ID type 2'], how='left')
    .merge(df2, on=['ID type 1', 'ID type 2'], how='left')
)
print(df3)
          Date      Time ID type 1 ID type 2  Period  output  Unit   Tech
0   03/01/2022  02:30:00         a       1_a       1     254   400   wind
1   03/01/2022  02:30:00         b       1_b       1     456    26  solar
2   03/01/2022  02:30:00         c       1_c       2    3325    23    gas
3   03/01/2022  02:30:00         d       1_b       2    1254    45   coal
4   05/01/2022  02:30:00         e       1_d       3     489    24   wind
5   05/01/2022  02:30:00         a       1_a       3     452   400   wind
6   05/01/2022  02:30:00         b       1_b       4      12    26  solar
7   05/01/2022  02:30:00         c       1_c       4       1    23    gas
8   05/01/2022  03:00:00         d       1_b      35      54    45   coal
9   05/01/2022  03:00:00         e       1_d      35      48    24   wind
10  05/01/2022  03:00:00         a       1_a      48      56   400   wind

For the next part you could use a pandas.pivot_table.

out = (
    df3
    .pivot_table(
        index=['Date', 'ID type 1', 'Tech'], 
        columns='Period', 
        values='output', 
        aggfunc=sum, 
        fill_value=0)
    .add_prefix('Period_')
)
print(out)

Output:

Period                      Period_1  Period_2  Period_3  Period_4  Period_35  Period_48
Date       ID type 1 Tech                                                               
03/01/2022 a         wind        254         0         0         0          0          0
           b         solar       456         0         0         0          0          0
           c         gas           0      3325         0         0          0          0
           d         coal          0      1254         0         0          0          0
05/01/2022 a         wind          0         0       452         0          0         56
           b         solar         0         0         0        12          0          0
           c         gas           0         0         0         1          0          0
           d         coal          0         0         0         0         54          0
           e         wind          0         0       489         0         48          0

I used fill_value to show you that option, without it you get 'NaN' in those cells.

  • Related