I am trying to replicate a excel model I have in python to automate it as I scale it up but I am stuck on how to translate the complex formula's into python.
I have information in three dataframes:
DF1:
ID type 1 | ID type 2 | Unit |
---|---|---|
a | 1_a | 400 |
b | 1_b | 26 |
c | 1_c | 23 |
d | 1_b | 45 |
e | 1_d | 24 |
f | 1_b | 85 |
g | 1_a | 98 |
DF2:
ID type 1 | ID type 2 | Tech |
---|---|---|
a | 1_a | wind |
b | 1_b | solar |
c | 1_c | gas |
d | 1_b | coal |
e | 1_d | wind |
f | 1_b | gas |
g | 1_a | coal |
And DF 3, the main DF:
Date | Time | ID type 1 | ID type 2 | Period | output | Unit * | Tech * |
---|---|---|---|---|---|---|---|
03/01/2022 | 02:30:00 | a | 1_a | 1 | 254 | ||
03/01/2022 | 02:30:00 | b | 1_b | 1 | 456 | ||
03/01/2022 | 02:30:00 | c | 1_c | 2 | 3325 | ||
03/01/2022 | 02:30:00 | d | 1_b | 2 | 1254 | ||
05/01/2022 | 02:30:00 | e | 1_d | 3 | 489 | ||
05/01/2022 | 02:30:00 | a | 1_a | 3 | 452 | ||
05/01/2022 | 02:30:00 | b | 1_b | 4 | 12 | ||
05/01/2022 | 02:30:00 | c | 1_c | 4 | 1 | ||
05/01/2022 | 03:00:00 | d | 1_b | 35 | 54 | ||
05/01/2022 | 03:00:00 | e | 1_d | 35 | 48 | ||
05/01/2022 | 03:00:00 | a | 1_a | 48 | 56 |
I wish to get the information from each ID type in DF 3 for "unit" and "Tech" from DF 1 & 2 into DF 3. The conditional statements I have in excel atm are based on INDEX and MATCH and INFA, as some of the ID types in DF will be from either ID type 1 or ID type 2 so the function checks both columns and based on a positve match yields the required result.
For more context, DF1 and DF2 do not change but DF3 changes and I need a function for that which I will explain later.
The excel function I use to fill in Unit* from DF1 is (note I have replaced the excel sheet name to DF1 to help conceptualize the problem:
=IFNA(INDEX('DF1'!$K$3:$K$1011,MATCH(N2,'DF1'!$E$3:$E$1011,0)),INDEX('DF1'!$K$3:$K$1011,MATCH(M2,'DF1'!$D$3:$D$1011,0)))
The excel function I use to fill in Tech * is a bit more straight forward:
=IFNA(INDEX('DF2'$L:$L,MATCH(O3,'DF2'$K:$K,0)),INDEX('DF2'$L:$L,MATCH(N3,'DF2'$J:$J,0)))
That is the main stumbling block at the moment, but after this is achieved I need a function that for each day produces the following DF:
ID type 1 | Tech | Period 1 | Period 2 | Period 3 | Period 4 | Period 5 | Period 6 | Period 7 | … |
---|---|---|---|---|---|---|---|---|---|
a | wind | Sum of output for this ID Type 1 and Period 1 | |||||||
b | solar | ||||||||
c | gas | ||||||||
d | coal | ||||||||
e | wind | ||||||||
a | gas | ||||||||
… | … |
The idea here is that I can use conditional function again to sum the "output" column of DF3 under the condition of date, ID type and period number.
EDIT: Output based on possible solution:
time settlementDate BM Unit ID 1 BM Unit ID 2 settlementPeriod \
0 00:00:00 03/01/2022 RCBKO-1 T_RCBKO-1 1
1 00:00:00 03/01/2022 LARYO-3 T_LARYW-3 1
2 00:00:00 03/01/2022 LAGA-1 T_LAGA-1 1
3 00:00:00 03/01/2022 CRMLW-1 T_CRMLW-1 1
4 00:00:00 03/01/2022 GRIFW-1 T_GRIFW-1 1
... ... ... ... ... ...
52533 23:30:00 08/01/2022 CRMLW-1 T_CRMLW-1 48
52534 23:30:00 08/01/2022 LARYO-4 T_LARYW-4 48
52535 23:30:00 08/01/2022 HOWBO-3 T_HOWBO-3 48
52536 23:30:00 08/01/2022 BETHW-1 E_BETHW-1 48
52537 23:30:00 08/01/2022 HMGTO-1 T_HMGTO-1 48
quantity Capacity_x Technology Technology_x \
0 278.658 NaN NaN WIND
1 162.940 NaN NaN WIND
2 262.200 NaN NaN CCGT
3 3.002 NaN NaN WIND
4 9.972 NaN NaN WIND
... ... ... ... ...
52533 8.506 NaN NaN WIND
52534 159.740 NaN NaN WIND
52535 32.554 NaN NaN NaN
52536 5.010 NaN NaN WIND
52537 92.094 NaN NaN WIND
Registered Resource Name_x Capacity_y Technology_y \
0 NaN NaN WIND
1 NaN NaN WIND
2 NaN NaN CCGT
3 NaN NaN WIND
4 NaN NaN WIND
... ... ... ...
52533 NaN NaN WIND
52534 NaN NaN WIND
52535 NaN NaN NaN
52536 NaN NaN WIND
52537 NaN NaN WIND
Registered Resource Name_y Capacity
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
52533 NaN NaN
52534 NaN NaN
52535 NaN NaN
52536 NaN NaN
52537 NaN NaN
[52538 rows x 14 columns]
CodePudding user response:
For the first part of your question you want to do a left merge on those 2 columns twice like this:
df3 = (
df3
.merge(df1, on=['ID type 1', 'ID type 2'], how='left')
.merge(df2, on=['ID type 1', 'ID type 2'], how='left')
)
print(df3)
Date Time ID type 1 ID type 2 Period output Unit Tech
0 03/01/2022 02:30:00 a 1_a 1 254 400 wind
1 03/01/2022 02:30:00 b 1_b 1 456 26 solar
2 03/01/2022 02:30:00 c 1_c 2 3325 23 gas
3 03/01/2022 02:30:00 d 1_b 2 1254 45 coal
4 05/01/2022 02:30:00 e 1_d 3 489 24 wind
5 05/01/2022 02:30:00 a 1_a 3 452 400 wind
6 05/01/2022 02:30:00 b 1_b 4 12 26 solar
7 05/01/2022 02:30:00 c 1_c 4 1 23 gas
8 05/01/2022 03:00:00 d 1_b 35 54 45 coal
9 05/01/2022 03:00:00 e 1_d 35 48 24 wind
10 05/01/2022 03:00:00 a 1_a 48 56 400 wind
For the next part you could use a pandas.pivot_table
.
out = (
df3
.pivot_table(
index=['Date', 'ID type 1', 'Tech'],
columns='Period',
values='output',
aggfunc=sum,
fill_value=0)
.add_prefix('Period_')
)
print(out)
Output:
Period Period_1 Period_2 Period_3 Period_4 Period_35 Period_48
Date ID type 1 Tech
03/01/2022 a wind 254 0 0 0 0 0
b solar 456 0 0 0 0 0
c gas 0 3325 0 0 0 0
d coal 0 1254 0 0 0 0
05/01/2022 a wind 0 0 452 0 0 56
b solar 0 0 0 12 0 0
c gas 0 0 0 1 0 0
d coal 0 0 0 0 54 0
e wind 0 0 489 0 48 0
I used fill_value to show you that option, without it you get 'NaN' in those cells.