Good day everyone
Suppose that I have these annual datasets (i.e., January 1st to December 31st for multiple years) that show how much energy is being produced by various generators in a certain area. The data looks something like the table below. The table only has four generators (i.e., generators A, B, C and D) and only covers year 2022, but please keep in mind that the actual dataset has A LOT more than four generators and covers multiple years.
Date | Generator | Type | Period | Capacity | Production |
---|---|---|---|---|---|
2022-01-01 | A | Hydro | 1 | 12 | 10 |
2022-01-01 | A | Hydro | 2 | 12 | 8 |
2022-01-01 | A | Hydro | 3 | 12 | 5 |
2022-01-01 | A | Hydro | 4 | 12 | 11 |
2022-01-01 | B | Nuclear | 1 | 10 | 10 |
2022-01-01 | B | Nuclear | 2 | 10 | 7 |
2022-01-01 | B | Nuclear | 3 | 10 | 4 |
2022-01-01 | B | Nuclear | 4 | 10 | 10 |
2022-01-01 | C | Wind | 1 | 15 | 12 |
2022-01-01 | C | Wind | 2 | 15 | 9 |
2022-01-01 | C | Wind | 3 | 15 | 7 |
2022-01-01 | C | Wind | 4 | 15 | 14 |
2022-01-01 | D | Hydro | 1 | 11 | 8 |
2022-01-01 | D | Hydro | 2 | 11 | 6 |
2022-01-01 | D | Hydro | 3 | 11 | 3 |
2022-01-01 | D | Hydro | 4 | 11 | 9 |
2022-01-02 | A | Hydro | 1 | 12 | 9.5 |
2022-01-02 | A | Hydro | 2 | 12 | 9 |
2022-01-02 | A | Hydro | 3 | 12 | 4 |
2022-01-02 | A | Hydro | 4 | 12 | 11 |
2022-01-02 | B | Nuclear | 1 | 10 | 10 |
2022-01-02 | B | Nuclear | 2 | 10 | 8 |
2022-01-02 | B | Nuclear | 3 | 10 | 4 |
2022-01-02 | B | Nuclear | 4 | 10 | 10 |
2022-01-02 | C | Wind | 1 | 15 | 11 |
2022-01-02 | C | Wind | 2 | 15 | 8 |
2022-01-02 | C | Wind | 3 | 15 | 7 |
2022-01-02 | C | Wind | 4 | 15 | 13 |
2022-01-02 | D | Hydro | 1 | 11 | 7.5 |
2022-01-02 | D | Hydro | 2 | 11 | 7 |
2022-01-02 | D | Hydro | 3 | 11 | 2 |
2022-01-02 | D | Hydro | 4 | 11 | 9 |
. | . | . | . | . | . |
. | . | . | . | . | . |
. | . | . | . | . | . |
2022-12-31 | B | Nuclear | 4 | 80 | 73 |
2022-12-31 | C | Wind | 1 | 15 | 6 |
2022-12-31 | C | Wind | 2 | 15 | 3 |
2022-12-31 | C | Wind | 3 | 15 | 2 |
2022-12-31 | C | Wind | 4 | 15 | 8 |
2022-12-31 | D | Hydro | 1 | 11 | 6 |
2022-12-31 | D | Hydro | 2 | 11 | 6 |
2022-12-31 | D | Hydro | 3 | 11 | 2 |
2022-12-31 | D | Hydro | 4 | 11 | 8 |
I want to create another column called "Difference" that shows the difference between the capacity and production for each time period and generator. This is quite simple for wind or nuclear generators since I can just do capacity minus production for each time period.
For each hydro unit, however, the process is not that simple due to unique nature of hydro-electric generators. I first have to take the difference between capacity and production in each period and multiply that by generator's capacity minus production for the year (i.e., sum of generator's capacity for the year minus sum of generator's production).
For example, if I wanted to calculate hydro generator A's "difference" on 2022-01-01 at Period 1, I would do:
(Capacity at period 1 - Production at period 1) multiply by (2022 sum of Generator A's Capacity - 2022 sum of Generator A's Production). So it would be: (12-10) multiply by (Z - X), where "Z" = 2022 sum of A's capacity and "X" = 2022 sum of A's production. If I was doing this for 2021 dates, then I would have to use the 2021 sum values (I hope I'm making sense).
Now, I can easily do this for wind or nuclear generators using "mutate" function. But I am not sure how I'm gonna do this for hydro units. So it would be great if anyone could assist me with this! Below is the sample data frame that you can use to answer my question. Please note that the sample data frame below only covers January 1, 2022 to January 2, 2022, but the actual dataset covers January 1 to December 31 for multiple years.
df <- data.frame(date=as.Date(c('1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
'1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
'1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
'1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
'1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
'1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
'1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
'1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022'), '%m/%d/%Y'),
Generator = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D',
'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D'),
Type = c('Hydro', 'Hydro', 'Hydro', 'Hydro',
"Nuclear", "Nuclear", "Nuclear", "Nuclear",
"Wind", "Wind", "Wind", "Wind",
'Hydro', 'Hydro', 'Hydro', 'Hydro',
'Hydro', 'Hydro', 'Hydro', 'Hydro',
"Nuclear","Nuclear", "Nuclear", "Nuclear",
"Wind", "Wind", "Wind", "Wind",
'Hydro', 'Hydro', 'Hydro', 'Hydro'),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Capacity = c(12, 12, 12, 12, 10, 10, 10, 10, 15, 15, 15, 15, 11, 11, 11, 11,
12, 12, 12, 12, 10, 10, 10, 10, 15, 15, 15, 15, 11, 11, 11, 11),
Production = c(10, 8, 5, 11, 10, 7, 4, 10, 12, 9, 7, 14, 8, 6, 3, 9,
9.5, 9, 4, 11, 10, 8, 4, 10, 11, 8, 7, 13, 7.5, 7, 2, 9))
At the end, the final output should look something like:
Date | Generator | Type | Period | Capacity | Production | Difference |
---|---|---|---|---|---|---|
2022-01-01 | A | Hydro | 1 | 12 | 10 | 57 |
2022-01-01 | A | Hydro | 2 | 12 | 8 | 114 |
2022-01-01 | A | Hydro | 3 | 12 | 5 | 199.5 |
2022-01-01 | A | Hydro | 4 | 12 | 11 | 28.5 |
2022-01-01 | B | Nuclear | 1 | 10 | 10 | 0 |
2022-01-01 | B | Nuclear | 2 | 10 | 7 | 3 |
2022-01-01 | B | Nuclear | 3 | 10 | 4 | 6 |
2022-01-01 | B | Nuclear | 4 | 10 | 10 | 0 |
2022-01-01 | C | Wind | 1 | 15 | 12 | 3 |
2022-01-01 | C | Wind | 2 | 15 | 9 | 6 |
2022-01-01 | C | Wind | 3 | 15 | 7 | 8 |
2022-01-01 | C | Wind | 4 | 15 | 14 | 1 |
2022-01-01 | D | Hydro | 1 | 11 | 8 | 109.5 |
2022-01-01 | D | Hydro | 2 | 11 | 6 | 182.5 |
2022-01-01 | D | Hydro | 3 | 11 | 3 | 292 |
2022-01-01 | D | Hydro | 4 | 11 | 9 | 73 |
2022-01-02 | A | Hydro | 1 | 12 | 9.5 | 71.25 |
2022-01-02 | A | Hydro | 2 | 12 | 9 | 85.5 |
2022-01-02 | A | Hydro | 3 | 12 | 4 | 228 |
2022-01-02 | A | Hydro | 4 | 12 | 11 | 28.5 |
2022-01-02 | B | Nuclear | 1 | 10 | 10 | 0 |
2022-01-02 | B | Nuclear | 2 | 10 | 8 | 2 |
2022-01-02 | B | Nuclear | 3 | 10 | 4 | 6 |
2022-01-02 | B | Nuclear | 4 | 10 | 10 | 0 |
2022-01-02 | C | Wind | 1 | 15 | 11 | 4 |
2022-01-02 | C | Wind | 2 | 15 | 8 | 7 |
2022-01-02 | C | Wind | 3 | 15 | 7 | 8 |
2022-01-02 | C | Wind | 4 | 15 | 13 | 2 |
2022-01-02 | D | Hydro | 1 | 11 | 7.5 | 127.75 |
2022-01-02 | D | Hydro | 2 | 11 | 7 | 146 |
2022-01-02 | D | Hydro | 3 | 11 | 2 | 328.5 |
2022-01-02 | D | Hydro | 4 | 11 | 9 | 73 |
Again, (sorry for being so repetitive), the actual dataset covers multiple years and has more generators, so the codes/solution should be as "general/flexible" as possible and be able to be used for much larger datasets. That is, the solution shouldn't be too "rigid" to the point that it can ONLY be used to address the example above. IF doing this for multiple years is challenging, then we can just focus on a specific year for now.
Thank you
CodePudding user response:
If I understand it right the grouping should be per year (and Generator).
One approach is using dplyr
s case_when
library(dplyr)
df %>%
group_by(year = strftime(date, format="%Y"), Generator) %>%
mutate(Difference = case_when(
Type == "Hydro" ~ (Capacity - Production) * (sum(Capacity) - sum(Production)),
Type == "Nuclear" | Type == "Wind" ~ Capacity - Production)) %>%
ungroup() %>%
select(-year)
# A tibble: 32 × 7
date Generator Type Period Capacity Production Difference
<date> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2022-01-01 A Hydro 1 12 10 57
2 2022-01-01 A Hydro 2 12 8 114
3 2022-01-01 A Hydro 3 12 5 200.
4 2022-01-01 A Hydro 4 12 11 28.5
5 2022-01-01 B Nuclear 1 10 10 0
6 2022-01-01 B Nuclear 2 10 7 3
7 2022-01-01 B Nuclear 3 10 4 6
8 2022-01-01 B Nuclear 4 10 10 0
9 2022-01-01 C Wind 1 15 12 3
10 2022-01-01 C Wind 2 15 9 6
# … with 22 more rows