I have the following data frame and would like to create the column OUTPUT_COLUMN.
Explanation of columns:
- ID is the identification number of the policy
- ID_REG_YEAR is the identification number per registration year
- CALENDAR_YEAR is the year in which the policy has exposure
- NUMBER_OF_RENEWALS is the number of times the policy has been renewed
- ACCIDENT indicates whether an accident occurred
KEY TO THE DATASET: ID_REG_YEAR and CALENDAR_YEAR
Basically, if NUMBER_OF_RENEWALS = 0, then OUTPUT_COLUMN = 100. Any row for a policy that has not had an accident yet should also contain 100 (e.g. rows 13, 16, 17). Once an accident has occurred, I would like to count the number of renewals since that accident, restarting at 0 when the next accident occurs.
   ID ID_REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN
1   A      A_2015          2015                  0      YES           100
2   A      A_2015          2016                  0      YES           100
3   A      A_2016          2016                  1      YES             0
4   A      A_2016          2017                  1      YES             0
5   A      A_2017          2017                  2       NO             1
6   A      A_2017          2018                  2       NO             1
7   A      A_2018          2018                  3       NO             2
8   A      A_2018          2019                  3       NO             2
9   A      A_2019          2019                  4      YES             0
10  A      A_2019          2020                  4      YES             0
11  B      B_2015          2015                  0       NO           100
12  B      B_2015          2016                  0       NO           100
13  B      B_2016          2016                  1       NO           100
14  C      C_2013          2013                  0       NO           100
15  C      C_2013          2014                  0       NO           100
16  C      C_2014          2014                  1       NO           100
17  C      C_2014          2015                  1       NO           100
18  C      C_2015          2015                  2      YES             0
19  C      C_2015          2016                  2      YES             0
20  C      C_2016          2016                  3       NO             1
21  C      C_2016          2017                  3       NO             1
22  C      C_2017          2017                  4       NO             2
23  C      C_2017          2018                  4       NO             2
24  C      C_2018          2018                  5      YES             0
25  C      C_2018          2019                  5      YES             0
26  C      C_2019          2019                  6       NO             1
27  C      C_2019          2020                  6       NO             1
28  C      C_2020          2020                  7       NO             2
CodePudding user response:
Here is a solution using dplyr and tidyr (separate() and fill() come from tidyr). First, split ID_REG_YEAR to obtain a separate column for the registration year, which is used to calculate the renewals since the prior accident (this assumes one renewal per year, so that renewals since the last accident equal years since the last accident). Then, after grouping by ID, create a column that holds the registration year of the most recent accident. fill() propagates this value down to the following rows within each policy. The final output column is set to 100 if there is no prior accident or NUMBER_OF_RENEWALS is zero, and otherwise to the registration year minus the last accident year.
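To see the fill step in isolation, here is a minimal sketch on toy data. The tibble toy and its columns g and last are made up for illustration and are not part of the question's data; the point is only that a grouped fill() carries the last non-missing value downwards and does not leak it into the next group.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): two groups, NA wherever no accident year is known.
toy <- tibble(
  g    = c("x", "x", "x", "y", "y"),
  last = c(NA, 2016L, NA, NA, NA)
)

filled <- toy %>%
  group_by(g) %>%
  # Carry the most recent non-missing value forward within each group
  fill(last, .direction = "down") %>%
  ungroup()

# Group "x": the first row stays NA (nothing above it), the third row becomes 2016.
# Group "y": every row stays NA, because the fill restarts for each group.
print(filled)
```

The rows that remain NA after the fill are exactly the ones that should receive 100 in the final column.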
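library(dplyr)
library(tidyr)  # separate() and fill() are tidyr functions

df %>%
  # Split "A_2015" into ID_REG = "A" and REG_YEAR = 2015 (converted to integer)
  separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
  group_by(ID) %>%
  # Registration year on accident rows, NA otherwise
  mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
  # Carry the last accident year forward within each policy
  fill(LAST_ACCIDENT, .direction = "down") %>%
  mutate(OUTPUT_COLUMN_2 = ifelse(
    is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0, 100, REG_YEAR - LAST_ACCIDENT
  ))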
Output
   ID    ID_REG REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN LAST_ACCIDENT OUTPUT_COLUMN_2
   <chr> <chr>     <int>         <int>              <int> <chr>            <int>         <int>           <dbl>
 1 A     A          2015          2015                  0 YES                100          2015             100
 2 A     A          2015          2016                  0 YES                100          2015             100
 3 A     A          2016          2016                  1 YES                  0          2016               0
 4 A     A          2016          2017                  1 YES                  0          2016               0
 5 A     A          2017          2017                  2 NO                   1          2016               1
 6 A     A          2017          2018                  2 NO                   1          2016               1
 7 A     A          2018          2018                  3 NO                   2          2016               2
 8 A     A          2018          2019                  3 NO                   2          2016               2
 9 A     A          2019          2019                  4 YES                  0          2019               0
10 A     A          2019          2020                  4 YES                  0          2019               0
# … with 18 more rows
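As a sanity check, the sketch below rebuilds the question's 28 rows and confirms that the year-based pipeline reproduces the requested OUTPUT_COLUMN on every row. The vectors are typed in from the table in the question, and the helper name add_year_based_output is made up for this illustration; it assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Registration year per row, typed in from the question's table.
reg_year <- c(rep(2015:2019, each = 2),        # policy A, 10 rows
              2015, 2015, 2016,                # policy B, 3 rows
              rep(2013:2019, each = 2), 2020)  # policy C, 15 rows

df <- tibble(
  ID = rep(c("A", "B", "C"), times = c(10, 3, 15)),
  ID_REG_YEAR = paste(ID, reg_year, sep = "_"),
  # Each registration year spans two calendar years, except the last row of B and C
  CALENDAR_YEAR = reg_year + c(rep(0:1, 5), 0, 1, 0, rep(0:1, 7), 0),
  NUMBER_OF_RENEWALS = c(rep(0:4, each = 2), 0, 0, 1, rep(0:6, each = 2), 7),
  ACCIDENT = c(rep(c("YES", "NO", "YES"), times = c(4, 4, 2)),
               rep("NO", 3),
               rep(c("NO", "YES", "NO", "YES", "NO"), times = c(4, 2, 4, 2, 3))),
  OUTPUT_COLUMN = c(100, 100, 0, 0, 1, 1, 2, 2, 0, 0,
                    100, 100, 100,
                    100, 100, 100, 100, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2)
)

# Hypothetical wrapper around the year-based pipeline from the answer.
add_year_based_output <- function(data) {
  data %>%
    separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
    group_by(ID) %>%
    mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
    fill(LAST_ACCIDENT, .direction = "down") %>%
    mutate(OUTPUT_COLUMN_2 = ifelse(
      is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0, 100, REG_YEAR - LAST_ACCIDENT
    )) %>%
    ungroup()
}

result <- add_year_based_output(df)

# Stops with an error if any computed value differs from the requested column.
stopifnot(all(result$OUTPUT_COLUMN_2 == result$OUTPUT_COLUMN))
```

The ungroup() at the end is a design choice for the check only, so that later operations on result are not silently applied per policy.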
Note: If you want to go by the renewal count (NUMBER_OF_RENEWALS) rather than by the year, you can do something similar. Instead of adding a column with the last accident year, store the renewal count at the last accident. The output column then reflects the number of renewals since the accident rather than the number of years, which matters if one or more years could be skipped.
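# Requires dplyr and tidyr (separate() and fill() are tidyr functions)
df %>%
  separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
  group_by(ID) %>%
  # Renewal count on accident rows, NA otherwise
  mutate(LAST_ACCIDENT_POLICY = ifelse(ACCIDENT == "YES", NUMBER_OF_RENEWALS, NA_integer_)) %>%
  # Carry the renewal count of the last accident forward within each policy
  fill(LAST_ACCIDENT_POLICY, .direction = "down") %>%
  mutate(OUTPUT_COLUMN_2 = ifelse(
    is.na(LAST_ACCIDENT_POLICY) | NUMBER_OF_RENEWALS == 0, 100, NUMBER_OF_RENEWALS - LAST_ACCIDENT_POLICY
  ))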
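To illustrate where the two variants diverge, here is a small sketch with a made-up policy D that has no 2017 registration. Policy D, the tibble gap, and the column names LAST_YEAR, LAST_POLICY, BY_YEAR and BY_RENEWAL are all hypothetical and not part of the question's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical policy "D": registered in 2015, 2016 and 2018 (2017 is skipped),
# with an accident in the 2016 registration year.
gap <- tibble(
  ID                 = "D",
  ID_REG_YEAR        = c("D_2015", "D_2016", "D_2018"),
  NUMBER_OF_RENEWALS = c(0L, 1L, 2L),
  ACCIDENT           = c("NO", "YES", "NO")
)

gap_out <- gap %>%
  separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
  group_by(ID) %>%
  mutate(
    LAST_YEAR   = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_),
    LAST_POLICY = ifelse(ACCIDENT == "YES", NUMBER_OF_RENEWALS, NA_integer_)
  ) %>%
  fill(LAST_YEAR, LAST_POLICY, .direction = "down") %>%
  mutate(
    # Years since the last accident
    BY_YEAR = ifelse(is.na(LAST_YEAR) | NUMBER_OF_RENEWALS == 0,
                     100, REG_YEAR - LAST_YEAR),
    # Renewals since the last accident
    BY_RENEWAL = ifelse(is.na(LAST_POLICY) | NUMBER_OF_RENEWALS == 0,
                        100, NUMBER_OF_RENEWALS - LAST_POLICY)
  ) %>%
  ungroup()

print(select(gap_out, REG_YEAR, NUMBER_OF_RENEWALS, ACCIDENT, BY_YEAR, BY_RENEWAL))
```

On the 2018 row the year-based column gives 2 (2018 minus 2016), while the renewal-based column gives 1 (renewal 2 minus renewal 1). Which of the two is right depends on whether a skipped year should count towards the time since the accident.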