Count the occurrences of accidents until the next accident

I have the following data frame and I would like to create the "OUTPUT_COLUMN".

Explanation of columns:

ID is the identification number of the policy
ID_REG_YEAR is the identification number per Registration Year
CALENDAR_YEAR is the year in which the policy has exposure
NUMBER_OF_RENEWALS is the number of times the policy has been renewed
ACCIDENT indicates whether an accident occurred

KEY TO THE DATASET: ID_REG_YEAR and CALENDAR_YEAR

Basically, if NUMBER_OF_RENEWALS = 0, then OUTPUT_COLUMN = 100. Any row with no prior accident should also contain 100 (e.g. rows 13, 16, 17). Once an accident has occurred, I would like to count the number of renewals since that accident; the count resets to 0 at the next accident.


   ID ID_REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN
1   A      A_2015          2015                  0      YES           100
2   A      A_2015          2016                  0      YES           100
3   A      A_2016          2016                  1      YES             0
4   A      A_2016          2017                  1      YES             0
5   A      A_2017          2017                  2       NO             1
6   A      A_2017          2018                  2       NO             1
7   A      A_2018          2018                  3       NO             2
8   A      A_2018          2019                  3       NO             2
9   A      A_2019          2019                  4      YES             0
10  A      A_2019          2020                  4      YES             0
11  B      B_2015          2015                  0       NO           100
12  B      B_2015          2016                  0       NO           100
13  B      B_2016          2016                  1       NO           100
14  C      C_2013          2013                  0       NO           100
15  C      C_2013          2014                  0       NO           100
16  C      C_2014          2014                  1       NO           100
17  C      C_2014          2015                  1       NO           100
18  C      C_2015          2015                  2      YES             0
19  C      C_2015          2016                  2      YES             0
20  C      C_2016          2016                  3       NO             1
21  C      C_2016          2017                  3       NO             1
22  C      C_2017          2017                  4       NO             2
23  C      C_2017          2018                  4       NO             2
24  C      C_2018          2018                  5      YES             0
25  C      C_2018          2019                  5      YES             0
26  C      C_2019          2019                  6       NO             1
27  C      C_2019          2020                  6       NO             1
28  C      C_2020          2020                  7       NO             2
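
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is a sketch transcribed from the printed rows; the column types (integer years and counts, character ID and ACCIDENT) and the helper vectors `ids` and `reg_year` are assumptions, not part of the original data.

```r
# Rebuild the sample data frame from the printed table (28 rows).
# Assumption: years and counts are integers, ID and ACCIDENT are character.
ids <- rep(c("A", "B", "C"), times = c(10, 3, 15))
reg_year <- c(rep(2015:2019, each = 2),          # policy A, rows 1-10
              2015L, 2015L, 2016L,               # policy B, rows 11-13
              rep(2013:2019, each = 2), 2020L)   # policy C, rows 14-28

df <- data.frame(
  ID = ids,
  ID_REG_YEAR = paste(ids, reg_year, sep = "_"),
  CALENDAR_YEAR = c(2015L, rep(2016:2019, each = 2), 2020L,
                    2015L, 2016L, 2016L,
                    2013L, rep(2014:2020, each = 2)),
  NUMBER_OF_RENEWALS = c(rep(0:4, each = 2), 0L, 0L, 1L,
                         rep(0:6, each = 2), 7L),
  ACCIDENT = c(rep(c("YES", "NO", "YES"), times = c(4, 4, 2)),
               rep("NO", 3),
               rep(c("NO", "YES", "NO", "YES", "NO"), times = c(4, 2, 4, 2, 3))),
  # Desired result, copied from the table
  OUTPUT_COLUMN = as.integer(c(100, 100, 0, 0, 1, 1, 2, 2, 0, 0,
                               100, 100, 100,
                               100, 100, 100, 100, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2)),
  stringsAsFactors = FALSE
)
```

The rows are already ordered by ID and registration year, which matters for any approach that carries values forward row by row.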

Answer:

Here is a solution using dplyr and tidyr. First, split ID_REG_YEAR to obtain a separate REG_YEAR column, which is used to calculate the renewals since the prior accident (this assumes one renewal per year, so the count equals the years since the last accident). Then, after grouping by ID, create a LAST_ACCIDENT column holding the registration year on rows where an accident occurred, and use fill to propagate that value down to the rows that follow. The final output column is 100 if there is no prior accident or NUMBER_OF_RENEWALS is zero; otherwise it is the registration year minus the last accident year.

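library(dplyr)
library(tidyr)  # separate() and fill() come from tidyr

df %>%
  # Split e.g. "A_2015" into ID_REG = "A" and REG_YEAR = 2015 (converted to integer)
  separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
  group_by(ID) %>%
  # Record the registration year on accident rows; NA elsewhere
  mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
  # Carry the most recent accident year down within each ID
  fill(LAST_ACCIDENT, .direction = "down") %>%
  mutate(OUTPUT_COLUMN_2 = ifelse(
    is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0, 100, REG_YEAR - LAST_ACCIDENT
  ))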

Output

   ID    ID_REG REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN LAST_ACCIDENT OUTPUT_COLUMN_2
   <chr> <chr>     <int>         <int>              <int> <chr>            <int>         <int>           <dbl>
 1 A     A          2015          2015                  0 YES                100          2015             100
 2 A     A          2015          2016                  0 YES                100          2015             100
 3 A     A          2016          2016                  1 YES                  0          2016               0
 4 A     A          2016          2017                  1 YES                  0          2016               0
 5 A     A          2017          2017                  2 NO                   1          2016               1
 6 A     A          2017          2018                  2 NO                   1          2016               1
 7 A     A          2018          2018                  3 NO                   2          2016               2
 8 A     A          2018          2019                  3 NO                   2          2016               2
 9 A     A          2019          2019                  4 YES                  0          2019               0
10 A     A          2019          2020                  4 YES                  0          2019               0
# … with 18 more rows
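
The fill step is what turns the per-row accident marker into a "most recent accident" value. A minimal sketch on a hypothetical five-row policy may make that clearer; the `toy` data and its values are made up for illustration and are not part of the question's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical single policy with one accident in registration year 2015
toy <- tibble(
  ID = "C",
  REG_YEAR = 2013:2017,
  ACCIDENT = c("NO", "NO", "YES", "NO", "NO")
)

toy_filled <- toy %>%
  group_by(ID) %>%
  # Accident rows get their own year, all other rows get NA
  mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
  # Each NA is replaced by the closest non-NA value above it, within the group
  fill(LAST_ACCIDENT, .direction = "down") %>%
  ungroup()

toy_filled$LAST_ACCIDENT
# NA NA 2015 2015 2015
```

Rows before the first accident stay NA, which is why the final ifelse checks is.na(LAST_ACCIDENT) to assign 100.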

Note: If you want to go by the renewal count (NUMBER_OF_RENEWALS) rather than the year, you can do something similar. Instead of adding a column with the last accident year, store the value of NUMBER_OF_RENEWALS at the last accident. The output column then reflects the difference in renewal counts rather than years, which handles the possibility that one or more years are skipped.

# Requires dplyr and tidyr to be loaded
df %>%
  separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
  group_by(ID) %>%
  # Record the renewal count on accident rows; NA elsewhere
  mutate(LAST_ACCIDENT_POLICY = ifelse(ACCIDENT == "YES", NUMBER_OF_RENEWALS, NA_integer_)) %>%
  # Carry the most recent accident's renewal count down within each ID
  fill(LAST_ACCIDENT_POLICY, .direction = "down") %>%
  mutate(OUTPUT_COLUMN_2 = ifelse(
    is.na(LAST_ACCIDENT_POLICY) | NUMBER_OF_RENEWALS == 0, 100, NUMBER_OF_RENEWALS - LAST_ACCIDENT_POLICY
  ))
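
To illustrate when the two variants differ, here is a small sketch with a hypothetical policy "D" whose registration year 2017 is skipped. The data, the policy ID, and the column names BY_YEAR and BY_RENEWAL are invented for this comparison only.

```r
library(dplyr)
library(tidyr)

# Hypothetical policy: registered in 2015, 2016 and 2018 (2017 is missing)
gap <- tibble(
  ID = "D",
  REG_YEAR = c(2015L, 2016L, 2018L),
  NUMBER_OF_RENEWALS = 0:2,
  ACCIDENT = c("NO", "YES", "NO")
)

gap_result <- gap %>%
  group_by(ID) %>%
  mutate(
    LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_),
    LAST_ACCIDENT_POLICY = ifelse(ACCIDENT == "YES", NUMBER_OF_RENEWALS, NA_integer_)
  ) %>%
  fill(LAST_ACCIDENT, LAST_ACCIDENT_POLICY, .direction = "down") %>%
  mutate(
    # Year-based count: difference in registration years
    BY_YEAR = ifelse(is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0,
                     100, REG_YEAR - LAST_ACCIDENT),
    # Renewal-based count: difference in renewal numbers
    BY_RENEWAL = ifelse(is.na(LAST_ACCIDENT_POLICY) | NUMBER_OF_RENEWALS == 0,
                        100, NUMBER_OF_RENEWALS - LAST_ACCIDENT_POLICY)
  ) %>%
  ungroup()

gap_result$BY_YEAR     # 100 0 2
gap_result$BY_RENEWAL  # 100 0 1
```

On the last row the year-based version reports 2 because two calendar years have passed, while the renewal-based version reports 1 because only one renewal has happened since the accident. On the question's data, where no years are skipped, both give the same result.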