Determining whether a value has been consecutively duplicated n times

I am currently working with data structured like the following:

ID	Year1	Year2	Year3	Year4	Year5	Year6
1	2002	2002	2002	2002	2002	NA
2	2002	NA	1001	1001	9999	9999
3	2002	NA	1001	1001	1001	NA
4	3003	3003	3003	NA	NA	NA
5	1001	1001	1001	1001	NA	NA
6	9999	9999	9999	1001	9999	9999

df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Year1 = c(2002, 2002, 2002, 3003,  1001, 9999),
                 Year2 = c(2002,   NA,   NA, 3003,  1001, 9999),
                 Year3 = c(2002, 1001, 1001, 3003,  1001, 9999),
                 Year4 = c(2002, 1001, 1001,   NA,  1001, 1001),
                 Year5 = c(2002, 9999, 1001,   NA,    NA, 9999),
                 Year6 = c(NA,   9999,   NA,   NA,    NA, 9999))

My goal is to construct an additional variable, duplicated_from_end, which indicates whether, for each observation, the most recent non-missing value (i.e., the non-missing value in the highest year) has been duplicated at least n times consecutively (say, n = 3 in this example). For instance, for the observation with ID = 1, the most recent non-missing value is 2002 in Year5, which (going backwards in years) is consecutively duplicated in Year4 and Year3 as well, so I would want duplicated_from_end = TRUE. The overall desired behavior is shown below:

ID	Year1	Year2	Year3	Year4	Year5	Year6	duplicated_from_end
1	2002	2002	2002	2002	2002	NA	TRUE
2	2002	NA	1001	1001	9999	9999	FALSE
3	2002	NA	1001	1001	1001	NA	TRUE
4	3003	3003	3003	NA	NA	NA	TRUE
5	1001	1001	1001	1001	NA	NA	TRUE
6	9999	9999	9999	1001	9999	9999	FALSE

I have attempted solutions with rle(), which can count consecutive duplicates, but I am unable to work out how to count only the duplicates that begin at the latest non-missing value. Any suggestions would be greatly appreciated.

CodePudding user response：

Consider using rle() row by row. Loop over the rows with apply (MARGIN = 1), apply run-length encoding (rle) to each row (excluding the first column, ID), and extract the 'values' and the corresponding 'lengths' from the list output. Then subset the lengths to the runs whose value is not NA (!is.na) and check whether the last of them (tail) is greater than or equal to 3.

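df$duplicated_from_end <- apply(df[-1], 1, function(x) {
    rl <- rle(x)           # run-length encode the row; each NA forms its own run
    v1 <- rl$values        # value of each run
    l1 <- rl$lengths       # length of each run
    i1 <- !is.na(v1)       # keep only runs of non-missing values
    tail(l1[i1], 1) >= 3   # is the last non-missing run at least 3 long?
})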

Output:

df$duplicated_from_end
[1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE
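
To see what the rle() step does on a single row, and to make n adjustable, here is a minimal sketch. The helper name has_trailing_run and the choice to return FALSE for a row that is entirely NA are my assumptions, not part of the answer above.

```r
# Hypothetical helper (not from the original answer): TRUE when the last
# run of non-missing values in x has length >= n.
has_trailing_run <- function(x, n = 3) {
  rl <- rle(x)                     # rle() puts every NA in its own run
  keep <- !is.na(rl$values)        # runs of non-missing values
  if (!any(keep)) return(FALSE)    # assumption: an all-NA row counts as FALSE
  tail(rl$lengths[keep], 1) >= n
}

x <- c(2002, NA, 1001, 1001, 9999, 9999)   # the row with ID = 2
rle(x)$values                # 2002   NA 1001 9999
rle(x)$lengths               # 1 1 2 2
has_trailing_run(x)          # FALSE: the last run (9999) has length 2
has_trailing_run(x, n = 2)   # TRUE

# Row-wise use on a small matrix; the second row is all NA
m <- rbind(c(2002, 2002, 2002, 2002, 2002, NA),
           c(NA, NA, NA, NA, NA, NA))
apply(m, 1, has_trailing_run)   # TRUE FALSE
```

Because every NA becomes its own run, a missing value in the middle of a row breaks a run: c(1001, NA, 1001, 1001) ends in a run of length 2, not 3.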

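Or, slightly more compactly, reverse the order of the Year columns and use rleid from data.table:

library(data.table)
# select the Year columns by name so that a duplicated_from_end column
# added earlier is not picked up
apply(df[paste0("Year", 6:1)], 1, \(x) table(rleid(x)[!is.na(x)])[1] >= 3)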
[1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE
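
The same "start from the end" idea can also be sketched in base R without data.table, by returning the length of the trailing run so that any n can be compared against it afterwards. This is a different technique from the rleid one-liner, and the helper name trailing_run_length is hypothetical.

```r
# Hypothetical helper: how many times the last non-missing value of x is
# repeated consecutively, scanning backwards. An NA ends the run.
trailing_run_length <- function(x) {
  idx <- which(!is.na(x))
  if (length(idx) == 0) return(0L)      # assumption: all-NA rows give 0
  last <- idx[length(idx)]              # position of the last non-missing value
  rev_x <- x[last:1]                    # reversed, last non-missing value first
  same <- !is.na(rev_x) & rev_x == rev_x[1]
  # number of leading TRUEs in `same`
  if (all(same)) length(same) else which(!same)[1] - 1L
}

trailing_run_length(c(2002, 2002, 2002, 2002, 2002, NA))     # 5 (ID = 1)
trailing_run_length(c(2002, NA, 1001, 1001, 1001, NA))       # 3 (ID = 3)
trailing_run_length(c(9999, 9999, 9999, 1001, 9999, 9999))   # 2 (ID = 6)
trailing_run_length(c(2002, NA, 1001, 1001, 1001, NA)) >= 3  # TRUE
```

Applied row-wise with apply(..., 1, trailing_run_length) and compared with n, this should give the same TRUE/FALSE pattern as the outputs above for the example data.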