Determining whether a value has been consecutively duplicated n times

Time:04-14

I am currently working with data structured like the following:

ID Year1 Year2 Year3 Year4 Year5 Year6
1 2002 2002 2002 2002 2002 NA
2 2002 NA 1001 1001 9999 9999
3 2002 NA 1001 1001 1001 NA
4 3003 3003 3003 NA NA NA
5 1001 1001 1001 1001 NA NA
6 9999 9999 9999 1001 9999 9999

The data can be reproduced with:
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Year1 = c(2002, 2002, 2002, 3003,  1001, 9999),
                 Year2 = c(2002,   NA,   NA, 3003,  1001, 9999),
                 Year3 = c(2002, 1001, 1001, 3003,  1001, 9999),
                 Year4 = c(2002, 1001, 1001,   NA,  1001, 1001),
                 Year5 = c(2002, 9999, 1001,   NA,    NA, 9999),
                 Year6 = c(NA,   9999,   NA,   NA,    NA, 9999))

My goal is to construct an additional variable, duplicated_from_end, which indicates whether, for each observation, the most recent non-missing value (i.e., the non-missing value in the highest year) is repeated in at least n consecutive years (say, n = 3 in this example), counting backwards from that year. For instance, for the observation with ID = 1, the most recent non-missing value is 2002 in Year5, which (going backwards in years) is also present in Year4 and Year3, and hence I would want duplicated_from_end = TRUE. The overall desired behavior is shown below:

ID Year1 Year2 Year3 Year4 Year5 Year6 duplicated_from_end
1 2002 2002 2002 2002 2002 NA TRUE
2 2002 NA 1001 1001 9999 9999 FALSE
3 2002 NA 1001 1001 1001 NA TRUE
4 3003 3003 3003 NA NA NA TRUE
5 1001 1001 1001 1001 NA NA TRUE
6 9999 9999 9999 1001 9999 9999 FALSE

I have attempted solutions with rle(), which can count consecutive duplicates, but I am unable to work out how to count only the duplicates that begin at the latest non-missing value. Any suggestions would be greatly appreciated.

Answer:

Consider using rle by row: loop over the rows with apply (MARGIN = 1), apply run-length encoding (rle) to the data (without the first column), and extract the 'values' and the corresponding 'lengths' from the list output. Because rle treats every NA as its own run, subset the runs whose value is not NA (!is.na), and check whether the length of the last remaining run (tail) is greater than or equal to 3.

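df$duplicated_from_end <- apply(df[-1], 1, function(x) {
    rl <- rle(x)             # run-length encode the row (each NA is its own run)
    v1 <- rl$values          # value of each run
    l1 <- rl$lengths         # length of each run
    i1 <- !is.na(v1)         # keep only the runs of non-missing values
    tail(l1[i1], 1) >= 3     # is the last non-missing run at least 3 long?
})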

-output

df$duplicated_from_end
[1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE
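
To see why the rle approach above picks out the right run, it may help to inspect the intermediate values for a single row. The following is only an illustrative sketch using the Year1 to Year6 values for ID = 2 from the question; the names x, rl, keep, and last_run are made up for this example.

```r
# Year1..Year6 for ID = 2, copied from the question's data
x <- c(2002, NA, 1001, 1001, 9999, 9999)

rl <- rle(x)
rl$values    # 2002   NA 1001 9999
rl$lengths   # 1 1 2 2

# Drop the NA runs, then take the length of the last remaining run
keep <- !is.na(rl$values)
last_run <- tail(rl$lengths[keep], 1)
last_run        # 2
last_run >= 3   # FALSE
```

The most recent non-missing value (9999) only spans two consecutive years, so this row yields FALSE, matching the desired output.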

Or, slightly more compactly, reverse the order of the columns and use rleid from data.table:

library(data.table)
# Assumes df still holds only ID and Year1..Year6 (duplicated_from_end not yet added)
apply(df[ncol(df):2], 1, \(x) table(rleid(x)[!is.na(x)])[1] >= 3)
[1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE
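
Since the question asks for a general n, the rle logic could be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name dup_from_end, the choice to return NA for a row with no non-missing values, and selecting the columns by the "Year" name prefix are all assumptions made for illustration.

```r
# Hypothetical helper: TRUE if the most recent non-missing value in x
# occupies at least n consecutive positions, counting back from the end.
dup_from_end <- function(x, n = 3) {
  rl <- rle(x)
  keep <- !is.na(rl$values)
  if (!any(keep)) return(NA)   # assumption: a row with no data yields NA
  tail(rl$lengths[keep], 1) >= n
}

# Data from the question
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Year1 = c(2002, 2002, 2002, 3003, 1001, 9999),
                 Year2 = c(2002,   NA,   NA, 3003, 1001, 9999),
                 Year3 = c(2002, 1001, 1001, 3003, 1001, 9999),
                 Year4 = c(2002, 1001, 1001,   NA, 1001, 1001),
                 Year5 = c(2002, 9999, 1001,   NA,   NA, 9999),
                 Year6 = c(NA,   9999,   NA,   NA,   NA, 9999))

# Select the Year columns by name so that re-running the code after
# duplicated_from_end has been added does not pick up the new column
year_cols <- grep("^Year", names(df))
df$duplicated_from_end <- apply(df[year_cols], 1, dup_from_end, n = 3)
df$duplicated_from_end
# [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE
```

Selecting columns by name rather than by position (df[-1]) is a design choice that keeps the call safe to repeat with a different n.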