I am currently working with data structured like the following:
ID | Year1 | Year2 | Year3 | Year4 | Year5 | Year6 |
---|---|---|---|---|---|---|
1 | 2002 | 2002 | 2002 | 2002 | 2002 | NA |
2 | 2002 | NA | 1001 | 1001 | 9999 | 9999 |
3 | 2002 | NA | 1001 | 1001 | 1001 | NA |
4 | 3003 | 3003 | 3003 | NA | NA | NA |
5 | 1001 | 1001 | 1001 | 1001 | NA | NA |
6 | 9999 | 9999 | 9999 | 1001 | 9999 | 9999 |
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Year1 = c(2002, 2002, 2002, 3003, 1001, 9999),
                 Year2 = c(2002, NA, NA, 3003, 1001, 9999),
                 Year3 = c(2002, 1001, 1001, 3003, 1001, 9999),
                 Year4 = c(2002, 1001, 1001, NA, 1001, 1001),
                 Year5 = c(2002, 9999, 1001, NA, NA, 9999),
                 Year6 = c(NA, 9999, NA, NA, NA, 9999))
My goal is to construct an additional variable duplicated_from_end which indicates whether, for each observation, the most recent non-missing value (i.e., the non-missing value in the highest year) is repeated at least n times consecutively (say, n = 3 in this example). For instance, for the observation with ID = 1, the most recent non-missing value is 2002 in Year5, which (going backwards in years) is consecutively duplicated in Year4 and Year3 as well, and hence I would want duplicated_from_end = TRUE. The overall desired behavior is described below:
ID | Year1 | Year2 | Year3 | Year4 | Year5 | Year6 | duplicated_from_end |
---|---|---|---|---|---|---|---|
1 | 2002 | 2002 | 2002 | 2002 | 2002 | NA | TRUE |
2 | 2002 | NA | 1001 | 1001 | 9999 | 9999 | FALSE |
3 | 2002 | NA | 1001 | 1001 | 1001 | NA | TRUE |
4 | 3003 | 3003 | 3003 | NA | NA | NA | TRUE |
5 | 1001 | 1001 | 1001 | 1001 | NA | NA | TRUE |
6 | 9999 | 9999 | 9999 | 1001 | 9999 | 9999 | FALSE |
I have attempted solutions with rle(), which can count consecutive duplicates, but I am unable to determine how to count only the duplicates that begin at the latest non-missing value. Any suggestions would be greatly appreciated.
Answer:
Consider using rle by row: loop over the rows with apply (MARGIN = 1), apply run-length encoding (rle) to the data (without the first column), and extract the 'values' and corresponding 'lengths' from the list output. Keep the lengths of the runs whose values are not NA (!is.na), and check whether the last of those lengths (tail) is greater than or equal to 3:
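df$duplicated_from_end <- apply(df[-1], 1, function(x) {
  rl <- rle(x)            # run-length encode the row
  v1 <- rl$values         # value of each run
  l1 <- rl$lengths        # length of each run
  i1 <- !is.na(v1)        # runs of non-missing values
  tail(l1[i1], 1) >= 3    # is the last non-missing run at least 3 long?
})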
Output:
df$duplicated_from_end
[1] TRUE FALSE TRUE TRUE TRUE FALSE
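To make the role of the NA filter concrete, here is a minimal sketch that wraps the same rle logic in a helper with a configurable n. The name last_run_at_least and the choice to return NA for a row with no non-missing values are assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): TRUE when the last run
# of non-missing values in x has length >= n.
last_run_at_least <- function(x, n = 3) {
  rl <- rle(unname(x))
  keep <- !is.na(rl$values)      # rle() places every NA in its own run of length 1
  if (!any(keep)) return(NA)     # assumption: an all-missing row yields NA
  tail(rl$lengths[keep], 1) >= n
}

# Row with ID = 3 from the question
x <- c(2002, NA, 1001, 1001, 1001, NA)
rle(x)$lengths                   # 1 1 3 1
last_run_at_least(x)             # TRUE
last_run_at_least(x, n = 4)      # FALSE
last_run_at_least(c(NA, NA))     # NA
```

With this helper, the column could be built as apply(df[-1], 1, last_run_at_least, n = 3), assuming df still holds only ID and the Year columns.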
Or, slightly more compactly, reverse the order of the columns and use rleid from data.table:
library(data.table)
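# select only the Year columns, in reverse order, so that any column added to df later is not included
apply(df[paste0("Year", 6:1)], 1, \(x) table(rleid(x)[!is.na(x)])[1] >= 3)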
[1] TRUE FALSE TRUE TRUE TRUE FALSE
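The same "count backwards from the end" idea can also be sketched in base R without data.table. This is an illustrative alternative, not the answerer's code; the helper name count_from_end is an assumption, and it returns the length of the trailing run so that any threshold n can be applied afterwards.

```r
# Hypothetical helper: length of the run of identical values that ends at the
# last non-missing entry of x (0 if x has no non-missing values).
count_from_end <- function(x) {
  x <- rev(unname(x))
  x <- x[cumsum(!is.na(x)) > 0]    # drop the missing values after the last observed year
  if (length(x) == 0) return(0)
  same <- !is.na(x) & x == x[1]    # an NA or a different value ends the run
  sum(cumprod(same))               # number of leading TRUEs
}

df <- data.frame(ID = 1:6,
                 Year1 = c(2002, 2002, 2002, 3003, 1001, 9999),
                 Year2 = c(2002, NA, NA, 3003, 1001, 9999),
                 Year3 = c(2002, 1001, 1001, 3003, 1001, 9999),
                 Year4 = c(2002, 1001, 1001, NA, 1001, 1001),
                 Year5 = c(2002, 9999, 1001, NA, NA, 9999),
                 Year6 = c(NA, 9999, NA, NA, NA, 9999))

year_cols <- paste0("Year", 1:6)
run_len <- unname(apply(df[year_cols], 1, count_from_end))
run_len         # 5 2 3 3 4 2
run_len >= 3    # TRUE FALSE TRUE TRUE TRUE FALSE
```

Keeping the run length separate from the comparison makes it easy to try a different n without recomputing anything.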