I am currently working with data which has a structure like the following
ID | Var1 | Var2 | Var3 | Var4 | Var5 | Var6 |
---|---|---|---|---|---|---|
1 | 1001 | 1001 | 1001 | 2002 | 2002 | 2002 |
2 | 2002 | 1001 | 1001 | 1001 | NA | 9999 |
3 | 1001 | NA | 1001 | 1001 | 9999 | 1234 |
4 | 3003 | 3003 | 3003 | NA | 2002 | 2002 |
5 | NA | NA | NA | 1001 | 1001 | 1001 |
6 | 3003 | 3003 | 3003 | 3003 | 3003 | 3003 |
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
Var1 = c(1001, 2002, 1001, 3003, NA, 3003),
Var2 = c(1001, 1001, NA, 3003, NA, 3003),
Var3 = c(1001, 1001, 1001, 3003, NA, 3003),
Var4 = c(2002, 1001, 1001, NA, 1001, 3003),
Var5 = c(2002, NA, 9999, 2002, 1001, 3003),
Var6 = c(2002, 9999, 1234, 2002, 1001, 3003))
My goal is to count—for each row—how many times a unique value has been duplicated consecutively at least n times. Let's say n = 3. So, if "1001" repeats rowwise in at least 3 adjacent cells (could be more), then that counts as one duplicate. If both "1001" and "2002" have been repeated in a row at least three times each consecutively, then that would be two duplicates, etc. The following would be the intended result:
ID | Var1 | Var2 | Var3 | Var4 | Var5 | Var6 | Num_3ConsecutiveDuplications |
---|---|---|---|---|---|---|---|
1 | 1001 | 1001 | 1001 | 2002 | 2002 | 2002 | 2 |
2 | 2002 | 1001 | 1001 | 1001 | NA | 9999 | 1 |
3 | 1001 | NA | 1001 | 1001 | 9999 | 1234 | 0 |
4 | 3003 | 3003 | 3003 | NA | 2002 | 2002 | 1 |
5 | NA | NA | NA | 1001 | 1001 | 1001 | 1 |
6 | 3003 | 3003 | 3003 | 3003 | 3003 | 3003 | 1 |
Since in the actual data there are a significant range of possible values each variable can take on, and that there are over 40 variables, counting this by conditioning on each possible value and range of columns seems infeasible/inefficient. Any suggestions would be greatly appreciated.
CodePudding user response:
You can use rle()
:
library(dplyr)
df %>%
rowwise() %>%
mutate(x = sum(rle(across(-ID))$lengths >= 3))
# A tibble: 6 x 8
# Rowwise:
ID Var1 Var2 Var3 Var4 Var5 Var6 x
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 1001 1001 1001 2002 2002 2002 2
2 2 2002 1001 1001 1001 NA 9999 1
3 3 1001 NA 1001 1001 9999 1234 0
4 4 3003 3003 3003 NA 2002 2002 1
5 5 NA NA NA 1001 1001 1001 1
6 6 3003 3003 3003 3003 3003 3003 1