Home > Software engineering >  Counting consecutive rowwise duplicates
Counting consecutive rowwise duplicates

Time:04-13

I am currently working with data which has a structure like the following

ID Var1 Var2 Var3 Var4 Var5 Var6
1 1001 1001 1001 2002 2002 2002
2 2002 1001 1001 1001 NA 9999
3 1001 NA 1001 1001 9999 1234
4 3003 3003 3003 NA 2002 2002
5 NA NA NA 1001 1001 1001
6 3003 3003 3003 3003 3003 3003
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Var1 = c(1001, 2002, 1001, 3003,   NA, 3003),
                 Var2 = c(1001, 1001,   NA, 3003,   NA, 3003),
                 Var3 = c(1001, 1001, 1001, 3003,   NA, 3003),
                 Var4 = c(2002, 1001, 1001,   NA, 1001, 3003),
                 Var5 = c(2002,   NA, 9999, 2002, 1001, 3003),
                 Var6 = c(2002, 9999, 1234, 2002, 1001, 3003))

My goal is to count—for each row—how many times a unique value has been duplicated consecutively at least n times. Let's say n = 3. So, if "1001" repeats rowwise in at least 3 adjacent cells (could be more), then that counts as one duplicate. If both "1001" and "2002" have been repeated in a row at least three times each consecutively, then that would be two duplicates, etc. The following would be the intended result:

ID Var1 Var2 Var3 Var4 Var5 Var6 Num_3ConsecutiveDuplications
1 1001 1001 1001 2002 2002 2002 2
2 2002 1001 1001 1001 NA 9999 1
3 1001 NA 1001 1001 9999 1234 0
4 3003 3003 3003 NA 2002 2002 1
5 NA NA NA 1001 1001 1001 1
6 3003 3003 3003 3003 3003 3003 1

Since in the actual data there are a significant range of possible values each variable can take on, and that there are over 40 variables, counting this by conditioning on each possible value and range of columns seems infeasible/inefficient. Any suggestions would be greatly appreciated.

CodePudding user response:

You can use rle():

library(dplyr)

df %>%
  rowwise() %>%
  mutate(x = sum(rle(across(-ID))$lengths >= 3))

# A tibble: 6 x 8
# Rowwise: 
     ID  Var1  Var2  Var3  Var4  Var5  Var6     x
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1     1  1001  1001  1001  2002  2002  2002     2
2     2  2002  1001  1001  1001    NA  9999     1
3     3  1001    NA  1001  1001  9999  1234     0
4     4  3003  3003  3003    NA  2002  2002     1
5     5    NA    NA    NA  1001  1001  1001     1
6     6  3003  3003  3003  3003  3003  3003     1
  • Related