How do I get R to search for duplicate values in a column

Time:10-02

I have a dataset with a number of NAs. They occur in predictable patterns that represent a between-subjects manipulation. Example:

Outcome   New Variable Column
NA
NA
NA
NA
NA
0
1
NA
NA
NA
NA
NA
1
2
NA
NA
NA
NA
NA

I want the New Variable Column to flag each run of NA, NA, NA, NA, NA. How do I tell R to search for a run of 5 consecutive NAs and then write a new name (let's call it 5X) for that run in a different column? It doesn't matter to me whether 5X appears only once per run of 5 NAs or on every row of the run.

CodePudding user response:

I think you want run-length encoding; see the rle() function. Here is an example. I'm not sure I completely follow the output you want, but either way rle() lets you find runs of 5 NAs (or any other number of NAs) in a row.

d <- data.frame(
  variable = c(NA, NA, NA, NA, NA, 0, 1, NA, NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA)
)

x <- rle(is.na(d$variable))
x
#> Run Length Encoding
#>   lengths: int [1:5] 5 2 5 2 5
#>   values : logi [1:5] TRUE FALSE TRUE FALSE TRUE

# Label every row: "Infrequent" for NA runs of exactly 5, "Frequent" otherwise.
# rep() with a vector of counts expands one label per run back to one per row.
d$new_column <- rep(
  ifelse(x$values & x$lengths == 5, "Infrequent", "Frequent"),
  x$lengths
)

d
#>    variable new_column
#> 1        NA Infrequent
#> 2        NA Infrequent
#> 3        NA Infrequent
#> 4        NA Infrequent
#> 5        NA Infrequent
#> 6         0   Frequent
#> 7         1   Frequent
#> 8        NA Infrequent
#> 9        NA Infrequent
#> 10       NA Infrequent
#> 11       NA Infrequent
#> 12       NA Infrequent
#> 13        1   Frequent
#> 14        2   Frequent
#> 15       NA Infrequent
#> 16       NA Infrequent
#> 17       NA Infrequent
#> 18       NA Infrequent
#> 19       NA Infrequent
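
The same idea can be wrapped in a small function so the run length and the label are parameters. This is only a sketch: the name label_na_runs and its arguments n and label are made up for illustration and are not part of the answer above. It writes the label (here 5X, as in the question) on every row of a matching run and leaves all other rows as NA.

```r
# Hypothetical helper (not from the original answer): label every element
# that belongs to a run of exactly `n` consecutive NAs.
label_na_runs <- function(x, n = 5, label = "5X") {
  runs <- rle(is.na(x))                     # runs of NA / non-NA
  hit  <- runs$values & runs$lengths == n   # TRUE for NA runs of length n
  # One entry per run, expanded back to one entry per element
  rep(ifelse(hit, label, NA_character_), runs$lengths)
}

outcome <- c(NA, NA, NA, NA, NA, 0, 1, NA, NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA)
labels  <- label_na_runs(outcome)
```

Note that rle() is applied to is.na(x) rather than to x itself, because base rle() treats each NA as its own run.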

CodePudding user response:

Here is an alternative approach using data.table::rleid():

library(data.table)

# Group rows by run id, then flag groups that are NA runs of 5 or more rows.
setDT(d)[,
  nc := fifelse(.N >= 5 & is.na(variable[1]), "infreq", "freq"),
  by = rleid(variable)
]

Output:

    variable     nc
       <num> <char>
 1:       NA infreq
 2:       NA infreq
 3:       NA infreq
 4:       NA infreq
 5:       NA infreq
 6:        0   freq
 7:        1   freq
 8:       NA infreq
 9:       NA infreq
10:       NA infreq
11:       NA infreq
12:       NA infreq
13:        1   freq
14:        2   freq
15:       NA infreq
16:       NA infreq
17:       NA infreq
18:       NA infreq
19:       NA infreq
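
Both answers label every row of a run. The question also allows the label to appear only once per run, and that variant could look roughly like the base R sketch below. The function name first_of_na_runs and its arguments are assumptions for illustration; it marks only the first row of each run of exactly n NAs.

```r
# Hypothetical helper: put `label` on the first element of each run of
# exactly `n` consecutive NAs, and NA everywhere else.
first_of_na_runs <- function(x, n = 5, label = "5X") {
  runs   <- rle(is.na(x))
  ends   <- cumsum(runs$lengths)        # last index of each run
  starts <- ends - runs$lengths + 1     # first index of each run
  out    <- rep(NA_character_, length(x))
  out[starts[runs$values & runs$lengths == n]] <- label
  out
}

outcome <- c(NA, NA, NA, NA, NA, 0, 1, NA, NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA)
once    <- first_of_na_runs(outcome)
which(!is.na(once))  # row numbers where a run of 5 NAs starts
```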