How to filter out data with conditional statement for series of numbers in R?

Data

Here is the data for my example:

#### Create Data ####
library(dplyr)  # provides %>%, as_tibble(), mutate() and across()
df <- data.frame(X1 = c(NA,1,1,1,0), 
                 X2 = c(1,1,1,0,0),
                 X3 = c(1,1,NA,0,0),
                 X4 = c(1,1,1,1,NA),
                 X5 = c(1,1,1,0,NA),
                 X6 = c(1,NA,1,1,NA)) %>% 
  as_tibble()

Problem

When you print the data, it looks like this:

# A tibble: 5 × 6
     X1    X2    X3    X4    X5    X6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    NA     1     1     1     1     1
2     1     1     1     1     1    NA
3     1     1    NA     1     1     1
4     1     0     0     1     0     1
5     0     0     0    NA    NA    NA

Rows 1-4 contain sporadic, random missingness. Row 5 is different: after three zeroes in a row, a stopping rule for multiple "wrong" answers kicked in and the remaining items were recorded as NA. In theory I could blindly replace every NA with the following code:

df %>% 
  mutate(across(everything(),
                ~ replace(.,
                          is.na(.),
                          0)))

And every NA would be replaced with 0:

# A tibble: 5 × 6
     X1    X2    X3    X4    X5    X6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     0     1     1     1     1     1
2     1     1     1     1     1     0
3     1     1     0     1     1     1
4     1     0     0     1     0     1
5     0     0     0     0     0     0

However, this does not faithfully address the problem. The sporadic NAs are genuinely missing, whereas the NAs produced by the stopping rule are not. So I need a way to replace NAs conditionally, only in cases where three 0s are recorded in a row, but I'm struggling to figure out how to do this.

CodePudding user response：

Using is.na, we can turn each row into a string of 0s and 1s with paste0 (1 marks an NA) and use stringi::stri_count to check whether the number of matches with 111, i.e. three NAs in a row, is greater than zero. That gives a flag per row. After that, replace NAs with zeros in the flagged rows.

num_NA <- 3
# The unary + converts the logical matrix to 0/1, so each row pastes to e.g. "000111"
flag <- apply(+(is.na(df)), 1, paste0, collapse='') |>
  stringi::stri_count(regex=paste(rep(1, num_NA), collapse='')) |> base::`>`(0)

df[flag, ] <- lapply(df[flag, ], \(x) replace(x, is.na(x), 0))
df
#   X1 X2 X3 X4 X5 X6
# 1 NA  1  1  1  1  1
# 2  1  1  1  1  1 NA
# 3  1  1 NA  1  1  1
# 4  1  0  0  1  0  1
# 5  0  0  0  0  0  0

Data:

df <- structure(list(X1 = c(NA, 1, 1, 1, 0), X2 = c(1, 1, 1, 0, 0), 
    X3 = c(1, 1, NA, 0, 0), X4 = c(1, 1, 1, 1, NA), X5 = c(1, 
    1, 1, 0, NA), X6 = c(1, NA, 1, 1, NA)), class = "data.frame", row.names = c(NA, 
-5L))
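
For comparison, here is a minimal base R sketch of the same flagging idea that uses rle on is.na instead of pasting strings, so it does not need stringi. The helper name has_na_run is hypothetical (it is not part of the answer above), and the sketch assumes, like the answer, that a run of at least num_NA consecutive NAs marks a stopping-rule row.

```r
# Example data copied from the question.
df <- data.frame(X1 = c(NA, 1, 1, 1, 0),
                 X2 = c(1, 1, 1, 0, 0),
                 X3 = c(1, 1, NA, 0, 0),
                 X4 = c(1, 1, 1, 1, NA),
                 X5 = c(1, 1, 1, 0, NA),
                 X6 = c(1, NA, 1, 1, NA))

num_NA <- 3

# has_na_run() is a hypothetical helper: TRUE when x contains a run of
# at least n consecutive NAs.
has_na_run <- function(x, n) {
  runs <- rle(is.na(x))
  any(runs$values & runs$lengths >= n)
}

# One logical flag per row.
flag <- apply(df, 1, has_na_run, n = num_NA)

# is.na(df) is a 5 x 6 logical matrix; flag (length 5) recycles down each
# column, so element [i, j] is combined with flag[i].
fix <- is.na(df) & flag
df[fix] <- 0
df
```

With the example data only row 5 should be flagged, so the isolated NAs in rows 1-3 are left alone.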

CodePudding user response：

This is sort of a non-answer, but too big for a comment. Doubling df:

df2 <- rbind(df, df)
df2
   X1 X2 X3 X4 X5 X6
1  NA  1  1  1  1  1
2   1  1  1  1  1 NA
3   1  1 NA  1  1  1
4   1  0  0  1  0  1
5   0  0  0 NA NA NA
6  NA  1  1  1  1  1
7   1  1  1  1  1 NA
8   1  1 NA  1  1  1
9   1  0  0  1  0  1
10  0  0  0 NA NA NA

# fiddle with it
df2[3,] <- c(0,NA,0,NA,0,NA)

suspects <- which(rowSums(df2, na.rm = TRUE) == 0)
suspects
[1]  3  5 10

3 %in% rle(df2[suspects[3], ])$lengths
[1] TRUE
3 %in% rle(df2[suspects[1], ])$lengths
[1] FALSE

However, since this is about 'faithfulness' in scoring the consequences of a run of answers, the row-sum check above should only be used to identify candidate rows; rle can then confirm which of them actually contain the 3 zeros in a row.
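
One way the rle step could be carried through, as a rough sketch rather than a definitive solution: the hypothetical helper fill_after_stop below looks for the first run of at least three observed zeros in a row and sets only the NAs that come after that run to 0. It assumes the stopping rule triggers after three consecutive 0s and that every NA following that run was produced by the rule; both the helper name and the n_zero parameter are illustrative.

```r
# Example data copied from the question, doubled and modified as above.
df <- data.frame(X1 = c(NA, 1, 1, 1, 0),
                 X2 = c(1, 1, 1, 0, 0),
                 X3 = c(1, 1, NA, 0, 0),
                 X4 = c(1, 1, 1, 1, NA),
                 X5 = c(1, 1, 1, 0, NA),
                 X6 = c(1, NA, 1, 1, NA))
df2 <- rbind(df, df)
df2[3, ] <- c(0, NA, 0, NA, 0, NA)

# Hypothetical helper: replace NAs with 0 only after the first run of
# n_zero (or more) consecutive observed zeros in x.
fill_after_stop <- function(x, n_zero = 3) {
  is_zero <- !is.na(x) & x == 0          # NA counts as "not zero"
  runs <- rle(is_zero)
  run_end <- cumsum(runs$lengths)        # last position of each run
  hit <- which(runs$values & runs$lengths >= n_zero)
  if (length(hit) == 0) return(x)        # no stopping rule in this row
  after <- seq_along(x) > run_end[hit[1]]
  x[after & is.na(x)] <- 0
  x
}

# apply() returns one column per row of df2, so transpose before assigning.
df2[] <- t(apply(df2, 1, fill_after_stop))
df2
```

Under these assumptions rows 5 and 10 are filled with zeros, while row 3 (alternating 0 and NA) and the rows with isolated NAs stay as they are, because none of them contains three observed zeros in a row.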