R - consolidate missing data in multiple columns-CodePudding

Say that you have a data frame named df with 25 columns, first of them being named ID, second to thirteenth having names from A1 to A12 and fourteenth to twenty-fifth having names from B1 to B12. Values in the A and B variables can be missing data.

The task that I am facing is to consolidate the missingness in the data - if there is a missing entry in, say, 8th row of A4, then the 8th row of B4 also needs to be updated to NA, even if it has some data. This also works vice versa, if there is a missing entry in, say, 19th row of B11, then the 19th row of A11 also needs to be missing.

I can do this with two for loops:

for(i in 2:13){
  for(j in 1:nrow(df)){
    if(is.na(df[j,i 12])){
      df[j,i] <- NA
    }
  }
}

for(i in 14:25){
  for(j in 1:nrow(df)){
    if(is.na(df[j,i-12])){
      df[j,i] <- NA
    }
  }
}

However, I am looking for a solution that doesn't include for loops and is preferrably in tidyverse. How could this be done more efficiently?

CodePudding user response：

How about this?

#create dataset
library(tidyverse)
library(missForest)
df <- data.frame(id = c(1:10))
df[paste0("a", 1:10)] <- lapply(1:10, function(x) rnorm(10, x))
df[paste0("b", 1:10)] <- lapply(1:10, function(x) rnorm(10, x))
df <- bind_cols(df[1], missForest::prodNA(df[-1], noNA = 0.2)) #add NAs
df

purrr::map over variables:

df[paste0("a", 1:10)] <- map2(df %>% select(starts_with("a")), df %>% select(starts_with("b")),
                                                   ~ ifelse(is.na(.y), NA, .x))
df[paste0("b", 1:10)] <- map2(df %>% select(starts_with("b")), df %>% select(starts_with("a")),
                              ~ ifelse(is.na(.y), NA, .x))
df

CodePudding user response：

We could pivot to long format, then in a given row that contains NA, replace all values by NA, then pivot back to wide:

spec <- 
  df %>% 
  build_longer_spec(cols = -ID,
                    names_to = c(".value", "set"),
                    names_pattern = "(. )(\\d )",
                    values_to = "value")

df %>% 
  pivot_longer_spec(spec) %>% 
  print() %>% 
  # Intermediary long format:
  #> # A tibble: 6 x 4
  #>      ID set       A     B
  #>   <int> <chr> <dbl> <dbl>
  #> 1     1 1         1    NA
  #> 2     1 2         4    10
  #> 3     2 1         2     8
  #> 4     2 2         5    NA
  #> 5     3 1         3     9
  #> 6     3 2        NA    12
  rowwise(ID, set) %>% 
  mutate(across(everything(), 
                ~ ifelse(any(is.na(c_across(everything()))), NA, .x))) %>% 
  pivot_wider_spec(spec)

#> # A tibble: 3 x 5
#>      ID    A1    A2    B1    B2
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1    NA     4    NA    10
#> 2     2     2    NA     8    NA
#> 3     3     3    NA     9    NA

With sample data:

df <- data.frame(
  ID = 1:3,
  A1 = c(1, 2, 3),
  A2 = c(4, 5, NA),
  B1 = c(NA, 8, 9),
  B2 = c(10, NA, 12)
)
df
#>   ID A1 A2 B1 B2
#> 1  1  1  4 NA 10
#> 2  2  2  5  8 NA
#> 3  3  3 NA  9 12

CodePudding user response：

We could create masks of the NA values in both sub-tables, combine them, and apply them back to both sub-tables:

na_mask <- is.na(df[2:13]) | is.na(df[14:25])

df[2:13][na_mask] <- NA
df[14:25][na_mask] <- NA

CodePudding user response：

I believe that the suggested solution should be sufficiently performant. Looping over the columns is really not a big deal.

Thanks for voting up and accepting as answer if you like it.

# Create data
nrows <- 10
ncols <- 25
df1 <- as.data.frame(matrix(1:(nrows*ncols), nrow = nrows))
colnames(df1) <- c(
  "ID",
  paste0("A", 1:12),
  paste0("B", 1:12)
)

df1[1:3, c("A1")] <- NA
df1[5:7, c("B5")] <- NA

# Prepare calculation
cols <- as.list(as.data.frame(
  matrix(c(paste0("A", 1:12),
           paste0("B", 1:12)), nrow = 2, byrow = TRUE)
))

# Do calculation
for (col in cols) {
  missing <- is.na(rowSums(df1[col]))
  df1[missing, col] <- NA
}

CodePudding user response：

I've got a solution that showcases some advantages of reformatting one's data.

Let me first generate some data (since we received none)

library(tidyverse)

sample_nrows <- 10
full_df <- 
  tibble(
    ID = rep(seq_len(sample_nrows), each = 12   12),
    name = c(str_c("A", 1:12),
             str_c("B", 1:12)) %>%
      rep(sample_nrows),
    value = 
      rgamma(
        n = 12 * 2 * sample_nrows,
        shape = sample.int(20, size = 10)
      ) %>% 
      round())

full_df %>% 
  #' add 10% missingness
  mutate(value = if_else(rbinom(n(), size = 1, prob = 0.1) %>% as.logical(), NA_real_, value)) %>% 
  #' reconstruct into wide-format
  pivot_wider() %>%
  print(n = Inf, width = Inf) ->
  partial_df

Thus, we generate gamma-distributed data, for sample_nrows-rows (that's full_df), and we add 10% missing data to the whole thing and call it partial_df.

This formatting gives us a novel idea. Working with the long-format will give this result...

partial_df %>% 
  pivot_longer(-ID) %>% 
  tidyr::extract(name, c("name", "var_id"), regex = "(\\D )(\\d )") %>% 
  pivot_wider(names_from = "name") %>% 
  mutate(
    both_na = is.na(A) | is.na(B),
    A = if_else(both_na, NA_real_, A),
    B = if_else(both_na, NA_real_, B),
    both_na = NULL
  ) -> partial_df_with_NAs

To get back to the original format:

partial_df_with_NAs %>%
  pivot_wider(names_from = var_id, values_from = c(A,B), names_sep = "") %>% 
  print(n = Inf, width = Inf)