Say that you have a data frame named df
with 25 columns, first of them being named ID
, second to thirteenth having names from A1
to A12
and fourteenth to twenty-fifth having names from B1
to B12
. Values in the A and B variables can be missing data.
The task that I am facing is to consolidate the missingness in the data - if there is a missing entry in, say, 8th row of A4, then the 8th row of B4 also needs to be updated to NA, even if it has some data. This also works vice versa, if there is a missing entry in, say, 19th row of B11, then the 19th row of A11 also needs to be missing.
I can do this with two for
loops:
for(i in 2:13){
for(j in 1:nrow(df)){
if(is.na(df[j,i 12])){
df[j,i] <- NA
}
}
}
for(i in 14:25){
for(j in 1:nrow(df)){
if(is.na(df[j,i-12])){
df[j,i] <- NA
}
}
}
However, I am looking for a solution that doesn't include for
loops and is preferrably in tidyverse. How could this be done more efficiently?
CodePudding user response:
How about this?
#create dataset
library(tidyverse)
library(missForest)
df <- data.frame(id = c(1:10))
df[paste0("a", 1:10)] <- lapply(1:10, function(x) rnorm(10, x))
df[paste0("b", 1:10)] <- lapply(1:10, function(x) rnorm(10, x))
df <- bind_cols(df[1], missForest::prodNA(df[-1], noNA = 0.2)) #add NAs
df
purrr::map
over variables:
df[paste0("a", 1:10)] <- map2(df %>% select(starts_with("a")), df %>% select(starts_with("b")),
~ ifelse(is.na(.y), NA, .x))
df[paste0("b", 1:10)] <- map2(df %>% select(starts_with("b")), df %>% select(starts_with("a")),
~ ifelse(is.na(.y), NA, .x))
df
CodePudding user response:
We could pivot to long format, then in a given row that contains NA
, replace all values by NA
, then pivot back to wide:
spec <-
df %>%
build_longer_spec(cols = -ID,
names_to = c(".value", "set"),
names_pattern = "(. )(\\d )",
values_to = "value")
df %>%
pivot_longer_spec(spec) %>%
print() %>%
# Intermediary long format:
#> # A tibble: 6 x 4
#> ID set A B
#> <int> <chr> <dbl> <dbl>
#> 1 1 1 1 NA
#> 2 1 2 4 10
#> 3 2 1 2 8
#> 4 2 2 5 NA
#> 5 3 1 3 9
#> 6 3 2 NA 12
rowwise(ID, set) %>%
mutate(across(everything(),
~ ifelse(any(is.na(c_across(everything()))), NA, .x))) %>%
pivot_wider_spec(spec)
#> # A tibble: 3 x 5
#> ID A1 A2 B1 B2
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 4 NA 10
#> 2 2 2 NA 8 NA
#> 3 3 3 NA 9 NA
With sample data:
df <- data.frame(
ID = 1:3,
A1 = c(1, 2, 3),
A2 = c(4, 5, NA),
B1 = c(NA, 8, 9),
B2 = c(10, NA, 12)
)
df
#> ID A1 A2 B1 B2
#> 1 1 1 4 NA 10
#> 2 2 2 5 8 NA
#> 3 3 3 NA 9 12
CodePudding user response:
We could create masks of the NA values in both sub-tables, combine them, and apply them back to both sub-tables:
na_mask <- is.na(df[2:13]) | is.na(df[14:25])
df[2:13][na_mask] <- NA
df[14:25][na_mask] <- NA
CodePudding user response:
I believe that the suggested solution should be sufficiently performant. Looping over the columns is really not a big deal.
Thanks for voting up and accepting as answer if you like it.
# Create data
nrows <- 10
ncols <- 25
df1 <- as.data.frame(matrix(1:(nrows*ncols), nrow = nrows))
colnames(df1) <- c(
"ID",
paste0("A", 1:12),
paste0("B", 1:12)
)
df1[1:3, c("A1")] <- NA
df1[5:7, c("B5")] <- NA
# Prepare calculation
cols <- as.list(as.data.frame(
matrix(c(paste0("A", 1:12),
paste0("B", 1:12)), nrow = 2, byrow = TRUE)
))
# Do calculation
for (col in cols) {
missing <- is.na(rowSums(df1[col]))
df1[missing, col] <- NA
}
CodePudding user response:
I've got a solution that showcases some advantages of reformatting one's data.
Let me first generate some data (since we received none)
library(tidyverse)
sample_nrows <- 10
full_df <-
tibble(
ID = rep(seq_len(sample_nrows), each = 12 12),
name = c(str_c("A", 1:12),
str_c("B", 1:12)) %>%
rep(sample_nrows),
value =
rgamma(
n = 12 * 2 * sample_nrows,
shape = sample.int(20, size = 10)
) %>%
round())
full_df %>%
#' add 10% missingness
mutate(value = if_else(rbinom(n(), size = 1, prob = 0.1) %>% as.logical(), NA_real_, value)) %>%
#' reconstruct into wide-format
pivot_wider() %>%
print(n = Inf, width = Inf) ->
partial_df
Thus, we generate gamma
-distributed data, for sample_nrows
-rows (that's full_df
), and we add 10% missing data to the whole thing and call it partial_df
.
This formatting gives us a novel idea. Working with the long-format will give this result...
partial_df %>%
pivot_longer(-ID) %>%
tidyr::extract(name, c("name", "var_id"), regex = "(\\D )(\\d )") %>%
pivot_wider(names_from = "name") %>%
mutate(
both_na = is.na(A) | is.na(B),
A = if_else(both_na, NA_real_, A),
B = if_else(both_na, NA_real_, B),
both_na = NULL
) -> partial_df_with_NAs
To get back to the original format:
partial_df_with_NAs %>%
pivot_wider(names_from = var_id, values_from = c(A,B), names_sep = "") %>%
print(n = Inf, width = Inf)