I have a dataset and I want to replace NAs with empty string in those columns where the number of missing values is greater or equal to n
. For instance, n = 500
.
set.seed(2022)
synthetic <- tibble(
col1 = runif(1000),
col2 = runif(1000),
col3 = runif(1000)
)
na_insert <- c(sample(nrow(synthetic), 500, replace = FALSE))
synthetic[na_insert, 1] <- NA
What I am trying to do and eventually fail:
synthetic %>%
mutate(across(everything(), ~ replace_na(sum(is.na(.x)) >= 500, "")))
What am I doing wrong in this primitive exercise?
CodePudding user response:
You could make use of where
with a purrr
-like function:
library(dplyr)
synthetic %>%
mutate(across(where(~sum(is.na(.x)) >= 500), ~coalesce(as.character(.x), "")))
This returns
# A tibble: 1,000 x 3
col1 col2 col3
<chr> <dbl> <dbl>
1 "" 0.479 0.139
2 "0.647259329678491" 0.410 0.770
3 "" 0.696 0.805
4 "" 0.863 0.803
5 "0.184729989385232" 0.146 0.652
6 "0.635790845612064" 0.634 0.0830
7 "" 0.305 0.527
8 "0.0419759317301214" 0.297 0.275
9 "" 0.883 0.698
10 "0.757252902723849" 0.115 0.933
# ... with 990 more rows
CodePudding user response:
Using ifelse
function:
library(dplyr)
synthetic |>
mutate_all(~ifelse(
sum(is.na(.)) >= 500 & is.na(.),
"",
.
))
Output:
# A tibble: 1,000 x 3
col1 col2 col3
<chr> <dbl> <dbl>
1 "" 0.479 0.139
2 "0.647259329678491" 0.410 0.770
3 "" 0.696 0.805
4 "" 0.863 0.803
5 "0.184729989385232" 0.146 0.652
6 "0.635790845612064" 0.634 0.0830
7 "" 0.305 0.527
8 "0.0419759317301214" 0.297 0.275
9 "" 0.883 0.698
10 "0.757252902723849" 0.115 0.933
Edit:
Using across
and not mutate_all
:
synthetic |>
mutate(across(everything(),
~ ifelse(sum(is.na(
.
)) >= 500 & is.na(.),
"",
.)))
CodePudding user response:
library(data.table)
n <- 500
# convert all to character
setDT(synthetic)[, names(synthetic) := lapply(.SD, as.character)]
# find columns with >= 500 NA's
cols <- which(colSums(is.na(synthetic)) >= n)
# fast!! replace all NA in the found columns to ""
for(col in cols) set(synthetic,
i = which(is.na(synthetic[[col]])),
j = col,
value = "")