I am working on a data frame with multiple data types. I would like to replace NA values with the median of the column they are in, but only for numeric columns. I have seen many questions about replacing NAs with the mean, but not the median. My df is similar to the following code:
my_groups <- c(rep("A", 5), rep("B",5))
my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(3, NA, 4, 8, 2, 11, 15, NA, 9, 10)
my_df <- data.frame(my_groups, my_values_1, my_values_2)
library(dplyr)
my_df %>% select_if(is.numeric)
This gives me the numeric columns, but I can't figure out the next step.
CodePudding user response:
1) Inserting some NAs into the first column of the built-in BOD data frame, we have:
library(dplyr)
BOD$Time[1:2] <- NA
na.median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
BOD %>% mutate(across(where(is.numeric), na.median))
giving:
  Time demand
1  4.5    8.3
2  4.5   10.3
3  3.0   19.0
4  4.0   16.0
5  5.0   15.6
6  7.0   19.8
2) or using only base R with na.median from above:
ok <- sapply(BOD, is.numeric)
replace(BOD, ok, lapply(BOD[ok], na.median))
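The same base R approach can be tried on the data frame from the question. This is only a minimal sketch: it rebuilds my_df and na.median so that it runs on its own, and the name filled is a hypothetical one chosen for illustration.

```r
# Sample data from the question
my_df <- data.frame(
  my_groups   = c(rep("A", 5), rep("B", 5)),
  my_values_1 = c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8),
  my_values_2 = c(3, NA, 4, 8, 2, 11, 15, NA, 9, 10)
)

# Helper from the answer above: replace NAs with the median of the non-NA values
na.median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))

# Logical vector marking which columns are numeric
ok <- sapply(my_df, is.numeric)

# 'filled' is a hypothetical name; only the numeric columns are overwritten
filled <- my_df
filled[ok] <- lapply(my_df[ok], na.median)

# Check whether any NA values remain
anyNA(filled)
```

The character column my_groups is left as it was, because ok is FALSE for it.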
CodePudding user response:
We could use mutate with across and an ifelse statement:
Note: D. Grothendieck's answer also works perfectly!
library(dplyr)
my_df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
output:
   my_groups my_values_1 my_values_2
1          A         4.0         3.0
2          A         9.0         8.5
3          A        10.0         4.0
4          A         8.5         8.0
5          A         5.0         2.0
6          B        12.0        11.0
7          B         8.5        15.0
8          B         7.0         8.5
9          B        11.0         9.0
10         B         8.0        10.0