I have a dataset that looks like
a1 a2 a3
A 1 1 1
B 2 NA 2
C 1 1 1
I want to replace NA with the minimum of column median and row median. Since the row median (for row "B") is 2 and the column median (for column "a2") is 1, I want to replace NA with 1 and get
a1 a2 a3
A 1 1 1
B 2 1 2
C 1 1 1
I know how to replace NA with column median using dplyr:
mutate_if(is.numeric, ~replace_na(., median(., na.rm = TRUE)))
but how to get what I really need?
CodePudding user response:
I want to replace the NA with the minimum of column median and row median.
Let your matrix be mat
. (If you have a data.frame df
, extract its numeric columns and coerce them to a matrix using mat <- as.matrix(df[sapply(df, is.numeric)])
.)
One vectorized solution, reliable even if we have multiple NA
s in a row or column, is:
## you may need to install package "matrixStats" first
rmed <- matrixStats::rowMedians(mat, na.rm = TRUE)
cmed <- matrixStats::colMedians(mat, na.rm = TRUE)
ij <- which(is.na(mat), arr.ind = TRUE)
mat[ij] <- pmin(rmed[ij[, 1]], cmed[ij[, 2]])
mat
The code gives expected result for your 3 x 3 toy example. And here is a more complicated one for testing:
mat <- structure(c(6L, 4L, NA, NA, 2L, 8L, 8L, NA, NA, 7L, 7L, 4L, 4L,
NA, NA, NA, 5L, 8L, NA, 9L, 6L, NA, 5L, 5L, 10L, 5L, NA, 6L,
NA, 9L, NA, 6L, 5L, 1L, 10L, 7L, 3L, 1L, 2L, NA, 7L, NA, 8L,
1L, 10L, 8L, 2L, 1L, NA, NA, 2L, NA, NA, 2L, NA, 10L, 6L, 6L,
NA, 9L, 10L, 1L, 5L, 10L), dim = c(8L, 8L))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,] 6 NA 5 10 5 7 NA 6
#[2,] 4 7 8 5 1 NA NA 6
#[3,] NA 7 NA NA 10 8 2 NA
#[4,] NA 4 9 6 7 1 NA 9
#[5,] 2 4 6 NA 3 10 NA 10
#[6,] 8 NA NA 9 1 8 2 1
#[7,] 8 NA 5 NA 2 2 NA 5
#[8,] NA NA 5 6 NA 1 10 10
The processing result is:
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,] 6 5.5 5.0 10 5 7.0 2 6
#[2,] 4 7.0 8.0 5 1 5.5 2 6
#[3,] 6 7.0 5.5 6 10 8.0 2 6
#[4,] 6 4.0 9.0 6 7 1.0 2 9
#[5,] 2 4.0 6.0 5 3 10.0 2 10
#[6,] 8 5.0 5.0 9 1 8.0 2 1
#[7,] 8 5.0 5.0 5 2 2.0 2 5
#[8,] 6 5.5 5.0 6 3 1.0 10 10
Note that the median of a vector x
can be a value not in x
. Say, median(1:6)
is 3.5.
CodePudding user response:
Another possible solution, based on outer
:
library(tidyverse)
outer(1:nrow(df), 1:ncol(df),
Vectorize(\(z,w) if (is.na(df[z,w]))
min(apply(df, 2, \(x) median(x, na.rm = T))[z],
apply(df, 1, \(x) median(x, na.rm = T))[w])
else df[z,w])) %>%
as.data.frame %>%
set_names(names(df)) %>%
`rownames<-`(rownames(df))
#> a1 a2 a3
#> A 1 1 1
#> B 2 1 2
#> C 1 1 1