Home > Enterprise >  Replace NA with the minimum of row median and column median
Replace NA with the minimum of row median and column median

Time:07-20

I have a dataset that looks like

    a1  a2  a3
A    1   1   1
B    2   NA  2
C    1   1   1

I want to replace NA with the minimum of column median and row median. Since the row median (for row "B") is 2 and the column median (for column "a2") is 1, I want to replace NA with 1 and get

    a1  a2  a3
A    1   1   1
B    2   1   2
C    1   1   1

I know how to replace NA with column median using dplyr:

mutate_if(is.numeric, ~replace_na(., median(., na.rm = TRUE)))

but how to get what I really need?

CodePudding user response:

I want to replace the NA with the minimum of column median and row median.

Let your matrix be mat. (If you have a data.frame df, extract its numeric columns and coerce them to a matrix using mat <- as.matrix(df[sapply(df, is.numeric)]).)

One vectorized solution, reliable even if we have multiple NAs in a row or column, is:

## you may need to install package "matrixStats" first
rmed <- matrixStats::rowMedians(mat, na.rm = TRUE)
cmed <- matrixStats::colMedians(mat, na.rm = TRUE)
ij <- which(is.na(mat), arr.ind = TRUE)
mat[ij] <- pmin(rmed[ij[, 1]], cmed[ij[, 2]])
mat

The code gives expected result for your 3 x 3 toy example. And here is a more complicated one for testing:

mat <- structure(c(6L, 4L, NA, NA, 2L, 8L, 8L, NA, NA, 7L, 7L, 4L, 4L, 
NA, NA, NA, 5L, 8L, NA, 9L, 6L, NA, 5L, 5L, 10L, 5L, NA, 6L, 
NA, 9L, NA, 6L, 5L, 1L, 10L, 7L, 3L, 1L, 2L, NA, 7L, NA, 8L, 
1L, 10L, 8L, 2L, 1L, NA, NA, 2L, NA, NA, 2L, NA, 10L, 6L, 6L, 
NA, 9L, 10L, 1L, 5L, 10L), dim = c(8L, 8L))
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,]    6   NA    5   10    5    7   NA    6
#[2,]    4    7    8    5    1   NA   NA    6
#[3,]   NA    7   NA   NA   10    8    2   NA
#[4,]   NA    4    9    6    7    1   NA    9
#[5,]    2    4    6   NA    3   10   NA   10
#[6,]    8   NA   NA    9    1    8    2    1
#[7,]    8   NA    5   NA    2    2   NA    5
#[8,]   NA   NA    5    6   NA    1   10   10

The processing result is:

#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,]    6  5.5  5.0   10    5  7.0    2    6
#[2,]    4  7.0  8.0    5    1  5.5    2    6
#[3,]    6  7.0  5.5    6   10  8.0    2    6
#[4,]    6  4.0  9.0    6    7  1.0    2    9
#[5,]    2  4.0  6.0    5    3 10.0    2   10
#[6,]    8  5.0  5.0    9    1  8.0    2    1
#[7,]    8  5.0  5.0    5    2  2.0    2    5
#[8,]    6  5.5  5.0    6    3  1.0   10   10

Note that the median of a vector x can be a value not in x. Say, median(1:6) is 3.5.

CodePudding user response:

Another possible solution, based on outer:

library(tidyverse)

outer(1:nrow(df), 1:ncol(df), 
  Vectorize(\(z,w) if (is.na(df[z,w])) 
    min(apply(df, 2, \(x) median(x, na.rm = T))[z],
        apply(df, 1, \(x) median(x, na.rm = T))[w])
    else df[z,w])) %>% 
  as.data.frame %>% 
  set_names(names(df)) %>% 
  `rownames<-`(rownames(df))

#>   a1 a2 a3
#> A  1  1  1
#> B  2  1  2
#> C  1  1  1
  • Related