Replacing NAs for numerical variables with median-CodePudding

My dataframe is a csv read into r using fread:

df <- fread("df.csv", stringsAsFactors = T)

Here is a screenshot of the partial dataframe (the entire dataframe is 1177 rows long and consist of a few hundred NAs in certain cells):

This is my R code to replace NAs for each cell in the dataframe, df with the median of the respective column.

  for (j in 1:length(df)){
    if(is.na(df[i][j])){
      if (class(df[[j]])== 'numeric' || class(df[[j]])== 'integer' ){
        df[i][j]= median(df[[j]], na.rm=T)
      }
    }
  }
}

There is this error currently:There were 50 or more warnings (use warnings() to see the first 50)

I am guessing there is something wrong with how I refer to each cell of the dataframe df, which is df[i][j] currently. Can I get some advice?

CodePudding user response：

No loop required:

base R

diam <- head(ggplot2::diamonds)
diam$table[3:4] <- NA
diam$x[c(1,5)] <- NA
diam
# # A tibble: 6 x 10
#   carat cut       color clarity depth table price     x     y     z
#   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1  0.23 Ideal     E     SI2      61.5    55   326 NA     3.98  2.43
# 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
# 3  0.23 Good      E     VS1      56.9    NA   327  4.05  4.07  2.31
# 4  0.29 Premium   I     VS2      62.4    NA   334  4.2   4.23  2.63
# 5  0.31 Good      J     SI2      63.3    58   335 NA     4.35  2.75
# 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
isnum <- sapply(diam, is.numeric)
diam[isnum] <- lapply(diam[isnum], function(z) ifelse(is.na(z), median(z, na.rm = TRUE), z))
diam
# # A tibble: 6 x 10
#   carat cut       color clarity depth table price     x     y     z
#   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1  0.23 Ideal     E     SI2      61.5  55     326  4.00  3.98  2.43
# 2  0.21 Premium   E     SI1      59.8  61     326  3.89  3.84  2.31
# 3  0.23 Good      E     VS1      56.9  57.5   327  4.05  4.07  2.31
# 4  0.29 Premium   I     VS2      62.4  57.5   334  4.2   4.23  2.63
# 5  0.31 Good      J     SI2      63.3  58     335  4.00  4.35  2.75
# 6  0.24 Very Good J     VVS2     62.8  57     336  3.94  3.96  2.48

(ggplot2 is only used here as a source of a diverse dataset. Otherwise this solution is base-R.)

dplyr

library(dplyr)
diam %>%
  mutate(across(where(is.numeric), ~ if_else(is.na(.), median(., na.rm = TRUE), as.numeric(.))))
# # A tibble: 6 x 10
#   carat cut       color clarity depth table price     x     y     z
#   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  0.23 Ideal     E     SI2      61.5    55   326 NA     3.98  2.43
# 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
# 3  0.23 Good      E     VS1      56.9    NA   327  4.05  4.07  2.31
# 4  0.29 Premium   I     VS2      62.4    NA   334  4.2   4.23  2.63
# 5  0.31 Good      J     SI2      63.3    58   335 NA     4.35  2.75
# 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Side note

It appears that your data is or should be grouped/stratified (group among others). What you're talking about is "imputation", and replacing missing data with the median should arguably be done by group whenever possible so as to not introduce bias into your analysis. There are numerous ways to approach this, too many to start discussing here, but I implore you to look into imputation and, if necessary, open a new question into best imputation strategies for this data (on https://stats.stackexchange.com/, likely).

CodePudding user response：

Do the following mutate step for any field that has NAs in it.

df %>%
   mutate(BMI = case_when(is.na(BMI) ~ median(BMI), TRUE ~ BMI)