My dataframe is a csv read into r using fread:
df <- fread("df.csv", stringsAsFactors = T)
Here is a screenshot of the partial dataframe (the entire dataframe is 1177 rows long and consist of a few hundred NAs in certain cells):
This is my R code to replace NAs for each cell in the dataframe, df with the median of the respective column.
for (j in 1:length(df)){
if(is.na(df[i][j])){
if (class(df[[j]])== 'numeric' || class(df[[j]])== 'integer' ){
df[i][j]= median(df[[j]], na.rm=T)
}
}
}
}
There is this error currently:There were 50 or more warnings (use warnings() to see the first 50)
I am guessing there is something wrong with how I refer to each cell of the dataframe df, which is df[i][j] currently. Can I get some advice?
CodePudding user response:
No loop required:
base R
diam <- head(ggplot2::diamonds)
diam$table[3:4] <- NA
diam$x[c(1,5)] <- NA
diam
# # A tibble: 6 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 NA 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 NA 327 4.05 4.07 2.31
# 4 0.29 Premium I VS2 62.4 NA 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 NA 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
isnum <- sapply(diam, is.numeric)
diam[isnum] <- lapply(diam[isnum], function(z) ifelse(is.na(z), median(z, na.rm = TRUE), z))
diam
# # A tibble: 6 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 4.00 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 57.5 327 4.05 4.07 2.31
# 4 0.29 Premium I VS2 62.4 57.5 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 4.00 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
(ggplot2
is only used here as a source of a diverse dataset. Otherwise this solution is base-R.)
dplyr
library(dplyr)
diam %>%
mutate(across(where(is.numeric), ~ if_else(is.na(.), median(., na.rm = TRUE), as.numeric(.))))
# # A tibble: 6 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 NA 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 NA 327 4.05 4.07 2.31
# 4 0.29 Premium I VS2 62.4 NA 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 NA 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Side note
It appears that your data is or should be grouped/stratified (group
among others). What you're talking about is "imputation", and replacing missing data with the median should arguably be done by group whenever possible so as to not introduce bias into your analysis. There are numerous ways to approach this, too many to start discussing here, but I implore you to look into imputation and, if necessary, open a new question into best imputation strategies for this data (on https://stats.stackexchange.com/, likely).
CodePudding user response:
Do the following mutate
step for any field that has NAs in it.
df %>%
mutate(BMI = case_when(is.na(BMI) ~ median(BMI), TRUE ~ BMI)