I'm trying to test various imputation methods in R and I've written a function which takes a data frame, inserts some random NA values, imputes the missing values and then compares the imputation method back to the original data using MAE.
My function looks as follows:
pacman::p_load(tidyverse)
impute_diamonds_accuracy <- function(df, col, prop) {
require(tidyverse)
# Sample the indices of the rows to convert to NA
n <- nrow(df)
idx_na <- sample(1:n, prop*n)
# Convert the values at the sampled indices to NA
df[idx_na, col] <- NA
# Impute missing values using mice with pmm method
imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
imputed_df <- complete(imputed_df)
# Calculate MAE between imputed and original values
mae <- mean(abs(imputed_df[idx_na, col] - df[idx_na, col]), na.rm = TRUE)
return(list(original_data = df,imputed_data = imputed_df, accuracy = mae))
}
impute_diamonds_accuracy(df = diamonds, col = 'cut', prop = 0.02)
The function prints to the screen that it's doing the imputation but it fails when it performs that MAE calculation with the following error:
Error in imputed_df[idx_na, col] - df[idx_na, col] :
non-numeric argument to binary operator
How can I compare the original data against the imputed version to get a sense of the accuracy?
CodePudding user response:
diamonds
is a tibble.
> library(ggplot2)
> data(diamonds)
> is_tibble(diamonds)
[1] TRUE
so we may need to use [[
to extract the column as a vector. Also, the idx_na
returns the index of NA elements in data. If we want to use the subset comparison, make a copy of the original data before we assign NAs, and then do the comparison between the imputed and original data
mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
-full code
impute_diamonds_accuracy <- function(df, col, prop) {
# Sample the indices of the rows to convert to NA
n <- nrow(df)
idx_na <- sample(1:n, prop*n)
df_cpy <- data.table::copy(df)
# Convert the values at the sampled indices to NA
df[idx_na, col] <- NA
# Impute missing values using mice with pmm method
imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
imputed_df <- mice::complete(imputed_df)
# Calculate MAE between imputed and original values
mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
return(list(original_data = df,imputed_data = imputed_df, accuracy = mae))
}