Calculate the accuracy of an imputation function in R

I'm trying to test various imputation methods in R. I've written a function that takes a data frame, replaces a random sample of values in one column with NA, imputes the missing values, and then compares the imputed values with the original data using mean absolute error (MAE).

My function looks as follows:

pacman::p_load(tidyverse)

impute_diamonds_accuracy <- function(df, col, prop) {
  require(tidyverse)
  
  # Sample the indices of the rows to convert to NA
  n <- nrow(df)
  idx_na <- sample(1:n, prop*n)
  
  # Convert the values at the sampled indices to NA
  df[idx_na, col] <- NA
  
  # Impute missing values using mice with pmm method
  imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
  imputed_df <- complete(imputed_df)
  
  # Calculate MAE between imputed and original values
  mae <- mean(abs(imputed_df[idx_na, col] - df[idx_na, col]), na.rm = TRUE)
  
  return(list(original_data = df,imputed_data = imputed_df, accuracy = mae))
}

impute_diamonds_accuracy(df = diamonds, col = 'cut', prop = 0.02)

The function prints progress messages while it runs the imputation, but it fails at the MAE calculation with the following error:

Error in imputed_df[idx_na, col] - df[idx_na, col] : 
  non-numeric argument to binary operator

How can I compare the original data against the imputed version to get a sense of the accuracy?

CodePudding user response:

diamonds is a tibble:

> library(ggplot2)
> data(diamonds)
> tibble::is_tibble(diamonds)
[1] TRUE

Because diamonds is a tibble, df[idx_na, col] returns a one-column tibble rather than a vector, so use [[ to extract the column as a vector before subsetting. Also, the rows in idx_na have already been set to NA in df by the time the MAE is computed, so comparing against df can only produce NA. Make a copy of the original data before assigning the NAs, and then compare the imputed values with that copy:

 mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
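
To see why both changes matter, here is a minimal base-R sketch on a toy data frame. The data and the mean-based fill are stand-ins chosen for illustration (not mice and not diamonds), and drop = FALSE is used only to mimic the one-column data frame that [ returns on a tibble.

```r
# Toy data frame (hypothetical); mean imputation stands in for mice here.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
idx_na <- c(2, 5, 9)                 # rows to mask, fixed for reproducibility

# `[` on a tibble keeps a one-column data frame (mimicked with drop = FALSE),
# while `[[` returns the plain vector that arithmetic needs.
is.data.frame(toy[idx_na, "x", drop = FALSE])   # TRUE
is.numeric(toy[["x"]][idx_na])                  # TRUE

truth <- toy[["x"]][idx_na]          # save the true values BEFORE masking
toy[idx_na, "x"] <- NA

# The masked column now holds only NA at those rows, so it cannot serve
# as the reference for the error calculation.
all(is.na(toy[["x"]][idx_na]))                  # TRUE

# Stand-in imputation: fill with the mean of the observed values
imputed <- toy[["x"]]
imputed[idx_na] <- mean(toy[["x"]], na.rm = TRUE)

# MAE over the masked rows only
mae <- mean(abs(imputed[idx_na] - truth))
```

Saving only the masked values (truth) instead of copying the whole data frame would also work; the full copy is simply convenient when the original is returned alongside the result.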

Full code:

impute_diamonds_accuracy <- function(df, col, prop) {
  # Sample the indices of the rows to convert to NA
  n <- nrow(df)
  idx_na <- sample(seq_len(n), prop * n)

  # Keep an unmodified copy for the comparison. For a data frame or tibble,
  # plain assignment is enough because R copies on modification.
  df_cpy <- df

  # Convert the values at the sampled indices to NA
  df[idx_na, col] <- NA

  # Impute missing values using mice with pmm method
  imputed_df <- mice::mice(df, method = 'pmm', m = 1, maxit = 10)
  imputed_df <- mice::complete(imputed_df)

  # Calculate MAE between imputed and original values.
  # Note: `-` is only meaningful for numeric columns; for an ordered factor
  # such as cut it returns NA with a warning.
  mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)

  return(list(original_data = df_cpy, imputed_data = imputed_df, accuracy = mae))
}
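
One caveat for the call in the question: cut is an ordered factor, and subtracting factors is not meaningful in R, so the MAE line only gives a useful number for numeric columns. As a rough sketch of two alternatives for a factor column, the toy vectors below (hypothetical values, using the level order of diamonds$cut) show a simple mismatch rate and, for ordered factors only, the mean absolute distance between level positions. The names error_rate and level_mae are illustrative, not part of mice.

```r
# Level order of diamonds$cut, from worst to best
lvls <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Hypothetical true and imputed values at the masked rows
truth   <- factor(c("Good", "Ideal", "Fair", "Premium"),
                  levels = lvls, ordered = TRUE)
imputed <- factor(c("Good", "Premium", "Good", "Premium"),
                  levels = lvls, ordered = TRUE)

# Share of imputed values that differ from the truth (works for any factor)
error_rate <- mean(as.character(imputed) != as.character(truth))

# Mean absolute distance in level positions; only sensible when the
# levels have a natural order, and it treats neighbouring levels as
# equally spaced, which is an assumption
level_mae <- mean(abs(as.integer(imputed) - as.integer(truth)))
```

Inside the function, the same idea would apply to imputed_df[[col]][idx_na] and df_cpy[[col]][idx_na], for example by branching on is.factor(df_cpy[[col]]).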