Replace NA in a dataframe, keeping the column value distribution

Time:06-11

The question

I have a dataframe (~15000 rows, 90 columns) with columns that contain NAs. Here on SO I found multiple Q&As about filling an NA with values from another df, or from a normal distribution. But these answers would destroy the existing distribution in the columns themselves. Example:

Person_ID Var1 Var2
A 1 NA
B NA 2
C 2 NA
D 1 4
E 1 3
F 1 1
G NA NA
H NA 1
I 2 2
J 1 NA
K 1 3
L NA 4

Var1 has a distribution of 75% (1) and 25% (2), so of its four NAs, one should be replaced with a "2" and the other three with a "1". Var2 has four values at 25% each, so each of its four NAs should be replaced with a different one of those values. The real dataframe is larger, and each column has a finite number of unique numeric values. The real data consists of healthcare information and may not be shared externally.
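For readers who want to follow along, the example table can be rebuilt as a data frame and the stated proportions checked directly. This is only a sketch of the example data from the question; the name `df` matches the one used in the answers.

```r
# Example data from the question, rebuilt as a data frame
df <- data.frame(
  Person_ID = LETTERS[1:12],
  Var1 = c(1, NA, 2, 1, 1, 1, NA, NA, 2, 1, 1, NA),
  Var2 = c(NA, 2, NA, 4, 3, 1, NA, 1, 2, NA, 3, 4)
)

# Observed distribution per column (table() drops NA by default)
prop.table(table(df$Var1))
prop.table(table(df$Var2))

# Number of NAs to fill in each value column
colSums(is.na(df[, -1]))
```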

Reason for the question

The goal is to perform a t-SNE on the dataframe, so a kNN imputation must be performed first. The imputation will take more time than a simple replacement, so the answer would make it possible to take a quick look at the result.

CodePudding user response:

To maintain the proportions of the values, building on the suggestion by @onyambu, it is advisable to pass the observed proportions of the values as sampling probabilities when generating the sample.

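# Fill each column's NAs by sampling its observed values, weighted by frequency.
# The first column (Person_ID) is skipped.
df[,-1] <- data.frame(apply(df[,-1], 
                            2, 
                            function(x) 
                              replace(x, is.na(x),
                                      # candidate values, in the same order as table(x)
                                      sample(sort(unique(na.omit(x))), 
                                             sum(is.na(x)),   # one draw per NA
                                             replace = TRUE, 
                                             # observed proportions as weights
                                             prob = prop.table(table(x))))))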

Output:

> df
   Person_ID Var1 Var2
1          A    1    4
2          B    1    2
3          C    2    2
4          D    1    4
5          E    1    3
6          F    1    1
7          G    1    3
8          H    1    1
9          I    2    2
10         J    1    4
11         K    1    3
12         L    1    4
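Sampling with `prob` reproduces the observed proportions only on average, so a single draw can deviate from them. If the fill counts themselves should match the proportions, as the question describes for Var1, one option is to compute a quota per value and then shuffle the fill values. The following is a minimal sketch and not part of the answer above: `fill_na_exact` is a hypothetical helper name, and the largest-remainder rounding is an assumption about how to handle quotas that are not whole numbers.

```r
# Hypothetical helper: fill NAs so that the fill counts follow the observed
# proportions as closely as whole numbers allow (largest-remainder rounding).
fill_na_exact <- function(x) {
  n_na <- sum(is.na(x))
  vals <- sort(unique(x[!is.na(x)]))
  if (n_na == 0 || length(vals) == 0) return(x)  # nothing to fill, or nothing to fill with

  # Count how often each observed value occurs (NAs are ignored by tabulate)
  counts <- tabulate(match(x, vals), nbins = length(vals))

  # Ideal (possibly fractional) number of fills per value
  quota <- n_na * counts / sum(counts)
  k <- floor(quota)

  # Hand out the remaining slots to the largest fractional parts
  left <- n_na - sum(k)
  if (left > 0) {
    idx <- order(quota - k, decreasing = TRUE)[seq_len(left)]
    k[idx] <- k[idx] + 1
  }

  # Shuffle by index so the positions are random but the counts are fixed
  fill <- rep(vals, times = k)
  x[is.na(x)] <- fill[sample.int(length(fill))]
  x
}

# Usage on Var1 from the question: 6 ones, 2 twos, 4 NAs
var1 <- c(1, NA, 2, 1, 1, 1, NA, NA, 2, 1, 1, NA)
filled <- fill_na_exact(var1)
table(filled)  # 9 ones and 3 twos; only the positions depend on the seed
```

To apply it to every column except the ID, something like `df[-1] <- lapply(df[-1], fill_na_exact)` should work.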

CodePudding user response:

In base R you could do:

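set.seed(5)
# Replace each column's NAs with values drawn (without replacement) from its
# observed values; the \(x) lambda shorthand needs R >= 4.1
data.frame(lapply(df, \(x) replace(x, is.na(x), sample(na.omit(x), sum(is.na(x))))))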

   Person_ID Var1 Var2
1          A    1    3
2          B    1    2
3          C    2    1
4          D    1    4
5          E    1    3
6          F    1    1
7          G    1    3
8          H    1    1
9          I    2    2
10         J    1    1
11         K    1    3
12         L    2    4
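Two edge cases are worth keeping in mind with the one-liner above. `sample()` without `replace = TRUE` raises an error when a column has more NAs than observed values, and when a numeric column has only a single observed value `v`, `sample(v, n)` draws from `1:v` rather than from `v` itself. A minimal sketch that sidesteps both by sampling indices with replacement is shown here; `fill_na_sample` is a hypothetical helper name, and the small data frame is made up purely to exercise the edge cases.

```r
# Hypothetical helper: fill NAs by resampling the observed values by index
fill_na_sample <- function(x) {
  obs  <- x[!is.na(x)]
  n_na <- sum(is.na(x))
  if (n_na == 0 || length(obs) == 0) return(x)  # nothing to fill, or nothing to fill with
  # sample.int() on the index avoids the 1:v behaviour of sample() on length-one input;
  # replace = TRUE allows more NAs than observed values
  x[is.na(x)] <- obs[sample.int(length(obs), n_na, replace = TRUE)]
  x
}

# Made-up data: 'a' has more NAs than observed values, 'b' has one observed value
toy <- data.frame(
  a = c(1, 2, NA, NA, NA),
  b = c(5, NA, NA, NA, NA)
)

set.seed(5)
toy[] <- lapply(toy, fill_na_sample)
toy
```

Because the observed values are resampled directly, frequent values are drawn more often, so the column distribution is preserved on average without passing `prob` explicitly.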