The question
I have a dataframe (~15000 rows, 90 columns) with columns that contain NAs. Here on SO I found multiple Q&As about filling an NA with values from another df, or from a normal distribution. But those answers would distort the existing distribution within the columns themselves. Example:
Person_ID | Var1 | Var2 |
---|---|---|
A | 1 | NA |
B | NA | 2 |
C | 2 | NA |
D | 1 | 4 |
E | 1 | 3 |
F | 1 | 1 |
G | NA | NA |
H | NA | 1 |
I | 2 | 2 |
J | 1 | NA |
K | 1 | 3 |
L | NA | 4 |
Var1 has a distribution of 75% (1) and 25% (2), so one NA should be replaced with a "2" and the other three with a "1". Var2 has four values at 25% each, so each of its four NAs should be replaced with a different one of those values. The real dataframe is larger, and each of its columns has a finite number of unique numeric values. The real data consists of healthcare information and may not be shared externally.
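For anyone who wants to reproduce this, the example table can be rebuilt as follows. This is only a sketch of the toy data above, and the names `var1_props` and `var2_props` are introduced here for illustration.

```r
# Example data from the table above (NA marks a missing value)
df <- data.frame(
  Person_ID = LETTERS[1:12],
  Var1 = c(1, NA, 2, 1, 1, 1, NA, NA, 2, 1, 1, NA),
  Var2 = c(NA, 2, NA, 4, 3, 1, NA, 1, 2, NA, 3, 4)
)

# Observed proportions per column; table() ignores NA by default
var1_props <- prop.table(table(df$Var1))
var2_props <- prop.table(table(df$Var2))

# Number of missing values per column
na_counts <- colSums(is.na(df[-1]))
```

`var1_props` holds the 75%/25% split described above, and `var2_props` holds 25% for each of the four values.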
Reason for the question
The goal is to perform a t-SNE on the dataframe, so a kNN imputation must be performed. That imputation will take more time than a simple replacement, so an answer to this question would make it possible to take a quick look at the result first.
CodePudding user response:
In order to maintain the proportions of the values, in addition to the suggestion by @onyambu, it would be advisable to include the probability of the values while generating a sample.
df[,-1] <- data.frame(apply(df[,-1],
                            2,
                            function(x)
                              replace(x, is.na(x),
                                      sample(sort(unique(na.omit(x))),
                                             sum(is.na(x)),
                                             replace = TRUE,
                                             prob = prop.table(table(x))))))
Output:
> df
Person_ID Var1 Var2
1 A 1 4
2 B 1 2
3 C 2 2
4 D 1 4
5 E 1 3
6 F 1 1
7 G 1 3
8 H 1 1
9 I 2 2
10 J 1 4
11 K 1 3
12 L 1 4
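Because the values are drawn at random, the proportions are only matched on average: in the output above, all four NAs in Var1 became 1. If the fill has to follow the observed proportions as exactly as the number of NAs allows, one possible approach is to give each value a fixed quota and then shuffle the fill values. The sketch below assumes that approach; `fill_na_exact` is a hypothetical helper name, not part of base R, and ties in the rounding step are broken by value order.

```r
# Hypothetical helper: fill NAs so the fill values follow the observed
# proportions as exactly as integer counts allow (largest-remainder rounding)
fill_na_exact <- function(x) {
  missing <- is.na(x)
  n_missing <- sum(missing)
  observed <- x[!missing]
  if (n_missing == 0 || length(observed) == 0) return(x)

  values <- sort(unique(observed))       # same order as table(observed)
  quota <- as.numeric(table(observed)) * n_missing / length(observed)
  counts <- floor(quota)

  # Hand the remaining slots to the largest fractional remainders
  leftover <- n_missing - sum(counts)
  if (leftover > 0) {
    extra <- order(quota - counts, decreasing = TRUE)[seq_len(leftover)]
    counts[extra] <- counts[extra] + 1
  }

  fill <- rep(values, times = counts)
  # Shuffle by index so the values land on random NA positions
  x[missing] <- fill[sample.int(length(fill))]
  x
}

set.seed(42)
var1_filled <- fill_na_exact(c(1, NA, 2, 1, 1, 1, NA, NA, 2, 1, 1, NA))
var2_filled <- fill_na_exact(c(NA, 2, NA, 4, 3, 1, NA, 1, 2, NA, 3, 4))
```

With the example columns this places exactly one "2" and three "1"s in Var1, and one of each value in Var2; only the positions are random. It could be applied to a whole dataframe with `df[-1] <- lapply(df[-1], fill_na_exact)`.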
CodePudding user response:
In base R you could do:
set.seed(5)
data.frame(lapply(df, \(x) replace(x, is.na(x),
                                   sample(na.omit(x), sum(is.na(x))))))
Output:
Person_ID Var1 Var2
1 A 1 3
2 B 1 2
3 C 2 1
4 D 1 4
5 E 1 3
6 F 1 1
7 G 1 3
8 H 1 1
9 I 2 2
10 J 1 1
11 K 1 3
12 L 2 4
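Two edge cases are worth keeping in mind with the one-liner above on a larger dataframe. It samples without replacement, so it stops with an error when a column has more NAs than observed values, and `sample()` treats a single number n as `1:n` when only one observed value is left. A minimal sketch that avoids both by sampling positions with replacement is shown below; `fill_na_sampled` is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: replace NAs with random draws from the observed values.
# Drawing positions (not values) avoids sample()'s 1:n behaviour for a single
# number, and replace = TRUE allows more NAs than observed values.
fill_na_sampled <- function(x) {
  missing <- is.na(x)
  observed <- x[!missing]
  if (!any(missing) || length(observed) == 0) return(x)
  picks <- sample.int(length(observed), sum(missing), replace = TRUE)
  x[missing] <- observed[picks]
  x
}

set.seed(5)
constant_col <- fill_na_sampled(c(5, NA, 5, NA))   # only one distinct value
sparse_col <- fill_na_sampled(c(3, NA, NA, NA))    # more NAs than observed values
mixed_col <- fill_na_sampled(c(1, NA, 2, 1, 1, 1, NA, NA, 2, 1, 1, NA))
```

Drawing uniformly from the observed values is equivalent in distribution to sampling the unique values weighted by their proportions. The helper could be applied with `df[-1] <- lapply(df[-1], fill_na_sampled)`, which also keeps each column's type.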