Instead of filling missing values with 0 or with the variable mean, I would like to fill them with the mean of the other similar observations in the dataset.
Example: A, B, C and D are each a single sample of various measures.
     V1   V2  V3
A   8.7  4.3   5
B   nan  2.5   3
C   0.1  2.5   3
D   1.5  2.5   3
Running K-Means clustering on variables V2 and V3 returns 2 clusters: one with A, and a second one with B, C and D. Because B belongs to the 2nd cluster, I want to fill its missing value in V1 with the 2nd cluster's mean of V1.
So the missing value for row B in V1 will be 0.8, because that is the mean of 0.1 and 1.5, the V1 values of C and D.
This is a very simple example, so I would like to know how to do this with Python for a large dataset.
Thanks for your help with code that can do this quickly and fill the missing values "automatically" in that way.
CodePudding user response:
Use KNNImputer from sklearn:
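import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Sample data from the question; row 1 (B) is missing V1
data = {'V1': [8.7, np.nan, 0.1, 1.5],
        'V2': [4.3, 2.5, 2.5, 2.5],
        'V3': [5, 3, 3, 3]}
df = pd.DataFrame(data)

# Fill each missing value with the mean over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
out = imputer.fit_transform(df)

# fit_transform returns a NumPy array, so rebuild the DataFrame
out = pd.DataFrame(out, index=df.index, columns=df.columns)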
Output:
>>> out
    V1   V2   V3
0  8.7  4.3  5.0
1  0.8  2.5  3.0
2  0.1  2.5  3.0
3  1.5  2.5  3.0
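
Note that KNNImputer averages over nearest neighbours rather than over K-Means clusters, so on other data it may not match the cluster means described in the question. If you want the literal cluster-mean approach, here is a minimal sketch. It assumes the clustering columns (V2 and V3) have no missing values, and that n_clusters=2 is known in advance as in the example; the names labels and filled are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame(
    {"V1": [8.7, np.nan, 0.1, 1.5],
     "V2": [4.3, 2.5, 2.5, 2.5],
     "V3": [5, 3, 3, 3]},
    index=list("ABCD"),
)

# Cluster only on the fully observed columns (KMeans does not accept NaN).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = pd.Series(km.fit_predict(df[["V2", "V3"]]), index=df.index)

# Per-row mean of V1 within that row's cluster (NaN is skipped by mean).
cluster_mean_v1 = df.groupby(labels)["V1"].transform("mean")

# Only the missing entries of V1 are replaced by their cluster mean.
filled = df.copy()
filled["V1"] = df["V1"].fillna(cluster_mean_v1)
print(filled)
```

On the sample data, B ends up in the same cluster as C and D and gets the mean of their V1 values. If every row of a cluster is missing V1, the cluster mean is itself NaN and the value stays missing, so you may still need a fallback such as the overall column mean.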