Multivariate Exploratory Analysis: filling missing DataFrame values with cluster mean values

Instead of filling missing values with 0 or with the variable mean, I would like to fill them with the mean of the other similar observations in the dataset.

Example: A, B, C and D are each a single sample of various measures.

    V1  V2  V3
A   8.7 4.3 5
B   nan 2.5 3
C   0.1 2.5 3
D   1.5 2.5 3

Running K-Means clustering on variables V2 and V3 returns 2 clusters: one with A, and a second one with B, C and D. Because B belongs to the 2nd cluster, I want to fill its missing value in V1 with the 2nd cluster's mean for V1.

So the missing value for row B in V1 will be 0.8, because that is the mean of 0.1 and 1.5, the V1 values of C and D.

This is a very simple example, so I would like to know how to do this with Python for a large dataset.

Thanks for your help with code that can do this quickly and fill the missing values "automatically" in this way.

CodePudding user response:

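Use KNNImputer from sklearn. It averages the nearest neighbours rather than running K-Means, but in this example the two nearest neighbours of B are C and D, so the result is the same: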

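import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Example data from the question; row 1 (B) is missing V1
data = {'V1': [8.7, np.nan, 0.1, 1.5],
        'V2': [4.3, 2.5, 2.5, 2.5],
        'V3': [5, 3, 3, 3]}
df = pd.DataFrame(data)

# Each missing value becomes the mean of that column over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
out = imputer.fit_transform(df)
# fit_transform returns a NumPy array, so restore the index and column labels
out = pd.DataFrame(out, index=df.index, columns=df.columns)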

Output:

>>> out
    V1   V2   V3
0  8.7  4.3  5.0
1  0.8  2.5  3.0
2  0.1  2.5  3.0
3  1.5  2.5  3.0
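
If you want the literal approach from the question (cluster first, then fill with the cluster mean), here is a minimal sketch. It assumes the clustering columns V2 and V3 have no missing values and that 2 clusters is the right number. The names km, labels and filled are only illustrative, and the n_init and random_state values are arbitrary choices, not something taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame(
    {"V1": [8.7, np.nan, 0.1, 1.5],
     "V2": [4.3, 2.5, 2.5, 2.5],
     "V3": [5, 3, 3, 3]},
    index=list("ABCD"),
)

# Cluster only on the fully observed columns (assumption: V2 and V3 have no NaN)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(df[["V2", "V3"]])

# Per-row mean of V1 within the row's cluster; NaN values are skipped by mean
cluster_mean = df.groupby(labels)["V1"].transform("mean")

# Replace only the missing entries of V1 with their cluster mean
filled = df.copy()
filled["V1"] = filled["V1"].fillna(cluster_mean)
print(filled)
```

On the example data, B ends up in the same cluster as C and D, so its V1 is filled with their mean. A row stays NaN if no other row in its cluster has an observed V1.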