Getting mean of specific column (not dataframe) and using it to replace every NAN value in related c-CodePudding

Suppose I have this dataset, and it had 2 NaN values in column 'alcohol' and 3 NaN values in column 'magnesium'. The real columns do not contain any NaN values, but suppose they did.

What code could I use to compute the mean of each of those columns (the alcohol mean for alcohol) and then fill/replace that column's NaN values with its own mean? The same goes for magnesium.

There are questions on Stack Overflow about filling with a mean taken over the entire dataframe, as opposed to the mean of one particular column.

I know this may also be possible with sklearn.impute and sklearn.preprocessing.

import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)

CodePudding user response：

Try this:

df.fillna(df[["alcohol", "magnesium"]].mean())

Example:

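import numpy as np
import pandas as pd

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({
    "col1": [1, 2, 3, np.nan, 5, 6],
    "alcohol": [1, 2, 3, np.nan, np.nan, 6],
    "magnesium": [1, np.nan, np.nan, np.nan, 5, 6],
    "col4": [1, 2, 3, np.nan, 5, 6]})

# fill only the two selected columns, each with its own mean
df.fillna(df[["alcohol", "magnesium"]].mean())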

gives you:

   col1  alcohol  magnesium  col4
0   1.0      1.0        1.0   1.0
1   2.0      2.0        4.0   2.0
2   3.0      3.0        4.0   3.0
3   NaN      3.0        4.0   NaN
4   5.0      3.0        5.0   5.0
5   6.0      6.0        6.0   6.0
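
Note that fillna returns a new DataFrame and leaves df itself unchanged, so the result has to be assigned somewhere if you want to keep it. A minimal sketch of one way to write the filled values back into just the two columns (the names cols and col_means are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, np.nan, 5, 6],
    "alcohol": [1, 2, 3, np.nan, np.nan, 6],
    "magnesium": [1, np.nan, np.nan, np.nan, 5, 6],
})

cols = ["alcohol", "magnesium"]

# Series indexed by column name: one mean per selected column
col_means = df[cols].mean()

# fillna matches the Series index against column names, so each column
# gets its own mean; assigning back updates only these two columns
df[cols] = df[cols].fillna(col_means)
```

Here col1 keeps its NaN, because it was not part of the selection.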

CodePudding user response：

df.mean() will give the mean per column, so you can use:

df.fillna(df.mean())

Note that if a column is entirely null, the mean of that column will be null as well, so its values stay NaN after the fill.
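
A small sketch of that edge case, assuming a made-up column named "empty" that holds only NaN. Passing numeric_only=True is an extra precaution added here, since recent pandas versions raise an error when df.mean() meets non-numeric columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "alcohol": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],  # hypothetical all-NaN column
})

# per-column means; the mean of an all-NaN column is itself NaN
means = df.mean(numeric_only=True)

# "alcohol" is filled with 2.0, "empty" has nothing usable to fill with
filled = df.fillna(means)

# list the columns that still contain missing values after the fill
still_missing = filled.columns[filled.isna().any()].tolist()
```

Checking still_missing afterwards is one way to spot columns that need a different fallback, such as a constant.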