Suppose I have this dataset and it had 2 NaN values in column 'alcohol' and 3 NaN values in column 'magnesium'. The real data has no NaN values, but suppose it did.
What code could I use to compute the mean of each column (the alcohol mean for alcohol) and then replace that column's NaN values with it? The same for magnesium.
The existing questions on Stack Overflow deal with a single mean taken over the entire DataFrame rather than a mean per column.
I know this may be possible with sklearn.impute and sklearn.preprocessing.
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
CodePudding user response:
Try this:
df.fillna(df[["alcohol", "magnesium"]].mean())
Example:
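import numpy as np
import pandas as pd

# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({
    "col1": [1, 2, 3, np.nan, 5, 6],
    "alcohol": [1, 2, 3, np.nan, np.nan, 6],
    "magnesium": [1, np.nan, np.nan, np.nan, 5, 6],
    "col4": [1, 2, 3, np.nan, 5, 6]})
# fill only the two selected columns, each with its own mean
df.fillna(df[["alcohol", "magnesium"]].mean())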
gives you:
   col1  alcohol  magnesium  col4
0   1.0      1.0        1.0   1.0
1   2.0      2.0        4.0   2.0
2   3.0      3.0        4.0   3.0
3   NaN      3.0        4.0   NaN
4   5.0      3.0        5.0   5.0
5   6.0      6.0        6.0   6.0
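Note that fillna returns a new DataFrame and leaves df itself untouched unless the result is assigned back. A minimal sketch of keeping the result, assuming the same toy data as above (the names cols and col_means are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, np.nan, 5, 6],
    "alcohol": [1, 2, 3, np.nan, np.nan, 6],
    "magnesium": [1, np.nan, np.nan, np.nan, 5, 6],
})

cols = ["alcohol", "magnesium"]      # illustrative name: columns to impute
col_means = df[cols].mean()          # Series indexed by column name, one mean per column

# fillna matches the Series index against the column names,
# so each column is filled with its own mean; assign back to keep the result
df[cols] = df[cols].fillna(col_means)

print(df)
```

Columns not listed in cols (here col1) keep their NaN values.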
CodePudding user response:
df.mean()
will give the mean per column, so you can use:
df.fillna(df.mean())
Note that if a column consists entirely of null values, its mean is NaN as well, so fillna leaves that column unfilled.
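A small sketch of that edge case, using made-up columns ("empty" and "label" are hypothetical and not part of the wine data). It also passes numeric_only=True, since in recent pandas versions df.mean() may raise if the frame contains non-numeric columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "alcohol": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],   # hypothetical all-NaN column
    "label": ["a", "b", "c"],            # hypothetical non-numeric column
})

# numeric_only=True skips the string column when computing the means
means = df.mean(numeric_only=True)

# "alcohol" is filled with its mean; "empty" has a NaN mean and stays NaN
filled = df.fillna(means)

print(means)
print(filled)
```

If an all-NaN column has to be filled as well, a fallback value must be chosen explicitly, for example with a second fillna call.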