I want to impute null values with df[col].mean() when df[col] is not entirely null.
I implemented the code below:
if train_x[cols].isna().sum() == len(train_x):  # need to fix
    train_x.loc[:, cols] = train_x[cols].fillna(value=0.0)
else:
    train_x.loc[:, cols] = train_x[cols].fillna(value=train_x[cols].mean())
The code above raises an error because train_x[cols] is a DataFrame, but the condition needs to be evaluated for a single column at a time.
Is there a better way to do this?
Sorry for my poor English.
CodePudding user response:
To impute missing values in a Pandas DataFrame, you can use the fillna() method. This method allows you to replace missing values with a specific value or with the mean of the non-null values in the column.
import pandas as pd
# Load the DataFrame
df = pd.read_csv('data.csv')
# Select the column with missing values
col = 'column_name'
# Calculate the mean of the non-null values in the column
mean_val = df[col].mean()
# Replace missing values with the mean
df[col] = df[col].fillna(mean_val)
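This will replace all the missing values in df[col] with the mean of the non-null values in the column. If the column is entirely null, the mean itself is NaN, so the values stay missing and a fallback value such as 0.0 has to be supplied separately.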
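For the case in the question, where several columns are imputed at once and an entirely null column should fall back to 0.0, the per-column condition can be avoided altogether. The following is a minimal sketch, not a definitive solution: the sample data and the column names "a" and "b" are made up for illustration, and train_x and cols follow the naming used in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "a" has one missing value, "b" is entirely null
train_x = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
})
cols = ["a", "b"]

# mean() skips NaN, so an all-null column gets a NaN mean;
# fillna(0.0) on the resulting Series turns that into the 0.0 fallback
fill_values = train_x[cols].mean().fillna(0.0)

# Passing a Series to DataFrame.fillna fills each column with its own value
train_x[cols] = train_x[cols].fillna(fill_values)
```

Because fill_values is a Series indexed by column name, each column is filled with its own mean, and no explicit loop or if/else is needed.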
Alternatively, you can use the SimpleImputer class from scikit-learn to impute missing values. Here is an example of how you can use it:
from sklearn.impute import SimpleImputer
# Create an instance of the SimpleImputer class
imputer = SimpleImputer(strategy='mean')
# Select the column with missing values
X = df[['column_name']]
# Fit the imputer to the data and transform the data
X = imputer.fit_transform(X)
# Replace the missing values in the original DataFrame
df['column_name'] = X
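This will replace all the missing values in df['column_name'] with the mean of the non-null values in the column. Note that, by default, SimpleImputer discards a column that contains only missing values, so an entirely null column needs extra handling with this approach.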