Pandas imputation under a condition?

I want to impute null values with df[col].mean(), but only when df[col] is not entirely null.

I implemented it like this:

if train_x[cols].isna().sum() == len(train_x): # need to fix
    train_x.loc[:, cols] = train_x[cols].fillna(value=0.0)
else:
    train_x.loc[:, cols] = train_x[cols].fillna(value=train_x[cols].mean())

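The code above raises an error because train_x[cols] is a DataFrame, so the comparison produces one boolean per column and the if statement cannot evaluate it. I need the condition to be checked for each column individually. Is there a better way to do this?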

Sorry for my poor English.

CodePudding user response:

To impute missing values in a Pandas DataFrame, you can use the fillna() method. This method allows you to replace missing values with a specific value or with the mean of the non-null values in the column.

import pandas as pd

# Load the DataFrame
df = pd.read_csv('data.csv')

# Select the column with missing values
col = 'column_name'

# Calculate the mean of the non-null values in the column
mean_val = df[col].mean()

# Replace missing values with the mean
df[col] = df[col].fillna(mean_val)

This replaces the missing values in df[col] with the mean of the non-null values in the column. Note that if a numeric column is entirely null, its mean is NaN, so fillna() leaves that column unchanged.
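The question asks for a different fill value when a column is entirely null, so the check has to happen per column. Below is a minimal sketch of two ways to do that, assuming the columns are numeric. The function names, the empty_fill parameter, and the sample data are made up for illustration; only train_x and cols come from the question.

```python
import numpy as np
import pandas as pd


def impute_per_column(df, cols, empty_fill=0.0):
    """Loop form: mirrors the if/else from the question, one column at a time."""
    out = df.copy()
    for col in cols:
        if out[col].isna().all():
            # Entirely null column: no mean exists, use the fallback value
            out[col] = out[col].fillna(empty_fill)
        else:
            # mean() skips NaN by default
            out[col] = out[col].fillna(out[col].mean())
    return out


def impute_vectorized(df, cols, empty_fill=0.0):
    """Vectorized form: an all-null column has a NaN mean, so replace that
    NaN with the fallback before filling."""
    out = df.copy()
    fill_values = out[cols].mean().fillna(empty_fill)
    out[cols] = out[cols].fillna(fill_values)
    return out


# Hypothetical sample data: "b" is entirely null
train_x = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [2.0, 4.0, np.nan],
})
cols = ["a", "b", "c"]

result_loop = impute_per_column(train_x, cols)
result_vec = impute_vectorized(train_x, cols)
print(result_vec)
```

Both functions return a copy and leave the input untouched. The loop form is easier to extend with other per-column rules, while the vectorized form is shorter when the only special case is an entirely null column.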

Alternatively, you can use the SimpleImputer class from scikit-learn to impute missing values. Here is an example of how you can use it:

from sklearn.impute import SimpleImputer

# Create an instance of the SimpleImputer class
imputer = SimpleImputer(strategy='mean')

# Select the column with missing values
X = df[['column_name']]

# Fit the imputer to the data and transform the data
X = imputer.fit_transform(X)

# Write the imputed values back; fit_transform returns a 2-D array, so flatten it
df['column_name'] = X.ravel()

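This replaces the missing values in df['column_name'] with the mean of the non-null values in the column. Be aware that, by default, SimpleImputer with strategy='mean' discards columns that contain only missing values, so this approach does not handle an entirely null column on its own.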
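To combine SimpleImputer with the condition from the question, one option is to split the columns first: impute the columns that have at least one value with the mean, and fill the entirely null columns with a constant. The following is a rough sketch under that assumption. The function name, the empty_fill parameter, and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_with_sklearn(df, cols, empty_fill=0.0):
    """Mean-impute columns that have data; fill all-null columns with a constant."""
    out = df.copy()
    # True for columns with at least one non-null value
    has_data = out[cols].notna().any()
    mean_cols = [c for c in cols if has_data[c]]
    empty_cols = [c for c in cols if not has_data[c]]

    if mean_cols:
        imputer = SimpleImputer(strategy="mean")
        # Only columns with data are passed in, so none are discarded
        out[mean_cols] = imputer.fit_transform(out[mean_cols])
    if empty_cols:
        out[empty_cols] = empty_fill
    return out


# Hypothetical sample data: "b" is entirely null
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [2.0, 4.0, np.nan],
})
imputed = impute_with_sklearn(df, ["a", "b", "c"])
print(imputed)
```

In a train/test setting you would typically fit the imputer on the training data only and reuse it with transform() on the test data, which this sketch does not cover.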