I'm trying to write a function that fills missing data in a pandas DataFrame. The function takes a dataframe with missing values and the name of the column whose missing values should be filled, and it should return a new dataframe with those values filled. The problem is that the function also fills the missing values in the input dataframe, which is not what I intend. Please see my code below:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
table = pd.DataFrame({'feature1':[3,5,np.nan],'feature2':[4,1,np.nan],'feature3': [6,7,3]})
def missingValueHandle(dataframe,feature):
    df = dataframe
    df[feature] = df[feature].fillna(axis = 0, method = 'ffill')
    imp = SimpleImputer(strategy = 'mean')
    df = imp.fit_transform(df)
    return df
new_dataframe = missingValueHandle(dataframe=table,feature = 'feature1')
new_dataframe
| | feature1 | feature2 | feature3 |
|---|---|---|---|
| 0 | 3.0 | 4.0 | 6 |
| 1 | 5.0 | 1.0 | 7 |
| 2 | 5.0 | NaN | 3 |
table
| | feature1 | feature2 | feature3 |
|---|---|---|---|
| 0 | 3.0 | 4.0 | 6 |
| 1 | 5.0 | 1.0 | 7 |
| 2 | 5.0 | NaN | 3 |
As you can see, my input "table" changes along with the output "new_dataframe". What do I need to do to prevent that from happening?
CodePudding user response:
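Use the assign method instead of assigning into the passed dataframe. In your function, df = dataframe only binds a second name to the same object, so writing to df[feature] also writes to table. assign, by contrast, always returns a new dataframe.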
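def missingValueHandle(dataframe, feature):
    # Note: with scikit-learn's default settings, fit_transform returns
    # a NumPy array rather than a DataFrame.
    return (
        dataframe
        .assign(**{feature: lambda df: df[feature].ffill()})
        .pipe(SimpleImputer(strategy='mean').fit_transform))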
In this case this can also be done without a lambda:
def missingValueHandle(dataframe, feature):
    return (
        dataframe
        .assign(**{feature: dataframe[feature].ffill()})
        .pipe(SimpleImputer(strategy='mean').fit_transform))
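If you want to convince yourself that the input is left alone, here is a small check using only pandas. It is a sketch of the aliasing issue, not the full function: the SimpleImputer step is left out, and the names alias and filled are just for illustration.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'feature1': [3, 5, np.nan],
                      'feature2': [4, 1, np.nan],
                      'feature3': [6, 7, 3]})

# Plain assignment only binds a second name to the same object.
alias = table
print(alias is table)                    # True

# assign builds a new DataFrame and leaves its input untouched.
filled = table.assign(feature1=table['feature1'].ffill())
print(filled is table)                   # False
print(table['feature1'].isna().sum())    # 1 (the original still has its NaN)
print(filled['feature1'].tolist())       # [3.0, 5.0, 5.0]
```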
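The benefit of the lambda approach is that you can add a row filter in the pipeline before the assign and it still works: the lambda receives the already-filtered dataframe, whereas dataframe[feature] in the second version still refers to the unfiltered input.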
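Here is a rough sketch of that difference. The filter (feature3 > 5), the sample data, and the function names fill_with_lambda and fill_without_lambda are made up for the example, and the imputer step is omitted to keep it short.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'feature1': [3, np.nan, 5, np.nan],
                      'feature3': [6, 7, 3, 8]})

def fill_with_lambda(dataframe, feature):
    return (
        dataframe
        .loc[lambda df: df['feature3'] > 5]   # hypothetical row filter
        # The lambda is called with the filtered frame, so ffill only
        # sees the rows that survived the filter.
        .assign(**{feature: lambda df: df[feature].ffill()})
    )

def fill_without_lambda(dataframe, feature):
    return (
        dataframe
        .loc[lambda df: df['feature3'] > 5]   # same hypothetical filter
        # Here ffill runs on the unfiltered input; the result is then
        # aligned to the filtered rows by index.
        .assign(**{feature: dataframe[feature].ffill()})
    )

result = fill_with_lambda(table, 'feature1')
leaky = fill_without_lambda(table, 'feature1')

print(result['feature1'].tolist())  # [3.0, 3.0, 3.0]
print(leaky['feature1'].tolist())   # [3.0, 3.0, 5.0]
```

In the second version the last row is filled with 5.0, a value taken from a row that the filter removed. Whether that matters depends on what the filter is meant to exclude.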