How to prevent in-place modification of a Pandas DataFrame?

I'm trying to write a function that fills missing data in a Pandas DataFrame. The function takes a dataframe with missing values and the name of the column whose missing values should be filled, and it should return a new dataframe with those values filled. The problem is that the function also fills the missing values in the input dataframe, which is not what I intend. Please see my code below:

    import pandas as pd
    import numpy as np
    from sklearn.impute import SimpleImputer
    table = pd.DataFrame({'feature1': [3, 5, np.nan], 'feature2': [4, 1, np.nan], 'feature3': [6, 7, 3]})

    def missingValueHandle(dataframe,feature):
        df = dataframe
        df[feature] = df[feature].ffill()
        imp = SimpleImputer(strategy = 'mean')
        df = imp.fit_transform(df)
        return df

    new_dataframe = missingValueHandle(dataframe=table,feature = 'feature1')
    new_dataframe

	feature1	feature2	feature3
0	3.0	4.0	6
1	5.0	1.0	7
2	5.0	NaN	3

    table

	feature1	feature2	feature3
0	3.0	4.0	6
1	5.0	1.0	7
2	5.0	NaN	3

As you can see, my input `table` changes along with the output `new_dataframe`. What do I need to do to prevent that from happening?

CodePudding user response：

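Use the `assign` method instead of assigning to the passed dataframe. In your function, `df = dataframe` does not make a copy, so `df[feature] = ...` writes into the caller's dataframe.

`.assign` always returns a new dataframe.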
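
The difference can be checked directly. Here is a minimal sketch comparing plain assignment through a second name with `.assign`; the function names `fill_aliased` and `fill_assigned` are made up for illustration and the data is a cut-down version of the question's table:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'feature1': [3, 5, np.nan], 'feature2': [4, 1, np.nan]})

def fill_aliased(dataframe, feature):
    df = dataframe                      # same object under a second name, not a copy
    df[feature] = df[feature].ffill()   # writes into the caller's dataframe
    return df

def fill_assigned(dataframe, feature):
    # .assign builds and returns a new dataframe; the input is left alone
    return dataframe.assign(**{feature: dataframe[feature].ffill()})

# Run the .assign version first, while the input still has its NaN
safe = fill_assigned(table, 'feature1')
input_still_has_nan = bool(table['feature1'].isna().any())

# Now run the aliased version and look at the input again
aliased = fill_aliased(table, 'feature1')
same_object = aliased is table
input_now_filled = not bool(table['feature1'].isna().any())
```

If you would rather keep the assignment style of the original function, starting it with `df = dataframe.copy()` should avoid the problem as well.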

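    def missingValueHandle(dataframe, feature):
        return (
            dataframe
            # assign returns a new dataframe, so the input is not modified
            .assign(**{feature: lambda df: df[feature].ffill()})
            # fit_transform returns a NumPy array by default
            .pipe(SimpleImputer(strategy='mean').fit_transform))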

In this case this can also be done without a lambda:

    def missingValueHandle(dataframe, feature):
        return (
            dataframe
            .assign(**{feature: dataframe[feature].ffill()})
            .pipe(SimpleImputer(strategy='mean').fit_transform))
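
Both versions hand back whatever `fit_transform` returns, which by default is a NumPy array rather than a dataframe. If a dataframe is what you want, one possible sketch is to rebuild it from the array. The name `missing_value_handle_df` is hypothetical, and this assumes no column is entirely NaN, since `SimpleImputer` drops such columns by default and the column labels would then no longer line up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def missing_value_handle_df(dataframe, feature):
    # Forward-fill one column on a new dataframe; the input is not modified
    filled = dataframe.assign(**{feature: lambda df: df[feature].ffill()})
    # Mean-impute the remaining NaN; the result is a plain NumPy array
    values = SimpleImputer(strategy='mean').fit_transform(filled)
    # Rebuild a DataFrame with the same column and row labels
    return pd.DataFrame(values, columns=filled.columns, index=filled.index)

table = pd.DataFrame({'feature1': [3, 5, np.nan],
                      'feature2': [4, 1, np.nan],
                      'feature3': [6, 7, 3]})
new_dataframe = missing_value_handle_df(table, 'feature1')
```

Here `new_dataframe` has no missing values, while `table` still has its original NaN in both `feature1` and `feature2`.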

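The benefit of the lambda approach is that you can add a row filter in the pipeline before the assign and it still works: the lambda is called with the dataframe as it exists at that point in the chain, whereas `dataframe[feature]` always refers to the original, unfiltered input.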
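
To make the row-filter point concrete, here is a small sketch on made-up data. The `keep` column and the variable names are assumptions for illustration only, not part of the question:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'feature1': [3, 5, np.nan, np.nan],
                      'keep':     [True, False, True, True]})

# With a lambda, ffill runs on the filtered rows (index 0, 2, 3),
# so the value carried forward is 3.0
with_lambda = (
    table
    .loc[lambda df: df['keep']]
    .assign(feature1=lambda df: df['feature1'].ffill()))

# Without a lambda, ffill runs on the unfiltered table and is then aligned
# by index, so 5.0 from the dropped row ends up in the result
without_lambda = (
    table
    .loc[lambda df: df['keep']]
    .assign(feature1=table['feature1'].ffill()))
```

Neither version raises an error, but only the lambda version fills from the rows that survived the filter.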