Is there a way to automate data cleaning for pandas DataFrames?

I am cleaning my data for a machine learning project by replacing missing values in the 'Age' column with zeros and missing values in the 'Fare' column with the column mean. The code is given below:

train_data['Age'] = train_data['Age'].fillna(0) 
mean = train_data['Fare'].mean()    
train_data['Fare'] = train_data['Fare'].fillna(mean)

Since I will have to do this multiple times for other datasets, I want to automate the process with a generic function that takes a DataFrame as input, modifies it, and returns the modified DataFrame. The code for that is given below:

def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df

However, when I pass the training data DataFrame:

train_data = data_cleaning(train_data)

I get the following warning and error:

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: 
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
      1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
      3 cross_val_data = data_cleaning(cross_val_data)

/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
      2     df['Age'] = df['Age'].fillna(0)
      3     fare_mean = df['Fare'].mean()
----> 4     df['Fare'] = df['Fare'].fillna()
      5     return df

/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value, method, axis, inplace, limit, downcast)
   4820             inplace=inplace,
   4821             limit=limit,
-> 4822             downcast=downcast,
   4823         )
   4824 

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6311         """
   6312         inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313         value, method = validate_fillna_kwargs(value, method)
   6314 
   6315         self._consolidate_inplace()

/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in validate_fillna_kwargs(value, method, validate_scalar_dict_value)
    368 
    369     if value is None and method is None:
--> 370         raise ValueError("Must specify a fill 'value' or 'method'.")
    371     elif value is None and method is not None:
    372         method = clean_fill_method(method)

ValueError: Must specify a fill 'value' or 'method'.

After some research, I found that I might have to use the apply() and map() functions instead, but I am not sure how to pass in the mean value of the column. Furthermore, this does not scale well, as I would have to calculate all the fill values before passing them into the function, which is cumbersome. Is there a better way to automate data cleaning?

CodePudding user response：

In your function, the line df['Fare'] = df['Fare'].fillna() does not pass any value to fill the missing entries with, so it raises an error. Change it to df['Fare'] = df['Fare'].fillna(fare_mean).
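As a minimal sketch of the corrected function (the sample data below is made up for illustration), note that DataFrame.fillna also accepts a dict mapping column names to fill values, so both columns can be handled in one call without assigning back column by column:

```python
import numpy as np
import pandas as pd


def data_cleaning(df):
    """Return a new DataFrame with missing 'Age' set to 0 and missing 'Fare' set to the column mean."""
    # Series.mean() skips NaN by default, so the mean comes from the known fares only.
    fill_values = {"Age": 0, "Fare": df["Fare"].mean()}
    # With a dict, fillna fills each listed column with its own value and returns a new DataFrame.
    return df.fillna(value=fill_values)


# Hypothetical sample data, only for illustration.
train_data = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Fare": [10.0, 20.0, np.nan],
})
cleaned = data_cleaning(train_data)
print(cleaned)
```

Because this version returns a new DataFrame instead of assigning into the one passed in, the caller's original data is left untouched.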

If you want to use the function from another file in the same directory, you can import it there with:

from file_that_contain_function import function_name

And if you want to make it reusable across your workspace or virtual environment, you may need to create your own Python package.

CodePudding user response：

So yes, the other answer explains where the error is coming from.

However, the warning at the beginning has nothing to do with filling NaNs. It is telling you that you are setting values on a copy of a slice from a DataFrame. Change your code to

def data_cleaning(df):
    df['Age'] = df.loc[:, 'Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df.loc[:, 'Fare'].fillna(fare_mean)  # <- and also fix this error
    return df
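One common way to sidestep the warning is to work on an explicit copy inside the function. The sketch below assumes that train_data was itself created by slicing or filtering another DataFrame, which is the usual trigger. The clean_columns name and the fill_spec argument are hypothetical, added only to show how one function could serve several datasets: a string value is treated as the name of a Series aggregation such as "mean" or "median", so this sketch cannot fill with literal strings.

```python
import numpy as np
import pandas as pd


def clean_columns(df, fill_spec):
    """Fill missing values column by column on a copy of df.

    fill_spec maps a column name to either a scalar fill value or the
    name of a Series aggregation ("mean", "median", ...).
    """
    # Work on an explicit copy so assignments never target a slice of another frame.
    out = df.copy()
    for column, rule in fill_spec.items():
        if isinstance(rule, str):
            # Compute the statistic from the column itself, e.g. out["Fare"].agg("mean").
            fill_value = out[column].agg(rule)
        else:
            fill_value = rule
        out[column] = out[column].fillna(fill_value)
    return out


# Hypothetical data: 'subset' is a filtered slice of 'full'.
full = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 40.0],
    "Fare": [10.0, 20.0, np.nan, 99.0],
    "Pclass": [3, 3, 1, 2],
})
subset = full[full["Pclass"] != 2]
cleaned = clean_columns(subset, {"Age": 0, "Fare": "mean"})
print(cleaned)
```

The same call can then be reused for other frames, for example clean_columns(cross_val_data, {"Age": 0, "Fare": "mean"}), without computing each fill value by hand first.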

I also suggest searching for that specific warning, as there are hundreds of posts detailing it and how to deal with it.