How to apply a class function to replace NaN with the mean within a subset of pandas df columns?

Time:09-17

The class is composed of a set of attributes and functions including:

Attributes:

  • df : a pandas dataframe.
  • numerical_feature_names: df columns with a numeric value.
  • label_column_names: df string columns to be grouped.

Functions:

  • mean(nums): takes a list of numbers as input and returns the mean
  • fill_na(df, numerical_feature_names, label_column_names): takes the class attributes as inputs and returns a transformed df.

And here's the class:


class PLUMBER():
    
    def __init__(self):
        
        ################# attributes ################
        
        self.df=df

        # specify label and numerical features names:
        
        self.numerical_feature_names=numerical_feature_names
        self.label_column_names=label_column_names
        
   
    #####################  mean  ##############################
    
    def mean(self, nums):
        
        total=0.0
        
        for num in nums:
            total=total+num
            
        return total/len(nums)
    

   ############ fill the numerical features ##################
   
    def fill_na(self, df, numerical_feature_names, label_column_names):
        
        # declaring parameters:
        df=self.df
        numerical_feature_names=self.numerical_feature_names
        label_column_names=self.label_column_names
        
        # now replacing NaN with group mean
        
        for numerical_feature_name in numerical_feature_names:
            
            df[numerical_feature_name]=df.groupby([label_column_names]).transform(lambda x: x.fillna(self.mean(x)))
        
            
        return df

When trying to apply it to a pandas df:

if __name__=="__main__":
    
    # initialize class
    plumber=PLUMBER()
    
    # replace NaN with group mean
    df=plumber.fill_na(df=df, numerical_feature_names=numerical_feature_names, label_column_names=label_column_names)
  

The following error is raised:

ValueError: Grouper and axis must be same length

Data and class parameters:

import numpy as np
import pandas as pd

d={'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'], 
   'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'], 
   'level':['A01', 'A01', 'A01', 'A00','A00', 'A00'],
   'job title':['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
   'number':[np.nan, 450, 299, np.nan, 19, 29],
   'age':[np.nan, 30, 28, np.nan, 29, 18]}

df=pd.DataFrame(d)


# headers
column_names=df.columns.values.tolist()
column_names= [column_name.strip() for column_name in column_names]


# label_column_names (to be grouped)
label_column_names=['country', 'level', 'job title']


# numerical_features:
numerical_feature_names = [x for x in column_names if x not in label_column_names]
numerical_feature_names.remove('month')

How could I change the class in order to get the transformed df (i.e. the one that replaces np.nan with its group mean)?

CodePudding user response:

First, the error occurs because label_column_names is already a list, so you don't need the extra [] around it in the groupby. It should be df.groupby(label_column_names)... instead of df.groupby([label_column_names])...
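To see the difference between the two calls, here is a minimal sketch on a cut-down copy of the question's data (two label columns and one numeric column; the variable names nested_error and group_means are only for illustration). With the nested list, pandas treats the inner list as a single array-like key and tries to line it up with the rows, which is where the length mismatch comes from.

```python
import numpy as np
import pandas as pd

# Cut-down frame mirroring the question's data (6 rows).
df = pd.DataFrame({
    "country": ["Japan", "Japan", "Japan", "Poland", "Poland", "Poland"],
    "level": ["A01", "A01", "A01", "A00", "A00", "A00"],
    "number": [np.nan, 450, 299, np.nan, 19, 29],
})
label_column_names = ["country", "level"]

# Nested list: one key that is itself a 2-element array-like,
# which cannot be aligned with the 6 rows of the frame.
try:
    df.groupby([label_column_names])
    nested_error = None
except ValueError as exc:
    nested_error = str(exc)

# Flat list: each element is read as a column name.
group_means = df.groupby(label_column_names)["number"].mean()

print(nested_error)
print(group_means)
```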

Now, to actually solve your problem, in the fill_na function of your class, replace the for loop (you don't actually need it) with:

df[numerical_feature_names] = (
    df[numerical_feature_names]
      .fillna(
          df.groupby(label_column_names)
            [numerical_feature_names].transform('mean')
      )
)

Here you fillna the numerical_feature_names columns with the result of groupby.transform, which holds the group mean of each of those columns.
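Putting it together, one possible way to restructure the class is sketched below. This is not the only way to do it: the class name Plumber, the constructor that takes the data as arguments (instead of reading global variables), and the copy made inside fill_na are my own assumptions, not something from the question. The data is the sample from the question.

```python
import numpy as np
import pandas as pd


class Plumber:
    """Fills NaN in numeric columns with the mean of each label group."""

    def __init__(self, df, numerical_feature_names, label_column_names):
        # Assumption: data is passed in explicitly rather than read from globals.
        self.df = df
        self.numerical_feature_names = numerical_feature_names
        self.label_column_names = label_column_names

    def fill_na(self):
        # Work on a copy so the caller's frame is left untouched.
        df = self.df.copy()
        # Per-row group mean of every numeric column; NaN is skipped by "mean".
        group_means = (
            df.groupby(self.label_column_names)[self.numerical_feature_names]
              .transform("mean")
        )
        df[self.numerical_feature_names] = (
            df[self.numerical_feature_names].fillna(group_means)
        )
        return df


d = {'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
     'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
     'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
     'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager',
                   'Sales Director', 'Sales Director', 'Sales Director'],
     'number': [np.nan, 450, 299, np.nan, 19, 29],
     'age': [np.nan, 30, 28, np.nan, 29, 18]}
df = pd.DataFrame(d)

plumber = Plumber(df, ['number', 'age'], ['country', 'level', 'job title'])
filled = plumber.fill_na()
print(filled)
```

Note that the built-in 'mean' ignores NaN when averaging. The hand-written mean method from the question would not help here even with the groupby fixed: adding NaN to the running total gives NaN, so any group that contains a missing value would be filled with NaN again.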
