Elimination of outliers with z-score method in Python

Time:10-10

I am cleaning a dataset using the z-score with a threshold of 3. Below is the code that I am using. As you can see, I first calculate the mean and the standard deviation. The code then loops over every value, computes its z-score and, if it is greater than or equal to 3, treats the value as an outlier and adds it to the list "outlier". Finally, the values in the outlier list are removed from the dataset.

"""SD MonthlyIncome"""
MonthlyIncome_std = df ['MonthlyIncome'].std()
MonthlyIncome_std

"""MEAN MonthlyIncome"""
MonthlyIncome_mean = df ['MonthlyIncome'].mean()
MonthlyIncome_mean

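threshold = 3
outlier = []
# Collect every value whose z-score reaches the threshold
for i in df['MonthlyIncome']:
    z = (i - MonthlyIncome_mean) / MonthlyIncome_std
    if z >= threshold:
        outlier.append(i)

# Remove the collected outliers once, after the loop has finished
df = df[~df.MonthlyIncome.isin(outlier)]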
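
For comparison, the same single-column filter can probably be written without an explicit loop by computing the z-scores for the whole column at once. This is only a sketch: the helper name drop_high_outliers and the sample data are made up for illustration. Like the loop above, it removes only values on the high side (z >= threshold).

```python
import pandas as pd


def drop_high_outliers(df, column, threshold=3):
    """Return df without the rows whose z-score in `column` is >= threshold."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation, the pandas default
    z = (df[column] - mean) / std  # z-score of every row in one step
    return df[~(z >= threshold)]


# Hypothetical sample data: twenty ordinary incomes and one extreme value
df_example = pd.DataFrame({"MonthlyIncome": [3000] * 20 + [100000]})
cleaned = drop_high_outliers(df_example, "MonthlyIncome")
print(len(df_example), len(cleaned))
```

To treat unusually low values as outliers too, the comparison could use z.abs() >= threshold instead.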

The above code works fine, but I have to repeat it for every numerical column. I tried to write a function that does the same thing and can be reused for every numerical column. Here is the function:

def z_score_elimination(df):
    for col in df.columns:
        if df[col].dtypes == 'float64' or df[col].dtypes == 'int64':
            threshold = 3
            outlier = []
            col_mean = col.mean()
            col_std = col.std()
            z = (i-col_mean)/col_std
            if z >= threshold: 
                outlier.append(i) 
                df = df[~df.col.isin(outlier)]

Calling the function raises the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-62-4f8b1224061e> in <module>
----> 1 z_score_elimination(df)

<ipython-input-61-dc3c84b60dd1> in z_score_elimination(df)
      4             threshold = 3
      5             outlier = []
----> 6             col_mean = col.mean()
      7             col_std = col.std()
      8             z = (i-col_mean)/col_std

AttributeError: 'str' object has no attribute 'mean'

How can I fix the code?

CodePudding user response:

You are iterating over column names, which are strings, not the actual columns. Try:

df[col].mean()
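
A small check, using a made-up three-row DataFrame (df_demo is a hypothetical name), shows the difference between the column label and the column itself:

```python
import pandas as pd

# Hypothetical data, only to show what the loop variable holds
df_demo = pd.DataFrame({"MonthlyIncome": [3000, 4000, 5000]})

for col in df_demo.columns:
    print(type(col).__name__)           # str: the column label
    print(type(df_demo[col]).__name__)  # Series: the actual column
    col_mean = df_demo[col].mean()      # works, because it is called on the Series
```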

CodePudding user response:

col is a string holding the column name, not the column itself. I think you want col_mean = df[col].mean() and col_std = df[col].std().
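
Putting that together, one possible version of the whole function is sketched below. It goes beyond the fix above in ways the question did not ask for, so treat these as assumptions: the z-scores are computed for the whole column at once instead of through the undefined loop variable i, df[col] replaces df.col, numeric columns are picked with select_dtypes, and the filtered DataFrame is returned. Columns are processed one after another, so each mean and standard deviation is computed on the rows that survived the previous columns.

```python
import pandas as pd


def z_score_elimination(df, threshold=3):
    """Drop rows whose z-score is >= threshold in any numeric column."""
    for col in df.select_dtypes(include="number").columns:
        col_mean = df[col].mean()
        col_std = df[col].std()
        # A constant column gives NaN z-scores, which never pass the test below
        z = (df[col] - col_mean) / col_std
        df = df[~(z >= threshold)]
    return df


# Hypothetical sample data with one outlier per numeric column
df_example = pd.DataFrame({
    "MonthlyIncome": [3000] * 20 + [100000],
    "Age": [120] + [30] * 20,
    "Department": ["Sales"] * 21,  # non-numeric, skipped
})
cleaned = z_score_elimination(df_example)
print(len(df_example), len(cleaned))
```

The function returns a new DataFrame rather than modifying the one passed in, so the result has to be assigned, for example df = z_score_elimination(df).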
