custom function to extract and visualize outliers in python-CodePudding

For the dataframe below,I want to write a function that

I. extract the outliers for each column and export the output as a csv file (I need help with this one)

II. visualize using boxplot and export as pdf file

Outlier definition: as boundaries ±3 standard deviations from the mean

as being any point of data that lies over 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3)in a data set. High = (Q3) 1.5 IQR Low = (Q1) – 1.5 IQR

See below for the dataset and my attempt :

# dataset 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
  
# intialise data of lists.
data = {'region':['R1', 'R1', 'R2', 'R2', 'R2','R1','R1','R1','R2','R2'],
         'cost':[120.05, 181.90, 10.21, 133.01, 311.19,2003.4,112.4,763.2,414.8,812.5],
        'commission':[110.21, 191.12, 190.21,15.31, 245.09,63.41,811.3,10.34, 153.10, 311.17],
        'salary':[10022,19910, 19113,449999, 25519,140.29, 291.07, 390.22, 245.09, 4122.62],
        'revenue':[14029, 29100, 39022, 24509, 412271,110.21, 191.12, 190.21, 12.00, 245.09],
        'tax':[120.05, 181.90, 10.34, 153.10, 311.17,52119,32991,52883,69359,57835],
         'debt':[100.22,199.10, 191.13,199.99, 255.19,41218, 52991,1021,69152,79355],
         'income': [43211,7672991,56881,211,77342,100.22,199.10, 191.13,199.99, 255.19],
          'rebate': [31.21,429.01,538.18,621.58,6932.5,120.05, 181.90, 10.34, 153.10, 311.17],
        'scale':['small','small','small','mid','mid','large','large','mid','large','small']
        }
  
# Create DataFrame
df = pd.DataFrame(data)
  
# Print the output.
df

############## my attempt ####################


def outlier_extractor(data):
    # select numeric columns
    numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
    #(I)Extract and export outliers as csv..... I need help with this one
    
    #(II) boxplot visualization 
    
    plt.figure(figsize=(10, 9))
    
    for i, variable in enumerate(numeric_columns):
        plt.subplot(4, 4, i   1)
        plt.boxplot(data[variable],whis=1.5)
        plt.tight_layout()
        plt.title(variable)
    plt.savefig('graph_outliers.pdf')
    plt.show()

# driver code 
outlier_extractor(df)

Please comment and share your full code. Thanks in advance

CodePudding user response：

def outlier_extractor(data):
    numeric_data = data.select_dtypes(include=np.number)
    Q1, Q3 = numeric_data.quantile(.25), numeric_data.quantile(.75)
    IQR = Q3-Q1
    numeric_data[:] = np.where((numeric_data > Q3 1.5*IQR)|(numeric_data < Q1-1.5*IQR), np.nan, numeric_data)
    numeric_data.apply(lambda series:series.dropna().to_csv(series.name ".csv"))
    
    plt.figure(figsize=(10, 9))
    for i, variable in enumerate(numeric_columns):
        plt.subplot(4, 4, i   1)
        plt.boxplot(data[variable],whis=1.5)
        plt.tight_layout()
        plt.title(variable)
    #plt.savefig('graph_outliers.pdf')
    plt.show()

outlier_extractor(df)

Note that the apply function saves each filtered column in one different csv file. From your description I though that this was your task.

Note also that you don't need the seaborn package

EDIT

To export all the filtered dataframe with missing values replacing the ouliers you have to replace the to_csv row with:

numeric_data.to_excel("filtered_numeric_data.xlsx")