For the dataframe below,I want to write a function that
I. extract the outliers for each column and export the output as a csv file (I need help with this one)
II. visualize using boxplot and export as pdf file
Outlier definition: as boundaries ±3 standard deviations from the mean
OR
as being any point of data that lies over 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3)in a data set. High = (Q3) 1.5 IQR Low = (Q1) – 1.5 IQR
See below for the dataset and my attempt :
# dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# intialise data of lists.
data = {'region':['R1', 'R1', 'R2', 'R2', 'R2','R1','R1','R1','R2','R2'],
'cost':[120.05, 181.90, 10.21, 133.01, 311.19,2003.4,112.4,763.2,414.8,812.5],
'commission':[110.21, 191.12, 190.21,15.31, 245.09,63.41,811.3,10.34, 153.10, 311.17],
'salary':[10022,19910, 19113,449999, 25519,140.29, 291.07, 390.22, 245.09, 4122.62],
'revenue':[14029, 29100, 39022, 24509, 412271,110.21, 191.12, 190.21, 12.00, 245.09],
'tax':[120.05, 181.90, 10.34, 153.10, 311.17,52119,32991,52883,69359,57835],
'debt':[100.22,199.10, 191.13,199.99, 255.19,41218, 52991,1021,69152,79355],
'income': [43211,7672991,56881,211,77342,100.22,199.10, 191.13,199.99, 255.19],
'rebate': [31.21,429.01,538.18,621.58,6932.5,120.05, 181.90, 10.34, 153.10, 311.17],
'scale':['small','small','small','mid','mid','large','large','mid','large','small']
}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
############## my attempt ####################
def outlier_extractor(data):
# select numeric columns
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
#(I)Extract and export outliers as csv..... I need help with this one
#(II) boxplot visualization
plt.figure(figsize=(10, 9))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i 1)
plt.boxplot(data[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.savefig('graph_outliers.pdf')
plt.show()
# driver code
outlier_extractor(df)
Please comment and share your full code. Thanks in advance
CodePudding user response:
def outlier_extractor(data):
numeric_data = data.select_dtypes(include=np.number)
Q1, Q3 = numeric_data.quantile(.25), numeric_data.quantile(.75)
IQR = Q3-Q1
numeric_data[:] = np.where((numeric_data > Q3 1.5*IQR)|(numeric_data < Q1-1.5*IQR), np.nan, numeric_data)
numeric_data.apply(lambda series:series.dropna().to_csv(series.name ".csv"))
plt.figure(figsize=(10, 9))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i 1)
plt.boxplot(data[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
#plt.savefig('graph_outliers.pdf')
plt.show()
outlier_extractor(df)
Note that the apply
function saves each filtered column in one different csv
file. From your description I though that this was your task.
Note also that you don't need the seaborn
package
EDIT
To export all the filtered dataframe with missing values replacing the ouliers you have to replace the to_csv
row with:
numeric_data.to_excel("filtered_numeric_data.xlsx")