How to delete outliers on linear scatter plot

Time:03-30

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

I want to delete the outliers that fall outside the green lines from my data. My actual data is in a DataFrame and contains many outliers. It would also be helpful to scale the width of the band between the green lines for larger values of x.

x = [1,1.1,2,3,4,5,5.5,6,7,8,9,10, 10.10]
y = [1,0.1,2,3,4,5,1,6,7,8,9,10, 16]

df = pd.DataFrame(list(zip(x, y)),
               columns =['x_vals', 'y_vals'])

plt.scatter(df.x_vals,df.y_vals)
z = np.polyfit(df.x_vals,df.y_vals, 1)
p = np.poly1d(z)
z = plt.plot(df.x_vals,p(df.x_vals),"r--")
z_1 = plt.plot(df.x_vals,p(df.x_vals) + 1,"g--")
z_2 = plt.plot(df.x_vals,p(df.x_vals) - 1,"g--")

plt.show()


Answer:

You can select the outliers as the points for which the y-value is either larger than the upper limit or smaller than the lower limit.

The following example circles the retained points in purple and marks the outliers with a red cross.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

x = [1, 1.1, 2, 3, 4, 5, 5.5, 6, 7, 8, 9, 10, 10.10]
y = [1, 0.1, 2, 3, 4, 5, 1, 6, 7, 8, 9, 10, 16]

df = pd.DataFrame(list(zip(x, y)),
                  columns=['x_vals', 'y_vals'])

plt.scatter(df.x_vals, df.y_vals)
z = np.polyfit(df.x_vals, df.y_vals, 1)
p = np.poly1d(z)
z = plt.plot(df.x_vals, p(df.x_vals), "r--")
delta = 1
z_1 = plt.plot(df.x_vals, p(df.x_vals) + delta, "g--")
z_2 = plt.plot(df.x_vals, p(df.x_vals) - delta, "g--")
plt.fill_between(df.x_vals, p(df.x_vals) - delta, p(df.x_vals) + delta, color='g', alpha=0.1)

outliers = df[(df.y_vals < p(df.x_vals) - delta) | (df.y_vals > p(df.x_vals) + delta)]
cleaned = df[(df.y_vals >= p(df.x_vals) - delta) & (df.y_vals <= p(df.x_vals) + delta)]

plt.scatter(cleaned.x_vals, cleaned.y_vals, s=100, fc='none', ec='purple')
plt.scatter(outliers.x_vals, outliers.y_vals, marker='x', s=200, fc='none', ec='r')

plt.show()

[Figure: identifying outliers]
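The question also asks about widening the band for larger values of x. One possible sketch, assuming the tolerance should grow linearly with x: the helper `drop_outliers` and its parameters `base` and `rate` are hypothetical names introduced here for illustration, not part of the original answer. With `rate=0` it should reduce to the fixed `delta` used above.

```python
import numpy as np
import pandas as pd

x = [1, 1.1, 2, 3, 4, 5, 5.5, 6, 7, 8, 9, 10, 10.10]
y = [1, 0.1, 2, 3, 4, 5, 1, 6, 7, 8, 9, 10, 16]
df = pd.DataFrame({'x_vals': x, 'y_vals': y})


def drop_outliers(df, base=1.0, rate=0.0):
    """Keep rows whose distance from a straight-line fit is at most base + rate * x.

    base and rate are illustrative parameters (assumptions, not from the answer).
    """
    xs = df.x_vals.to_numpy()
    ys = df.y_vals.to_numpy()
    # Fit a degree-1 polynomial to all points, as in the answer above
    p = np.poly1d(np.polyfit(xs, ys, 1))
    # Absolute vertical distance of each point from the fitted line
    residual = np.abs(ys - p(xs))
    # Allowed half-width of the band, growing linearly with x
    delta = base + rate * xs
    return df[residual <= delta]


cleaned = drop_outliers(df, base=1.0, rate=0.1)
print(cleaned)
```

Because the line is fitted to all points, the outliers themselves pull the fit toward them. Refitting on `cleaned` and filtering again may give a tighter result, though whether that is appropriate depends on the data.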