I'm wondering how to remove unrepresentative samples from a population.
Let's assume that I have the following numbers as a list:
pop = [1, 2, 3, 4, 1, 3, 3, 4, 5, 7, 2, 1, 1, 1, 1, 1, 1002]
Is there any way to inspect the list and remove the value 1002, since it is not representative at all and can distort calculations?
To illustrate this, let's make the following computation:
>>> from statistics import mean  # assumed import; numpy.mean gives the same values
>>> pop = [1, 2, 3, 4, 1, 3, 3, 4, 5, 7, 2, 1, 1, 1, 1, 1, 1002]
>>> mean(pop)
61.294117647058826
>>> pop1 = [1, 2, 3, 4, 1, 3, 3, 4, 5, 7, 2, 1, 1, 1, 1, 1]
>>> mean(pop1)
2.5
CodePudding user response:
import numpy as np
import pandas as pd

data = {"pop": [1, 2, 3, 4, 1, 3, 3, 4, 5, 7, 2, 1, 1, 1, 1, 1, 1002]}
df = pd.DataFrame(data)
# Interquartile range (IQR) rule: keep values within 1.5 * IQR of the quartiles
q3 = np.quantile(df["pop"], 0.75)
q1 = np.quantile(df["pop"], 0.25)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr
df_wo_outliers = df[(df["pop"] >= lower_bound) & (df["pop"] <= upper_bound)]
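But be aware that this rule can also remove values you might want to keep. For this data q1 = 1, q3 = 4 and iqr = 3, so the bounds are -3.5 and 8.5: 7 is kept here and only 1002 is removed, but a value such as 9 would be dropped as well.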
df_wo_outliers["pop"].mean()
# 2.5
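
If you don't need pandas, the same IQR rule can be sketched with NumPy alone. This is only an illustration: the helper name `remove_outliers_iqr` and its `k` parameter are made up here, and `k=1.5` is just the conventional multiplier used above, not a value that suits every data set.

```python
import numpy as np

def remove_outliers_iqr(values, k=1.5):
    """Return (kept_values, lower_bound, upper_bound) using the IQR rule.

    Hypothetical helper for illustration; k is the IQR multiplier.
    """
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(arr, [0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the values that fall inside [lower, upper]
    mask = (arr >= lower) & (arr <= upper)
    return arr[mask], lower, upper

pop = [1, 2, 3, 4, 1, 3, 3, 4, 5, 7, 2, 1, 1, 1, 1, 1, 1002]
kept, lower, upper = remove_outliers_iqr(pop)
print(lower, upper)  # -3.5 8.5
print(kept.mean())   # 2.5
```

Printing the bounds before filtering is a cheap way to check which values the rule is about to drop.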