I'm trying to remove outliers from a dataset, where an item counts as an outlier if the difference between it and the next item is larger than 3 * the uncertainty on the item.
def remove_outliers(data):
    for i in data:
        x = np.where(abs(i[1] - (i + 1)[1]) > 3( * data[:,2]))
        data_outliers_removed = np.delete(data, x, axis =1)
    return data_outliers_removed
is the function I tried to use; however, whenever I've played around with it, it deletes either no values or all of them.
CodePudding user response:
I would do something like this, building up a new list of the rows to keep:
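def remove_outliers(dataset):
    # Assumes each row is (value, uncertainty); adjust the indices for your data
    filtered_dataset = []
    for index, item in enumerate(dataset):
        if index == 0:
            # The first row has no previous row to compare against, so keep it
            filtered_dataset.append(item)
        else:
            # Keep the row if its jump from the previous row is within
            # 3 * the previous row's uncertainty
            if abs(item[0] - dataset[index - 1][0]) <= 3 * dataset[index - 1][1]:
                filtered_dataset.append(item)
    return filtered_dataset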
Of course, the same can be achieved easily with NumPy. Hope that helps.
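For a quick sanity check, here is a minimal sketch that runs the same loop logic on made-up rows of (value, uncertainty). The sample numbers are assumptions for illustration only, and the two branches are merged into one condition to keep it short:

```python
def remove_outliers(dataset):
    filtered_dataset = []
    for index, item in enumerate(dataset):
        # Keep the first row, and any row close enough to the row before it
        if index == 0 or abs(item[0] - dataset[index - 1][0]) <= 3 * dataset[index - 1][1]:
            filtered_dataset.append(item)
    return filtered_dataset

# Hypothetical sample rows: (value, uncertainty)
rows = [(1.0, 0.5), (1.2, 0.5), (9.0, 0.5), (9.1, 0.5)]
result = remove_outliers(rows)
print(result)
```

Note that each row is compared against the previous row of the original data, not the previous row that was kept, so the row with value 9.1 survives here even though 9.0 is dropped.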
CodePudding user response:
Iterating over a numpy array is usually a code smell, since you give up numpy's super-fast indexing and slicing abilities for Python's slow loops. I'm assuming data is a numpy array since you've used it like one.
Your criterion for an outlier is:
if the difference between one item and the next one is larger than 3 * the uncertainty on the item
From your usage, it appears the "items" are in the data[:, 1] column, and the uncertainties are in the data[:, 2] column.
The difference between an item and the next one is easy to obtain using np.diff. Taking its absolute value, as in your attempt, the condition for an outlier becomes:
np.abs(np.diff(data[:, 1])) > 3 * data[:-1, 2]
I skipped the last uncertainty by using data[:-1, 2] because the last uncertainty doesn't matter -- the last item doesn't have a "next element". I'm going to treat the last item as an outlier and filter it out, but I've also shown how to keep it if you want.
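To see why the shapes line up, here is a minimal sketch on made-up data. The sample values and the column layout (index, value, uncertainty) are assumptions for illustration:

```python
import numpy as np

# Hypothetical sample: columns are (index, value, uncertainty)
data = np.array([
    [0.0, 1.0, 0.5],
    [1.0, 1.2, 0.5],
    [2.0, 9.0, 0.5],
    [3.0, 9.1, 0.5],
])

steps = np.abs(np.diff(data[:, 1]))  # one entry per adjacent pair, so length n - 1
limits = 3 * data[:-1, 2]            # uncertainty of the first item in each pair
is_outlier = steps > limits          # elementwise comparison, also length n - 1

print(steps.shape, limits.shape)
print(is_outlier)
```

Both arrays have one entry fewer than the data has rows, which is why the last row needs separate handling. Only the second row is flagged here, because the jump from 1.2 to 9.0 exceeds 3 * 0.5.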
We will use boolean indexing to filter out the rows we don't want in our array:
import numpy as np

def remove_outliers(data):
    # Start with a mask of all False: rows are dropped unless marked as keepers.
    # The last row has no "next item", so it stays False and is filtered out.
    # If you want to keep it instead, use `np.ones(...)` here.
    select_mask = np.zeros(data[:, 1].shape, dtype=bool)
    # Fill in the mask for the first through the second-last rows: keep a row
    # when the jump to the next item is within 3 * its uncertainty
    # (the opposite of the outlier condition)
    select_mask[:-1] = np.abs(np.diff(data[:, 1])) <= 3 * data[:-1, 2]
    # Select only those rows where select_mask is True, and all columns
    filtered_data = data[select_mask, :]
    return filtered_data
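Here is a minimal usage sketch of the boolean-mask approach on made-up data. The keep_last parameter is a hypothetical addition of mine, not part of the function above; it just switches the starting mask between all False and all True:

```python
import numpy as np

def remove_outliers(data, keep_last=False):
    # keep_last is a hypothetical switch for the last row, which has no next item
    keep = np.full(data.shape[0], keep_last, dtype=bool)
    # Keep a row when the jump to the next value is within 3 * its uncertainty
    keep[:-1] = np.abs(np.diff(data[:, 1])) <= 3 * data[:-1, 2]
    return data[keep]

# Hypothetical sample: columns are (index, value, uncertainty)
data = np.array([
    [0.0, 1.0, 0.5],
    [1.0, 1.2, 0.5],
    [2.0, 9.0, 0.5],
    [3.0, 9.1, 0.5],
])

filtered = remove_outliers(data)
filtered_with_last = remove_outliers(data, keep_last=True)
print(filtered[:, 0])
print(filtered_with_last[:, 0])
```

With this data the second row is dropped in both cases, since the value after it jumps by far more than 3 * 0.5, and the last row is kept only when keep_last=True.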