Removing outliers in a df containing mixed dtype


I am working with a pandas DataFrame containing numerical columns as well as string columns (dtype object), and would like to remove the rows containing outliers with respect to the distribution within each column. In other words: detect the outliers in each column and drop the corresponding rows.
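
For illustration, a small frame of the kind I mean (these column names and values are made up):

import pandas as pd

df = pd.DataFrame({
    'height': [1.70, 1.80, 1.65, 12.0],          # 12.0 is an obvious outlier
    'city': ['Rome', 'Milan', 'Rome', 'Turin'],  # dtype object
})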

I have found two solutions, but neither accounts for the fact that my df does not contain only numbers, so both result in errors (when encountering strings, I assume).

Way 1:

import numpy as np
from scipy import stats

df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

returns TypeError: unsupported operand type(s) for /: 'str' and 'int', which is why I guess the error arises from the df having mixed dtypes.

Way 2:

for col in df.columns:
    lower = df[col].quantile(0.05)
    upper = df[col].quantile(0.95)
    df = df[col].clip(lower=lower, upper=upper)

returns a KeyError, I suspect because df is overwritten with a single column (a Series) on the first iteration, so the next df[col] lookup fails. Traceback:

File omissis, in Class.remove_outliers(self, df)
    423 def remove_outliers(self, df):
    424     for col in df.columns:
--> 425         lower = df[col].quantile(0.05)
    426         upper = df[col].quantile(0.95)
    427         df = df[col].clip(lower=lower, upper=upper)

File omissis, in Series.__getitem__(self, key)
    955     return self._values[key]
    957 elif key_is_scalar:
--> 958     return self._get_value(key)
    960 if is_hashable(key):
    961     # Otherwise index.get_value will raise InvalidIndexError
    962     try:
    963         # For labels that don't resolve as scalars like tuples and frozensets

File omissis, in Series._get_value(self, label, takeable)
   1066     return self._values[label]
   1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
   1070 return self.index._get_values_for_loc(self, loc, label)

File omissis, in RangeIndex.get_loc(self, key, method, tolerance)
    387             raise KeyError(key) from err
    388     self._check_indexing_error(key)
--> 389     raise KeyError(key)
    390 return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: 'colname'

How would you solve this?

EDIT: the idea is to skip the non-numeric columns, i.e. ignore them when looking for outliers.

CodePudding user response:

I would break the problem into stages:

Firstly, select the (numeric) columns you want to run the outlier removal on:

newdf = df.select_dtypes(include=np.number)

Now perform whatever filtering/outlier removal you want on the rows of newdf, so that afterwards newdf contains only the rows you wish to retain.
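
For example, a row-dropping version of the quantile filter from the question (a sketch; the 5%/95% bounds are just the ones from your attempt):

# Keep only rows where every numeric value lies between its column's
# 5th and 95th percentiles
lower = newdf.quantile(0.05)   # per-column lower bounds (a Series)
upper = newdf.quantile(0.95)   # per-column upper bounds (a Series)
newdf = newdf[((newdf >= lower) & (newdf <= upper)).all(axis=1)]

Note that clip, as used in the question, caps values in place rather than dropping rows, so a boolean filter fits the stated goal better.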

Then keep only the rows of df whose index is still present in newdf:

df = df[df.index.isin(newdf.index)]

CodePudding user response:

@lpounng's solution in copy-paste-ready code:

import numpy as np
from scipy import stats

def remove_outliers(df):
    # Z-score only the numeric columns, keep rows with all |z| < 3
    newdf = df.select_dtypes(include=np.number)
    newdf = newdf[(np.abs(stats.zscore(newdf)) < 3).all(axis=1)]
    # Filter the original df (all dtypes) down to the surviving rows
    df = df[df.index.isin(newdf.index)]
    return df
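
Usage, on the mixed-dtype df from the question:

df = remove_outliers(df)  # drops rows with any numeric |z| >= 3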

CodePudding user response:

In addition to @lpounng's solution: for a categorical variable you cannot use a z-score, but you can treat rarely occurring classes as outliers by setting a threshold on the value counts.

Example with a toy dataset:

import random
import pandas as pd

# Draw 100 colors with weights 10 : 1 : 2
colors = random.choices(['yellow', 'white', 'red'], weights=[10, 1, 2], k=100)
df = pd.DataFrame(colors, columns=['colors'])

This randomly generates a column with yellow, white and red categorical values, weighted 10, 1 and 2 respectively.

With the pandas value_counts() method you can count the occurrences of each unique category in the column:

df['colors'].value_counts()
>>> yellow    68
    red       20
    white     12
    Name: colors, dtype: int64

Now you can set a threshold and remove the categories that are sparsely populated, such as white.
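
A minimal sketch of that last step (the threshold of 15 is an arbitrary cutoff for this toy data):

# Drop rows whose category occurs fewer than `threshold` times
threshold = 15
counts = df['colors'].value_counts()
keep = counts[counts >= threshold].index
df = df[df['colors'].isin(keep)]

With the counts above, this keeps yellow and red and drops the 12 white rows.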
