I am working on a pandas DataFrame containing numerical columns as well as string columns (dtype object), and I would like to remove the rows containing outliers with respect to the distribution within each column. In other words, detect the outliers in each numeric column and drop the corresponding rows.
I have found two solutions, but neither takes into account that my df does not contain only numbers, so both result in errors (when encountering strings, I assume).
import numpy as np
from scipy import stats

df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
returns TypeError: unsupported operand type(s) for /: 'str' and 'int', which is why I guess the error arises from the df having mixed dtypes.
for col in df.columns:
    lower = df[col].quantile(0.05)
    upper = df[col].quantile(0.95)
    df = df[col].clip(lower=lower, upper=upper)
returns KeyError
with this traceback:
File omissis, in Class.remove_outliers(self, df)
423 def remove_outliers(self, df):
424 for col in df.columns:
--> 425 lower = df[col].quantile(0.05)
426 upper = df[col].quantile(0.95)
427 df = df[col].clip(lower=lower, upper=upper)
File omissis, in Series.__getitem__(self, key)
955 return self._values[key]
957 elif key_is_scalar:
--> 958 return self._get_value(key)
960 if is_hashable(key):
961 # Otherwise index.get_value will raise InvalidIndexError
962 try:
963 # For labels that don't resolve as scalars like tuples and frozensets
File omissis, in Series._get_value(self, label, takeable)
1066 return self._values[label]
1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
1070 return self.index._get_values_for_loc(self, loc, label)
File omissis, in RangeIndex.get_loc(self, key, method, tolerance)
387 raise KeyError(key) from err
388 self._check_indexing_error(key)
--> 389 raise KeyError(key)
390 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 'colname'
How would you solve this?
EDIT: the idea is to skip the non-numeric columns, i.e. to ignore them.
CodePudding user response:
I would break the problem into stages:
Firstly, identify the (numeric) columns you want to do the outlier removal on. Reference
newdf = df.select_dtypes(include=np.number)
Now perform whatever filtering/outlier removal you want on the rows of newdf. Afterwards, newdf should contain only the rows you wish to retain.
Then keep only the rows of df whose index is in newdf. Reference
df = df[df.index.isin(newdf.index)]
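Put together, the three stages look like this on a toy frame (the data and the z-score filter here are just for illustration, not part of the answer above):

```python
import numpy as np
import pandas as pd
from scipy import stats

# toy frame: a numeric column with one extreme outlier, plus a string column
df = pd.DataFrame({
    'value': list(range(99)) + [10_000],
    'label': ['a'] * 100,
})

# stage 1: keep only the numeric columns
newdf = df.select_dtypes(include=np.number)

# stage 2: any row-level filter you like, e.g. the z-score rule from the question
newdf = newdf[(np.abs(stats.zscore(newdf)) < 3).all(axis=1)]

# stage 3: keep the rows of the original frame whose index survived the filter
df = df[df.index.isin(newdf.index)]
print(len(df))  # 99 — the outlier row is gone, the string column is untouched
```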
CodePudding user response:
@lpounng's solution in copy-paste-ready code:
import numpy as np
from scipy import stats

def remove_outliers(df):
    newdf = df.select_dtypes(include=np.number)
    newdf = newdf[(np.abs(stats.zscore(newdf)) < 3).all(axis=1)]
    df = df[df.index.isin(newdf.index)]
    return df
CodePudding user response:
In addition to @lpounng's solution: for categorical variables you cannot use the z-score, but you could consider low-count classes as outliers. You can do that by setting a threshold on the value counts.
Example with a toy dataset:
import random
import pandas as pd

colors = []
for i in range(100):
    colors.append(random.choices(['yellow', 'white', 'red'], weights=[10, 1, 2])[0])
df = pd.DataFrame(colors, columns=['colors'])
I randomly generate a column of yellow, white, and red categorical values with weights of 10, 1, and 2.
With the pandas value_counts() method you can compute the count of each unique category in the column
df['colors'].value_counts()
>>> yellow 68
red 20
white 12
Name: colors, dtype: int64
Now you can set a threshold and remove the categories that are sparsely populated, such as white.
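For instance, reproducing the counts shown above deterministically (the threshold of 15 is an arbitrary choice for this example):

```python
import pandas as pd

# deterministic toy data mirroring the value_counts above: white is rare
df = pd.DataFrame({'colors': ['yellow'] * 68 + ['red'] * 20 + ['white'] * 12})

threshold = 15  # drop any category seen fewer than 15 times
counts = df['colors'].value_counts()
rare = counts[counts < threshold].index  # here: just 'white'
df = df[~df['colors'].isin(rare)]
print(df['colors'].value_counts())  # yellow 68, red 20
```

The same mask-by-index trick from the earlier answer then lets you drop those rows from the full mixed-dtype frame.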