How do I drop outliers across axis 1 in python's Pandas?

Time:09-03

I have a DataFrame with many rows and 2000 columns of samples for each row. Each row is a product, and each column holds one point in a distribution of possible sales for that product.

For each row, how do I drop the points (across columns) that fall outside that row's 2.5% and 97.5% quantiles? I'd like to take the mean across axis=1 with the outliers excluded, separately for each product (row).

Here is some random data:

import numpy as np
import pandas as pd

cols = np.random.rand(10, 2000)
df = pd.DataFrame(cols)

I tried:

df.quantile([.025, .975], axis=1), but that returns only the 2.5% and 97.5% values, with the products as columns.
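
To illustrate what that call returns, here is a minimal sketch on the same kind of random data. The names `bounds` and `bounds_by_product`, the column labels `lower` and `upper`, and the seeded generator are assumptions made for the example, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 10 products (rows), 2000 sampled points each.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 2000)))

# With axis=1 the quantiles are computed along each row, but the result
# has one row per requested quantile and one column per product.
bounds = df.quantile([0.025, 0.975], axis=1)

# Transposing gives one row per product with its own lower/upper bound.
bounds_by_product = bounds.T
bounds_by_product.columns = ["lower", "upper"]
```

So the result is not wrong, it is just the per-row bounds laid out sideways; the filtering step still has to be applied separately.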

CodePudding user response:

You can use np.where to find the positions of the non-outlier columns in each row and average only those:

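def no_outliers(x):
    # Positions of the values within the row's 2.5% and 97.5% quantiles
    return np.where((x >= x.quantile(0.025)) & (x <= x.quantile(0.975)))[0]

# np.where returns positions, so select with .iloc; the result is one mean per product (row)
df.apply(lambda x: x.iloc[no_outliers(x)].mean(), axis=1)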
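
If the row-wise apply turns out to be slow on a large frame, a vectorized variant is one possible alternative. This is only a sketch under the same assumptions as the question's sample data; the names `lower`, `upper`, `inside` and `trimmed_mean` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 10 products (rows), 2000 sampled points each.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 2000)))

# Per-row bounds: each is a Series indexed like df's rows.
lower = df.quantile(0.025, axis=1)
upper = df.quantile(0.975, axis=1)

# Compare every row against its own bounds; axis=0 aligns the
# Series with the row index of df.
inside = df.ge(lower, axis=0) & df.le(upper, axis=0)

# where() turns out-of-range points into NaN, and mean() skips NaN
# by default, so each mean uses only the points inside the bounds.
trimmed_mean = df.where(inside).mean(axis=1)
```

The result is a Series with one mean per product, which should match the apply version above while avoiding a Python-level loop over the rows.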