I have a dataframe with many rows and 2000 columns of samples for each row. Each row is a product, and each column is one point in a distribution of possible sales.
How do I drop all points of each distribution (across columns) that fall outside the 2.5% and 97.5% quantiles of that row? I'd like to take the mean across axis=1
without the outliers in the data. I need to do this for each product (row).
Here is some random data:
import numpy as np
import pandas as pd
cols = np.random.rand(10, 2000)
df = pd.DataFrame(cols)
I tried:
df.quantile([.025, .975], axis=1), but that returns the products as columns and only the 2.5% and 97.5% values as rows.
CodePudding user response:
You can use np.where and take only the non-outlier columns into consideration:
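def no_outliers(x):
    # Positions of the values within this row's 2.5% and 97.5% quantiles
    return np.where((x >= x.quantile(0.025)) & (x <= x.quantile(0.975)))[0]

# Mean of the non-outlier values for each product (row)
df.apply(lambda x: x.iloc[no_outliers(x)].mean(), axis=1)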
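
If the row-wise apply turns out to be slow on a large frame, a vectorized variant might be worth trying. The following is only a sketch under the same assumptions as the question (inclusive bounds at the 2.5% and 97.5% quantiles of each row); the helper name trimmed_row_means and its lower/upper parameters are made up for illustration:

```python
import numpy as np
import pandas as pd

def trimmed_row_means(df, lower=0.025, upper=0.975):
    # Per-row quantile bounds; each is a Series indexed like df's rows.
    lo = df.quantile(lower, axis=1)
    hi = df.quantile(upper, axis=1)
    # True where a value lies within its own row's bounds (inclusive).
    inside = df.ge(lo, axis=0) & df.le(hi, axis=0)
    # Values outside the bounds become NaN, which mean() skips by default.
    return df.where(inside).mean(axis=1)

# Same shape as the sample data in the question, seeded for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 2000)))
row_means = trimmed_row_means(df)
```

This should give one mean per product, matching the apply-based version up to floating-point rounding, since both compute the quantiles per row and keep the values between them.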