I've searched some previous answers about this, but I want to know one thing more about it.
This is the test data:
df = pd.DataFrame({"a":[2,3,4,5,6,8],"b":[3,4,np.nan,6,111,22], "c" : [2,3,4,5, 777,1]})
And this is the one I'm working with to check if the value is outliers per column
def check_outliers(df, domain_list):
outlier_column = []
for domain in domain_list:
Q1 = df[domain].quantile(0.25)
Q3 = df[domain].quantile(0.75)
IQR = Q3 - Q1
min_v = Q1 - (1.5 * IQR)
max_v = Q3 (1.5 * IQR)
df["No_outliers_" domain] = np.where(np.isnan(df[domain]),"-",np.where((df[domain] >= min_v) & (df[domain] <= max_v), "O",domain))
outlier_column.append("No_outliers_" domain)
#df["No_outliers"] = np.where()
df = df.drop(outlier_column, axis=1)
return df
df = check_outliers(df,["a","b","c"])
I see that many recommends to use np.where
or np.select
for this, but what I want to know more is that dealing with multiple columns for the condition.
I want to make "No_outliers" column which contains the column names if the value of the column is an outlier. Also "-" for marking np.nan
value.
So it should have "No_outliers" : ["","","b","","b, c",""]
since 111
in column "b"
and 777
in column "c"
would be an outlier in each column.
I think I can use .any()
here but I couldn't. I must have used it in a wrong way.
Hope you can help me with this.
Thank you!
CodePudding user response:
We can combine multiple boolean masks using logical OR
to create a resulting mask where a True values represent an outlier, then take the dot product of this mask with the columns and assign the result to No_outliers
column
mask = df[domain].lt(min_v) | df[domain].gt(max_v) | df[domain].isna()
df['No_outliers'] = (mask @ (mask.columns ', ')).str[:-2]
Result
print(df)
a b c No_outliers
0 2 3.0 2
1 3 4.0 3
2 4 NaN 4 b
3 5 6.0 5
4 6 111.0 777 b, c
5 8 22.0 1