String merge in Dataframe if condition fits-CodePudding

I've searched some previous answers about this, but I want to know one thing more about it.

This is the test data:

df = pd.DataFrame({"a":[2,3,4,5,6,8],"b":[3,4,np.nan,6,111,22], "c" : [2,3,4,5, 777,1]})

And this is the one I'm working with to check if the value is outliers per column

def check_outliers(df, domain_list):
    
    outlier_column = []
    for domain in domain_list:
        Q1 = df[domain].quantile(0.25)
        Q3 = df[domain].quantile(0.75)
        IQR = Q3 - Q1
        min_v = Q1 - (1.5 * IQR)
        max_v = Q3   (1.5 * IQR)
    
        df["No_outliers_" domain] = np.where(np.isnan(df[domain]),"-",np.where((df[domain] >= min_v) & (df[domain] <= max_v), "O",domain))
        outlier_column.append("No_outliers_" domain)
    
    #df["No_outliers"] = np.where()

    df = df.drop(outlier_column, axis=1)
    return df

df = check_outliers(df,["a","b","c"])

I see that many recommends to use np.where or np.select for this, but what I want to know more is that dealing with multiple columns for the condition. I want to make "No_outliers" column which contains the column names if the value of the column is an outlier. Also "-" for marking np.nan value.

So it should have "No_outliers" : ["","","b","","b, c",""] since 111 in column "b" and 777 in column "c" would be an outlier in each column.

I think I can use .any() here but I couldn't. I must have used it in a wrong way.

Hope you can help me with this.

Thank you!

CodePudding user response：

We can combine multiple boolean masks using logical OR to create a resulting mask where a True values represent an outlier, then take the dot product of this mask with the columns and assign the result to No_outliers column

mask = df[domain].lt(min_v) | df[domain].gt(max_v) | df[domain].isna()
df['No_outliers'] = (mask @ (mask.columns   ', ')).str[:-2]

Result

print(df)

   a      b    c No_outliers
0  2    3.0    2            
1  3    4.0    3            
2  4    NaN    4           b
3  5    6.0    5            
4  6  111.0  777        b, c
5  8   22.0    1