I want a list of column name based on condition in a separate column in sorted form-CodePudding

I have a dataframe with x number of columns, for example-

|a   |b   |c   |d   |e   |f   |g   |h   |i   |median|
|----|----|----|----|----|----|----|----|----|------|
|0.10|0.08|0.29|0.13|0.03|0.29|0.06|0.03|0.06|0.08  |
|0.82|0.18|0.14|0.12|0.08|0.12|0.08|0.06|0.10|0.12  |
|0.57|0.17|0.13|0.14|0.05|0.16|0.05|0.04|0.09|0.13  |

I want another column named "possible_labels" where I have a list of all possible column names in sorted way where the condition fulfils dataframe[column_name] >= dataframe['median']

for example-

|a   |b   |c   |d   |e   |f   |g   |h   |i   |median|possible_labels|
|----|----|----|----|----|----|----|----|----|------|-------------|
|0.10|0.08|0.29|0.13|0.03|0.29|0.06|0.03|0.06|0.08  |[c,f,d,a,b]  |
|0.82|0.18|0.14|0.12|0.08|0.12|0.08|0.06|0.10|0.12  |[a,b,c,d,f]  |
|0.57|0.17|0.13|0.14|0.05|0.16|0.05|0.04|0.09|0.13  |[a,b,f,d,c]  |

in "possible_labels" columns, top 5 column names are not there, they are the column name which satisfy the condition

dataframe[column_name] >= dataframe['median']

where:

column_name = ['a','b','c','d','e','f','g','h','i']

CodePudding user response：

Use custom function with sorting and filtering:

def f(x):
    y = x[:-1].sort_values(ascending=False)
    return y.index[y >= x['median']].tolist()

df['possible_labels'] = df.apply(f, axis=1)
print (df)

      a     b     c     d     e     f     g     h     i  median  \
0  0.10  0.08  0.29  0.13  0.03  0.29  0.06  0.03  0.06    0.08   
1  0.82  0.18  0.14  0.12  0.08  0.12  0.08  0.06  0.10    0.12   
2  0.57  0.17  0.13  0.14  0.05  0.16  0.05  0.04  0.09    0.13   

   possible_labels  
0  [c, f, d, a, b]  
1  [a, b, c, d, f]  
2  [a, b, f, d, c]

Another solution with pandas functions - use DataFrame.melt for unpivot, filter by DataFrame.query, sorting by DataFrame.sort_values and last aggregate list by index:

column_name = ['a','b','c','d','e','f','g','h','i']

df['possible_labels'] =(df.melt(id_vars=['median'],value_vars=value_vars,ignore_index=False)
                         .query('value >= median')
                         .sort_values('value',ascending=False)
                         .groupby(level=0)['variable']
                         .agg(list))

print (df)
      a     b     c     d     e     f     g     h     i  median  \
0  0.10  0.08  0.29  0.13  0.03  0.29  0.06  0.03  0.06    0.08   
1  0.82  0.18  0.14  0.12  0.08  0.12  0.08  0.06  0.10    0.12   
2  0.57  0.17  0.13  0.14  0.05  0.16  0.05  0.04  0.09    0.13   

   possible_labels  
0  [c, f, d, a, b]  
1  [a, b, c, d, f]  
2  [a, b, f, d, c]