Home > Enterprise >  Sort data across columns per row and return column names in new variable
Sort data across columns per row and return column names in new variable

Time:02-19

I have a table with 3 columns - a, b, and c. How do I add a 4th column (d) which stores the column names of dates in sorted order (across columns) while ignoring the NaT cells - example shown below:

       a              b               c              d
|:------------:|:------------:|:-------------:|:-----------:|
|  2022-01-15  |  2022-01-09  |   2022-01-13  |  [b, c, a]  |
|  2022-02-11  |      NaT     |   2022-02-20  |   [a, c]    |
|  2022-02-15  |  2022-02-14  |       NaT     |   [b, a]    |

For the first row, value in column d is [b,c,a] since the value in b < value in c < value in a

CodePudding user response:

Use np.argsort for positions columns with remove missi values and convert to lists in lambda function:

df['d'] = df.apply(lambda x: list(df.columns[np.argsort(x.dropna())]), axis=1)

Or sorting per rows, remove NaNs and convert index to lists:

df['d'] = df.apply(lambda x: x.sort_values().dropna().index.tolist(), 1)

Or reshape by DataFrame.stack for remove missing values, sorting by 2 columns and aggregate lists:

df['d'] = (df.stack()
             .rename_axis(['idx','c'])
             .reset_index(name='val')
             .sort_values(['idx', 'val'])
             .groupby('idx')['c']
             .agg(list))

print (df)
           a          b          c          d
0 2022-01-15 2022-01-09 2022-01-13  [b, c, a]
1 2022-02-11        NaT 2022-02-20     [a, c]
2 2022-02-15 2022-02-14        NaT     [b, a]
  • Related