I have a table with 3 columns - a, b, and c. How do I add a 4th column (d) which stores the column names of dates in sorted order (across columns) while ignoring the NaT cells - example shown below:
a b c d
|:------------:|:------------:|:-------------:|:-----------:|
| 2022-01-15 | 2022-01-09 | 2022-01-13 | [b, c, a] |
| 2022-02-11 | NaT | 2022-02-20 | [a, c] |
| 2022-02-15 | 2022-02-14 | NaT | [b, a] |
For the first row, value in column d is [b,c,a] since the value in b < value in c < value in a
CodePudding user response:
Use np.argsort
for positions columns with remove missi values and convert to lists in lambda function:
df['d'] = df.apply(lambda x: list(df.columns[np.argsort(x.dropna())]), axis=1)
Or sorting per rows, remove NaNs and convert index to lists:
df['d'] = df.apply(lambda x: x.sort_values().dropna().index.tolist(), 1)
Or reshape by DataFrame.stack
for remove missing values, sorting by 2 columns and aggregate list
s:
df['d'] = (df.stack()
.rename_axis(['idx','c'])
.reset_index(name='val')
.sort_values(['idx', 'val'])
.groupby('idx')['c']
.agg(list))
print (df)
a b c d
0 2022-01-15 2022-01-09 2022-01-13 [b, c, a]
1 2022-02-11 NaT 2022-02-20 [a, c]
2 2022-02-15 2022-02-14 NaT [b, a]