given a following dataframe
A B C
0 1 Nan Nan
1 Nan 2 Nan
2 Nan Nan 3
I want to get a column that gets all the column names of non null rows
expected result
A B C D
0 1 2 Nan [A,B]
1 Nan 2 4 [B,C]
2 4 Nan 3 [A,C]
Would there be any solution using python?
CodePudding user response:
Use:
df['D'] = df.apply(lambda x: x.dropna().index.to_list(), axis=1)
CodePudding user response:
Use DataFrame.dot
with columns names with separator, remove last separator and splitting:
df['new'] = df.notna().dot(df.columns ',').str[:-1].str.split(',')
print (df)
A B C new
0 1.0 2.0 NaN [A, B]
1 NaN 2.0 4.0 [B, C]
2 4.0 NaN 3.0 [A, C]
If performance is important use list comprehension with mask converted to numpy:
cols = df.columns.to_numpy()
df['new'] = [cols[x] for x in df.notna().to_numpy()]
Solution from comments is fine, if need joined values by ,
:
df['new'] = df.notna().dot(df.columns ',').str.rstrip(',')
print (df)
A B C new
0 1.0 2.0 NaN A,B
1 NaN 2.0 4.0 B,C
2 4.0 NaN 3.0 A,C
Performance:
#30k rows
df = pd.concat([df] * 10000, ignore_index=True)
#https://stackoverflow.com/a/69704953/2901002
In [78]: %timeit df['D'] = df.apply(lambda x: x.dropna().index.to_list(), axis=1)
3.24 s ± 87.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [79]: %timeit df['new'] = df.notna().dot(df.columns ',').str[:-1].str.split(',')
45.7 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %%timeit
...: cols = df.columns.to_numpy()
...: df['new'] = [cols[x] for x in df.notna().to_numpy()]
...:
32.2 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)