I have a dataframe that has several features, and a feature can contain NaN values. E.g.
feature1  feature2  feature3  feature4
      10       NaN         5         2
       2         1         3         1
     NaN         2         4       NaN
Note: the columns can also contain strings.
How could we get a list/array per row that contains the column names of the non-NaN values?
Thus the result array of my example would be:
res = array([['feature1', 'feature3', 'feature4'],
             ['feature1', 'feature2', 'feature3', 'feature4'],
             ['feature2', 'feature3']])
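For reference, the example frame can be built like this (a minimal sketch assuming numeric columns):
import numpy as np
import pandas as pd

# example frame from the question; NaN marks a missing value
df = pd.DataFrame({
    'feature1': [10, 2, np.nan],
    'feature2': [np.nan, 1, 2],
    'feature3': [5, 3, 4],
    'feature4': [2, 1, np.nan],
})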
CodePudding user response:
You can stack to keep only the non-NaN values, and aggregate as list with groupby.agg:
out = df.stack().reset_index().groupby('level_0')['level_1'].agg(list)
Output as Series:
level_0
0 [feature1, feature3, feature4]
1 [feature1, feature2, feature3, feature4]
2 [feature2, feature3]
Name: level_1, dtype: object
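The level_0/level_1 names come from resetting the index: stack drops the NaN cells (with its default dropna behaviour), leaving a Series indexed by (row, column) pairs, and reset_index turns that MultiIndex into the columns level_0 and level_1. The intermediate frame looks roughly like this (a sketch, assuming the numeric example frame above):
df.stack().reset_index()
#    level_0   level_1     0
# 0        0  feature1  10.0
# 1        0  feature3   5.0
# 2        0  feature4   2.0
# 3        1  feature1   2.0
# ...
# level_0 is the original row label, level_1 the column name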
As lists:
out = (df.stack().reset_index().groupby('level_0')['level_1']
         .agg(list).to_numpy().tolist()
       )
Output:
[['feature1', 'feature3', 'feature4'],
['feature1', 'feature2', 'feature3', 'feature4'],
['feature2', 'feature3']]
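If you need an actual numpy array, as asked in the question, note that the rows have different lengths, so it can only be an object array. A minimal sketch:
# ragged rows can only be stored as a 1-D object array of lists
res = np.array(out, dtype=object)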
CodePudding user response:
To improve performance, use a list comprehension, converting the values to a numpy array first:
c = df.columns.to_numpy()
res = [c[x].tolist() for x in df.notna().to_numpy()]
print(res)
[['feature1', 'feature3', 'feature4'],
['feature1', 'feature2', 'feature3', 'feature4'],
['feature2', 'feature3']]
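This works because df.notna().to_numpy() gives one boolean mask per row, and indexing the array of column names c with that mask keeps only the names where the mask is True. A sketch for the first row of the example frame:
mask = df.notna().to_numpy()[0]  # boolean mask for row 0: True, False, True, True
c[mask]                          # -> ['feature1', 'feature3', 'feature4']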
Benchmark on the sample frame repeated 1000 times:
df = pd.concat([df] * 1000, ignore_index=True)
In [28]: %%timeit
...: out = (df.stack().reset_index().groupby('level_0')['level_1']
...: .agg(list).to_numpy().tolist()
...: )
...:
...:
96.5 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [29]: %%timeit
...: c = df.columns.to_numpy()
...: res = [c[x].tolist() for x in df.notna().to_numpy()]
...:
3.36 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)