Home > Enterprise >  How to get column names of all nonnull row values?
How to get column names of all nonnull row values?

Time:10-25

given a following dataframe

    A   B   C
0   1   Nan Nan
1  Nan  2   Nan
2  Nan  Nan 3

I want to get a column that gets all the column names of non null rows

expected result

   A   B   C   D
0   1   2 Nan   [A,B]
1  Nan  2   4   [B,C]
2   4  Nan 3    [A,C]

Would there be any solution using python?

CodePudding user response:

Use:

df['D'] = df.apply(lambda x: x.dropna().index.to_list(), axis=1)

CodePudding user response:

Use DataFrame.dot with columns names with separator, remove last separator and splitting:

df['new'] = df.notna().dot(df.columns ',').str[:-1].str.split(',')
print (df)
     A    B    C     new
0  1.0  2.0  NaN  [A, B]
1  NaN  2.0  4.0  [B, C]
2  4.0  NaN  3.0  [A, C]

If performance is important use list comprehension with mask converted to numpy:

cols = df.columns.to_numpy()
df['new'] = [cols[x] for x in df.notna().to_numpy()]

Solution from comments is fine, if need joined values by ,:

df['new'] = df.notna().dot(df.columns ',').str.rstrip(',') 
print (df)
     A    B    C  new
0  1.0  2.0  NaN  A,B
1  NaN  2.0  4.0  B,C
2  4.0  NaN  3.0  A,C

Performance:

#30k rows
df = pd.concat([df] * 10000, ignore_index=True)

#https://stackoverflow.com/a/69704953/2901002
In [78]: %timeit df['D'] = df.apply(lambda x: x.dropna().index.to_list(), axis=1)
3.24 s ± 87.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: %timeit df['new'] = df.notna().dot(df.columns ',').str[:-1].str.split(',')
45.7 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [80]: %%timeit
    ...: cols = df.columns.to_numpy()
    ...: df['new'] = [cols[x] for x in df.notna().to_numpy()]
    ...: 
32.2 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • Related