pandas determine column labels that contribute to non-zero values in each row-CodePudding

For the dataframe shown below, I want to do the following :

In every row I want to determine the number of nonzero values, which I have been able to do using np.where.
I then want to determine the column labels that contribute to these non-zero values in each row. Any help will be really appreciated. Thanks in advance

CodePudding user response：

To count your non-zeros in each row you can use nonzero_count from numpy package and perform the operation row-wise:

import numpy as np
df['non_zero_count'] = np.count_nonzero(df,axis=1)

>>> df

      1     2     3     4     5     6      7  non_zero_count
0  8122     0     0     0     0     0      0               1
1     0     0     0  3292     0  1313      0               2
2     0  8675     0     0     0     0      0               1
3     0     0  1910     0   213     0  12312               3
4     0     0     0     0  4010     0      0               1
5     0     0     0     0     0  1002      0               1
6     0     0     0     0     0     0   1278               1

Then you can get the columns where a row contains a non-zero value with apply, so be cautious here if you have a big dataset at hand:

df['non_zero_label'] = df.drop('non_zero_count',axis=1)\
    .apply(lambda r: r.index[r.ne(0)].to_list(), axis=1)
df


>>> df

      1     2     3     4     5     6      7  non_zero_count non_zero_label
0  8122     0     0     0     0     0      0               1            [1]
1     0     0     0  3292     0  1313      0               2         [4, 6]
2     0  8675     0     0     0     0      0               1            [2]
3     0     0  1910     0   213     0  12312               3      [3, 5, 7]
4     0     0     0     0  4010     0      0               1            [5]
5     0     0     0     0     0  1002      0               1            [6]
6     0     0     0     0     0     0   1278               1            [7]

CodePudding user response：

Considering that the dataframe name is df:

df[df != 0].stack().reset_index(level=0, drop=True)

This returns a series with the column labels as index and the non-zero values, which for your example the output is:

1    8122.0
4    3292.0
2    8675.0
3    1910.0
5    4010.0
6    1002.0
7    1278.0

CodePudding user response：

You could do something like this:

df = pd.DataFrame({'x' : [ 0,0,9],'y' : [-1,3,0],'z' : [0,1.1,3]},index=['a','b','c'])
df["non_zero_columns"] = df.where(df == 0, 
                                  other=df.apply(lambda x: x.name), 
                                  axis=1).where(df != 0, 
                                                other="").apply(lambda row: ''.join(row.values), axis=1)

which gives:

x  y    z non_zero_columns  non zero
a  0 -1  0.0                y         1
b  0  3  1.1               yz         2
c  9  0  3.0               xz         2

Apply that to your dataframe.

Another solution for the same dataframe would be:

cols = df.columns
df = df.apply(lambda x: x != 0)
df['Non_zero_columns'] = df.apply(lambda x: list(cols[x.values]), axis=1)
df['Length'] = df['Non_zero_columns'].str.len()

which produces:

  x      y      z Non_zero_columns  Length
a  False   True  False              [y]       1
b  False   True   True           [y, z]       2
c   True  False   True           [x, z]       2