I have a large DataFrame with 500 columns, 300 of which (col1, col2, ..., col300) appear as follows:
idx col1 col2
a -1 4
b 2 1
c -1 -1
I want to get the following for those 300 columns (the other 200 columns are variables I am not interested in):
idx col1 col2 numPos
a -1 4 1
b 2 1 2
c -1 -1 0
where for each row I want the number of positive values. I don't want to use the apply method, as there are about 2 million rows in the DataFrame. Is there a Pythonic way to do this?
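For reference, here is a small snippet that builds the sample data above (treating idx as a regular column rather than the index):

import pandas as pd

df = pd.DataFrame({'idx': ['a', 'b', 'c'],
                   'col1': [-1, 2, -1],
                   'col2': [4, 1, -1]})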
CodePudding user response:
You could select the columns, use gt (which creates a boolean DataFrame that is True wherever a value is positive) and then sum along axis=1:
df['numPos'] = df[['col1','col2']].gt(0).sum(axis=1)
You could also select the columns by name with filter, like:
df['numPos'] = df.filter(like='col').gt(0).sum(axis=1)
Output:
idx col1 col2 numPos
0 a -1 4 1
1 b 2 1 2
2 c -1 -1 0
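Note that filter(like='col') matches any column whose name merely contains 'col', so if some of the other 200 columns happen to contain that substring they would be counted too. A stricter variant, assuming the names really follow the col1..col300 pattern, is a regex filter:

df['numPos'] = df.filter(regex=r'^col\d+$').gt(0).sum(axis=1)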
CodePudding user response:
A simple way out is to exclude the object columns, check what is greater than 0 in the df, and sum along the row axis:
df['numPos'] = df.select_dtypes(exclude='object').gt(0).sum(axis=1)
idx col1 col2 numPos
0 a -1 4 1
1 b 2 1 2
2 c -1 -1 0
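One caveat: select_dtypes(exclude='object') keeps every non-object column, so if some of the 200 other columns are numeric they will be included in the count as well. A sketch that narrows the selection back down to the col* columns first (column-name pattern assumed):

df['numPos'] = df.select_dtypes(exclude='object').filter(like='col').gt(0).sum(axis=1)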
CodePudding user response:
df['numPos'] = (df[cols] > 0).sum(axis=1)
where cols is a list of the 300 column names. If the 300 columns are consecutive, then in place of df[cols] you can use df.iloc[:, start_offset:start_offset + 300], where start_offset is the positional index of the first of those columns.
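A sketch of both variants, assuming the columns are named col1 through col300 and sit next to each other in the frame:

cols = [f'col{i}' for i in range(1, 301)]
df['numPos'] = (df[cols] > 0).sum(axis=1)

# equivalent positional slice, looking up where the block of columns starts
start_offset = df.columns.get_loc('col1')
df['numPos'] = (df.iloc[:, start_offset:start_offset + 300] > 0).sum(axis=1)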