Pandas filter/subset columns based on conditions

I have about 300 columns that are essentially encodings of categorical variables. I'd like to drop every column whose values sum to less than some threshold, say 3.

import pandas as pd

df = pd.DataFrame({
                   'id': [0, 1, 2, 3, 4, 5],
                   'col1': [0, 0, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 0, 1, 0],
                   'col3': [1, 1, 0, 1, 1, 0],
                   'col4': [0, 1, 1, 1, 1, 0]
                 })

df.sum(axis=0)

Expected output:

id  col3  col4
 0     1     0
 1     1     1
 2     0     1
 3     1     1
 4     1     1
 5     0     0

CodePudding user response:

You can use loc with a boolean mask on the columns:

N = 3
# keep columns whose sum is at least N, i.e. drop those with sum < N
out = df.loc[:, df.sum(axis=0) >= N]
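
The selection can be broken into its intermediate steps, which makes the mask easier to inspect. This is a minimal sketch on the sample frame from the question; the names col_sums, mask and kept are illustrative, and the comparison is written as >= so that a column summing to exactly N is kept, in line with dropping only sums below 3:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4, 5],
    'col1': [0, 0, 0, 0, 0, 1],
    'col2': [0, 1, 0, 0, 1, 0],
    'col3': [1, 1, 0, 1, 1, 0],
    'col4': [0, 1, 1, 1, 1, 0],
})

N = 3
col_sums = df.sum(axis=0)  # Series indexed by column name
mask = col_sums >= N       # boolean Series: True for columns to keep
kept = df.loc[:, mask]     # the mask is aligned on column labels

print(list(kept.columns))  # ['id', 'col3', 'col4']
```

Note that id survives here only because its own sum (15) happens to pass the threshold.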

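Note that the sum above also includes id. If id is not actually numeric, the comparison may fail or give misleading results, and if N is large enough, id itself would be dropped. In either case, move id into the index with set_index first, apply the boolean mask, then restore it with reset_index: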

df = df.set_index('id')
df = df.loc[:, df.sum(axis=0) >= N].reset_index()

Output:

   id  col3  col4
0   0     1     0
1   1     1     1
2   2     0     1
3   3     1     1
4   4     1     1
5   5     0     0
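
If changing the index is undesirable, the same idea can be sketched as a small helper that leaves id out of the sum and always keeps it. The function name drop_sparse_columns and its keep parameter are hypothetical, not part of pandas, and the string ids are made up to show the non-numeric case:

```python
import pandas as pd

def drop_sparse_columns(df, min_sum, keep=("id",)):
    """Hypothetical helper: drop columns whose sum is below min_sum,
    never dropping the columns listed in keep."""
    # Only the non-key columns take part in the sum
    candidates = [c for c in df.columns if c not in keep]
    sums = df[candidates].sum(axis=0)
    dense = set(sums[sums >= min_sum].index)
    # Rebuild the selection in the original column order
    return df[[c for c in df.columns if c in keep or c in dense]]

df = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e", "f"],  # non-numeric key (assumed for illustration)
    "col1": [0, 0, 0, 0, 0, 1],
    "col2": [0, 1, 0, 0, 1, 0],
    "col3": [1, 1, 0, 1, 1, 0],
    "col4": [0, 1, 1, 1, 1, 0],
})

out = drop_sparse_columns(df, min_sum=3)
print(list(out.columns))  # ['id', 'col3', 'col4']
```

Because id never enters the sum, the result does not depend on its dtype or on how large the threshold is.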