I have about 300 columns that are basically encodings of categorical variables. I'd like to drop the columns whose values sum to less than some threshold, say 3.
import pandas as pd
df = pd.DataFrame({
'id': [0, 1, 2, 3, 4, 5],
'col1': [0, 0, 0, 0, 0, 1],
'col2': [0, 1, 0, 0, 1, 0],
'col3': [1, 1, 0, 1, 1, 0],
'col4': [0, 1, 1, 1, 1, 0]
})
df.sum(axis=0)
Expected output:
id col3 col4
0 1 0
1 1 1
2 0 1
3 1 1
4 1 1
5 0 0
Answer:
You can use loc with boolean indexing on the columns:
N = 3
out = df.loc[:, df.sum(axis=0) >= N]  # keep columns whose sum is at least N, i.e. drop sums < N
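To see what the mask is doing, it can help to build it in separate steps. This is a minimal sketch that reuses the sample frame from the question and keeps columns whose sum is at least N, as the question asks; the names sums and mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4, 5],
    'col1': [0, 0, 0, 0, 0, 1],
    'col2': [0, 1, 0, 0, 1, 0],
    'col3': [1, 1, 0, 1, 1, 0],
    'col4': [0, 1, 1, 1, 1, 0]
})

N = 3
sums = df.sum(axis=0)   # Series of per-column sums, indexed by column name
mask = sums >= N        # True for every column that should be kept
out = df.loc[:, mask]   # select all rows, only the columns where mask is True

print(sums.to_dict())     # {'id': 15, 'col1': 1, 'col2': 2, 'col3': 4, 'col4': 4}
print(list(out.columns))  # ['id', 'col3', 'col4']
```

Note that id survives here only because its own sum (15) happens to be at least N.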
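If id is not actually numeric, or if N can be larger than the sum of the id column, the mask above can misbehave (id may be dropped, or the comparison may fail). In that case, move id into the index with set_index first, apply the boolean indexing, then restore it as a column with reset_index: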
df = df.set_index('id')
df = df.loc[:, df.sum(axis=0) >= N].reset_index()
Output:
id col3 col4
0 0 1 0
1 1 1 1
2 2 0 1
3 3 1 1
4 4 1 1
5 5 0 0
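If you would rather not touch the index, one alternative is to compute the sums over the encoded columns only and drop the ones below the threshold. This is a sketch under the assumption that id is the only non-encoded column; value_cols and to_drop are hypothetical names introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4, 5],
    'col1': [0, 0, 0, 0, 0, 1],
    'col2': [0, 1, 0, 0, 1, 0],
    'col3': [1, 1, 0, 1, 1, 0],
    'col4': [0, 1, 1, 1, 1, 0]
})

N = 3
value_cols = df.columns.drop('id')    # every column except id (assumed to be the only key column)
sums = df[value_cols].sum(axis=0)     # sums over the encoded columns only
to_drop = sums[sums < N].index        # labels of columns whose sum is below N
out = df.drop(columns=to_drop)        # id is never considered for dropping

print(list(out.columns))  # ['id', 'col3', 'col4']
```

Because id is excluded from the sum, it is kept whatever its dtype or the value of N.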