Dropping a column if more than half of its values are the same

I have a pandas DataFrame (the original post showed it as an image).

I want to delete any column in which more than half of the values are the same, and I don't know how to do this.

I tried using pandas.Series.value_counts, but with no luck.

CodePudding user response：

You can iterate over the columns, count the occurrences of each value with value_counts as you tried, and check whether the most common value accounts for more than 50% of the column's data.

n = len(df)
cols_to_drop = []
for col in df.columns:
    # Count of the most common value (value_counts sorts in descending order)
    max_occ = df[col].value_counts().iloc[0]
    if 2 * max_occ > n:  # More than half of the rows share this value
        cols_to_drop.append(col)
df = df.drop(cols_to_drop, axis=1)
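
As a variation on the loop above, here is a minimal sketch that wraps the check in a function and uses value_counts(normalize=True) so the comparison is on fractions rather than raw counts. The function name drop_dominated_columns, the threshold parameter, and the sample data are assumptions made for illustration. Passing dropna=False is also a choice made here: by default value_counts ignores NaN, so a column that is mostly NaN would otherwise not be caught.

```python
import pandas as pd


def drop_dominated_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose most common value covers more than `threshold` of the rows.

    Hypothetical helper, not part of pandas.
    """
    cols_to_drop = []
    for col in df.columns:
        # normalize=True returns fractions; dropna=False counts NaN as a value too
        shares = df[col].value_counts(normalize=True, dropna=False)
        # shares is sorted in descending order, so the first entry is the largest share
        if not shares.empty and shares.iloc[0] > threshold:
            cols_to_drop.append(col)
    return df.drop(columns=cols_to_drop)


# Hypothetical sample data
sample = pd.DataFrame({
    "a": [1, 1, 1, 2],              # 1 covers 75% of the rows
    "b": [1, 2, 3, 4],              # all values distinct
    "c": [None, None, None, 5.0],   # NaN covers 75% of the rows
})
result = drop_dominated_columns(sample)
print(list(result.columns))  # ['b']
```

Drop the dropna=False argument if missing values should not count as a "same value".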

CodePudding user response：

You can apply value_counts to each column and take the first value, which is the count of the most common value:

count = df.apply(lambda s: s.value_counts().iat[0])

col1    4
col2    2
col3    6
dtype: int64

Then turn this into a boolean mask that is True where the greatest count is at most half of len(df), and use it to select the columns to keep:

count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)]  # use 'lt' instead to also drop columns where the top value is exactly half

Output:

   col2
0     0
1     1
2     0
3     1
4     2
5     3

Input used:

df = pd.DataFrame({'col1': [0,1,0,0,0,1],
                   'col2': [0,1,0,1,2,3],
                   'col3': [0,0,0,0,0,0],
                  })
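
To make the difference between le and lt concrete, the sketch below reuses the input above and adds one extra column, col4, whose most common value covers exactly half of the rows. col4 is an assumption added for illustration and is not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0],
                   'col4': [0, 0, 0, 1, 2, 3],  # hypothetical: 0 fills exactly half
                   })

# Count of the most common value in each column
count = df.apply(lambda s: s.value_counts().iat[0])
half = len(df) / 2

kept_le = df.loc[:, count.le(half)]  # keeps columns at exactly half
kept_lt = df.loc[:, count.lt(half)]  # drops columns at exactly half

print(list(kept_le.columns))  # ['col2', 'col4']
print(list(kept_lt.columns))  # ['col2']
```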

CodePudding user response：

Boolean slicing with a comprehension:

df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.items()  # DataFrame.iteritems was removed in pandas 2.0
]]

   col2
0     0
1     1
2     0
3     1
4     2
5     3
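
The integer-division test in the comprehension can be read as "the row count is at least twice the largest count". The sketch below, which assumes the same input data, writes that condition with a multiplication instead and checks that both forms select the same columns. The variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0],
                   })
n = len(df)

# Floor-division form: keep a column if n // max_count >= 2
mask_div = [n // s.value_counts().max() >= 2 for _, s in df.items()]

# Equivalent multiplication form: keep a column if 2 * max_count <= n
mask_mul = [2 * s.value_counts().max() <= n for _, s in df.items()]

result = df.loc[:, mask_mul]
print(mask_div == mask_mul)   # True
print(list(result.columns))   # ['col2']
```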

Credit to @mozway for input data.