Dropping column if more than half of the values are same - Python

Time:04-06

I have a pandas DataFrame (shown in the original post as an image, not reproduced here).

I want to delete any column in which more than half of the values are the same, and I don't know how to do this.

I tried using pandas.Series.value_counts, but with no luck.

CodePudding user response:

You can iterate over the columns, count the occurrences of each value with value_counts as you tried, and check whether the most common value accounts for more than 50% of the column's rows.

n = len(df)
cols_to_drop = []
for e in df.columns:
    max_occ = df[e].value_counts().iloc[0]  # occurrences of the most common value in column e
    if 2 * max_occ > n:  # more than half of the rows in the dataset
        cols_to_drop.append(e)
df = df.drop(cols_to_drop, axis=1)
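The loop above can also be wrapped in a reusable function. The following is only a minimal sketch: the name drop_dominant_columns, the threshold parameter, and the sample data are assumptions made for illustration and do not come from the original answer.

```python
import pandas as pd

def drop_dominant_columns(df, threshold=0.5):
    """Hypothetical helper: return df without the columns whose most
    common value fills more than `threshold` of the rows."""
    n = len(df)
    cols_to_drop = []
    for col in df.columns:
        counts = df[col].value_counts()  # NaN is excluded by default
        # counts is sorted in descending order, so iloc[0] is the top count
        if not counts.empty and counts.iloc[0] > threshold * n:
            cols_to_drop.append(col)
    return df.drop(columns=cols_to_drop)

# Illustrative data, not the asker's DataFrame
sample = pd.DataFrame({
    "a": [1, 1, 1, 2],  # 1 fills 3 of 4 rows: dropped
    "b": [1, 2, 3, 4],  # all values distinct: kept
    "c": [5, 5, 6, 7],  # 5 fills exactly half the rows: kept
})
result = drop_dominant_columns(sample)
print(list(result.columns))  # ['b', 'c']
```

Returning a new DataFrame rather than reassigning df inside the loop leaves the original data untouched, which may be easier to debug.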

CodePudding user response:

You can apply value_counts to each column and take the first value, which is the count of the most common value:

count = df.apply(lambda s: s.value_counts().iat[0])

output:

col1    4
col2    2
col3    6
dtype: int64

Then turn this into a boolean mask that is True where the greatest count is at most half of len(df), and use it to select the columns to keep:

count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)]  # use 'lt' instead to also drop columns where the count is exactly half

output:

   col2
0     0
1     1
2     0
3     1
4     2
5     3

Used input:

df = pd.DataFrame({'col1': [0,1,0,0,0,1],
                   'col2': [0,1,0,1,2,3],
                   'col3': [0,0,0,0,0,0],
                  })
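The difference between 'le' and 'lt' mentioned in the code comment only shows up when a column's most common value fills exactly half of the rows. A small sketch of that edge case follows; the frame df_half and the names top_count, kept_le and kept_lt are made up for illustration.

```python
import pandas as pd

# Illustrative data chosen so that column "x" sits exactly on the boundary
df_half = pd.DataFrame({
    "x": [0, 0, 1, 2],  # most common value fills exactly half the rows
    "y": [0, 0, 0, 1],  # most common value fills more than half
    "z": [0, 1, 2, 3],  # all values distinct
})

# Count of the most common value in each column
top_count = df_half.apply(lambda s: s.value_counts().iat[0])
half = len(df_half) / 2

kept_le = df_half.loc[:, top_count.le(half)]  # keeps a column at exactly half
kept_lt = df_half.loc[:, top_count.lt(half)]  # drops a column at exactly half

print(list(kept_le.columns))  # ['x', 'z']
print(list(kept_lt.columns))  # ['z']
```

Since the question asks to drop columns where more than half of the values are the same, 'le' appears to be the closer match; 'lt' is the stricter variant.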

CodePudding user response:

Boolean slicing with a comprehension

df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.items()  # iteritems() was removed in pandas 2.0
]]

   col2
0     0
1     1
2     0
3     1
4     2
5     3
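The floor-division test used in this answer keeps a column exactly when its greatest count is at most half the number of rows, so it should agree with the 'le' comparison from the previous answer. A quick brute-force check of that equivalence, where n stands for the row count and m for the greatest count (both names are only illustrative):

```python
# Compare n // m >= 2 with m <= n / 2 for every valid pair (1 <= m <= n).
mismatches = [
    (n, m)
    for n in range(1, 51)
    for m in range(1, n + 1)
    if (n // m >= 2) != (m <= n / 2)
]
print(mismatches)  # []
```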

Credit to @mozway for input data.
