I have a dataframe and need to drop every column in which more than 55% of the values are repeated/duplicates.
Would anyone be able to assist me with how to do this?
CodePudding user response:
Let's use pd.Series.duplicated:
cols_to_keep = df.columns[df.apply(pd.Series.duplicated).mean() <= .55]
df[cols_to_keep]
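A minimal sketch of how this behaves on a toy frame (the data here is made up for illustration). pd.Series.duplicated marks every occurrence after the first, so the column-wise mean is the fraction of values that repeat an earlier one:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [1, 2, 3, 4]})
# 'a': duplicated -> [False, True, True, True], mean 0.75 > .55, dropped
# 'b': no repeats, mean 0.0 <= .55, kept
cols_to_keep = df.columns[df.apply(pd.Series.duplicated).mean() <= .55]
print(df[cols_to_keep])  # only column 'b' remains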
CodePudding user response:
If you're referring to columns in which the most common value is repeated in more than 55% of rows, here's a solution:
from collections import Counter
# assuming some DataFrame named df
bool_idx = df.apply(lambda x: max(Counter(x).values()) < len(x) * .55, axis=0)
df = df.loc[:, bool_idx]
If you're talking about non-unique values instead, this works:
bool_idx = df.apply(
    lambda x: sum(
        y for y in Counter(x).values() if y > 1
    ) < .55 * len(x),
    axis=0
)
df = df.loc[:, bool_idx]
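To make the difference between the two interpretations concrete, here is a small sketch on made-up data: in column 'a' the most common value covers only 50% of rows, yet 100% of its values are non-unique, so the two criteria disagree:

import pandas as pd
from collections import Counter

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})
# most common value's share stays under 55% in both columns
most_common_ok = df.apply(lambda x: max(Counter(x).values()) < len(x) * .55, axis=0)
# but every value in 'a' appears more than once, so 'a' fails this test
non_unique_ok = df.apply(
    lambda x: sum(y for y in Counter(x).values() if y > 1) < .55 * len(x),
    axis=0,
)
print(most_common_ok.tolist())  # [True, True]  -> keep both
print(non_unique_ok.tolist())   # [False, True] -> drop 'a'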
CodePudding user response:
Please try this:
Let df1 be your dataframe:
drop_columns = []
drop_threshold = 0.55  # define the percentage criterion for dropping a column

for cols in df1.columns:
    # count occurrences of each distinct value; naming the count column
    # explicitly keeps this working across pandas versions
    df_count = df1[cols].value_counts().reset_index(name='count')
    df_count['drop_percentage'] = df_count['count'] / df1.shape[0]
    df_count['drop_criterion'] = df_count['drop_percentage'] > drop_threshold
    if True in df_count.drop_criterion.values:
        drop_columns.append(cols)

df1 = df1.drop(columns=drop_columns)
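For what it's worth, the same per-column check can be condensed with value_counts(normalize=True); this is an equivalent sketch, not part of the answer above:

drop_columns = [
    col for col in df1.columns
    if df1[col].value_counts(normalize=True).max() > drop_threshold
]
df1 = df1.drop(columns=drop_columns)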