I have a dataframe and need to drop every column in which more than 55% of the values are repeated/duplicates.
Would anyone be able to assist me with how to do this?
CodePudding user response:
Let's use pd.Series.duplicated:
cols_to_keep = df.columns[df.apply(pd.Series.duplicated).mean() <= .55]
df[cols_to_keep]
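A minimal sketch of how this behaves on a toy frame (the data here is made up for illustration). pd.Series.duplicated marks every occurrence after the first, so the column-wise mean is the fraction of values that repeat an earlier one:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [1, 2, 3, 4]})
# 'a': duplicated -> [False, True, True, True], mean 0.75 > .55, dropped
# 'b': no repeats, mean 0.0 <= .55, kept
cols_to_keep = df.columns[df.apply(pd.Series.duplicated).mean() <= .55]
print(df[cols_to_keep])  # only column 'b' remains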
CodePudding user response:
If you're referring to columns in which the most common value is repeated in more than 55% of rows, here's a solution:
from collections import Counter
# assuming some DataFrame named df
bool_idx = df.apply(lambda x: max(Counter(x).values()) < len(x) * .55, axis=0)
df = df.loc[:, bool_idx]
If you're talking about non-unique values instead, this works:
bool_idx = df.apply(
    lambda x: sum(
        y for y in Counter(x).values() if y > 1
    ) < .55 * len(x),
    axis=0
)
df = df.loc[:, bool_idx]
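To make the difference between the two interpretations concrete, here is a small sketch on made-up data: in column 'a' the most common value covers only 50% of rows, yet 100% of its values are non-unique, so the two criteria disagree:

import pandas as pd
from collections import Counter

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})
# most common value's share stays under 55% in both columns
most_common_ok = df.apply(lambda x: max(Counter(x).values()) < len(x) * .55, axis=0)
# but every value in 'a' appears more than once, so 'a' fails this test
non_unique_ok = df.apply(
    lambda x: sum(y for y in Counter(x).values() if y > 1) < .55 * len(x),
    axis=0,
)
print(most_common_ok.tolist())  # [True, True]  -> keep both
print(non_unique_ok.tolist())   # [False, True] -> drop 'a'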
CodePudding user response:
Please try this:
Let df1 be your dataframe:
drop_columns = []
drop_threshold = 0.55  # define the percentage criterion for dropping a column

for cols in df1.columns:
    # count occurrences of each distinct value; naming the count column
    # explicitly keeps this working across pandas versions
    df_count = df1[cols].value_counts().reset_index(name='count')
    df_count['drop_percentage'] = df_count['count'] / df1.shape[0]
    df_count['drop_criterion'] = df_count['drop_percentage'] > drop_threshold
    if True in df_count.drop_criterion.values:
        drop_columns.append(cols)

df1 = df1.drop(columns=drop_columns)
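For what it's worth, the same per-column check can be condensed with value_counts(normalize=True); this is an equivalent sketch, not part of the answer above:

drop_columns = [
    col for col in df1.columns
    if df1[col].value_counts(normalize=True).max() > drop_threshold
]
df1 = df1.drop(columns=drop_columns)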