Optimization of pandas replace-CodePudding

I am working with a large pandas.DataFrame object of shape (4879301, 214). I am trying to accomplish:

Loop through a specified list of columns from the dataframe
For all values in that dataframe, if a value appears 10% or less of the time, change it to "column_Other"

I have accomplished this with the following code:

thresh = int(df[col].shape[0]*0.10)
for col in col_list:
  for val in df[col].unique():
    if df.loc[df[col]==val].shape[0] <= thresh:
      df[col].replace(val, '{}_Other'.format(col), inplace=True)

However, this has been running for hours and we are barely processed, since some columns have ~50K unique values. Is there anyway to optimize this?

CodePudding user response：

Maybe you can use value_counts with the parameter normalize=True, then keep the index of the result where the percentage is below your threshold. Use loc and isin to replace the wanted values. The interest is that you replace all the values present in your columns less than the threshold at once, not each one with a different replace operation

# dummy data
np.random.seed(10)

n = 100
df = pd.DataFrame({
    0:np.random.choice(list('abcd'), size=n, p=[0.32,0.32,0.32,0.04]),
    1:np.random.choice(list('efghi'), size=n, p=[0.32,0.32,0.23,0.08, 0.05]),
    2:np.random.choice(list('jkl'), size=n, p=[0.33,0.33,0.34])
})
print(df.head())
#    0  1  2
# 0  c  f  l
# 1  a  g  l
# 2  b  e  k
# 3  c  f  l
# 4  b  g  k

Solution using a simple loop for

col_list = [0,1,2]
thresh = 0.10
for col in col_list:
    s = df[col].value_counts(normalize=True)
    print(s.index[s<0.1]) # to remove once sure of the result
    df.loc[df[col].isin(s.index[s<thresh]), col] = '{}_Other'.format(col)

# result of the print
# Index(['d'], dtype='object')
# Index(['h', 'i'], dtype='object')
# Index([], dtype='object')