I am working with a large pandas.DataFrame
object of shape (4879301, 214)
. I am trying to accomplish:
- Loop through a specified list of columns from the dataframe
- For all values in that dataframe, if a value appears 10% or less of the time, change it to
"column_Other"
I have accomplished this with the following code:
thresh = int(df[col].shape[0]*0.10)
for col in col_list:
for val in df[col].unique():
if df.loc[df[col]==val].shape[0] <= thresh:
df[col].replace(val, '{}_Other'.format(col), inplace=True)
However, this has been running for hours and we are barely processed, since some columns have ~50K unique values. Is there anyway to optimize this?
CodePudding user response:
Maybe you can use value_counts
with the parameter normalize=True
, then keep the index of the result where the percentage is below your threshold. Use loc
and isin
to replace the wanted values. The interest is that you replace all the values present in your columns less than the threshold at once, not each one with a different replace
operation
# dummy data
np.random.seed(10)
n = 100
df = pd.DataFrame({
0:np.random.choice(list('abcd'), size=n, p=[0.32,0.32,0.32,0.04]),
1:np.random.choice(list('efghi'), size=n, p=[0.32,0.32,0.23,0.08, 0.05]),
2:np.random.choice(list('jkl'), size=n, p=[0.33,0.33,0.34])
})
print(df.head())
# 0 1 2
# 0 c f l
# 1 a g l
# 2 b e k
# 3 c f l
# 4 b g k
Solution using a simple loop for
col_list = [0,1,2]
thresh = 0.10
for col in col_list:
s = df[col].value_counts(normalize=True)
print(s.index[s<0.1]) # to remove once sure of the result
df.loc[df[col].isin(s.index[s<thresh]), col] = '{}_Other'.format(col)
# result of the print
# Index(['d'], dtype='object')
# Index(['h', 'i'], dtype='object')
# Index([], dtype='object')