How to replace values that occur fewer than 50 times (per value_counts) in Python

I'm new to Python.

I have a data set with a column "emaildominio" that looks like this:

1 gmail.com
2 hotmail.com
3 yahoo.com
4 mydomain.com
5 gmail.com
..

I would like to replace every email domain that occurs fewer than 50 times with the value 'other'.

I tried to do this with a for loop, but I can't see where it goes wrong.

s = df["emaildominio"].value_counts()
x = s[s>50]
for z in df:
    if z not in x:
        df["emaildominio"] = df["emaildominio"].replace([z],'other')
    else:
        continue

Where am I going wrong?

CodePudding user response:

You can use groupby.transform('size') and boolean indexing:

threshold = 2 # using 2 for the example, you want 50 here

# identify rows whose domain occurs fewer than threshold times
m = df.groupby('emaildominio')['emaildominio'].transform('size').lt(threshold)

# update
df.loc[m, 'emaildominio'] = 'other'
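
To see what the mask contains, here is a minimal runnable sketch of the groupby approach above. The five-row frame is a made-up stand-in for the question's sample, and the threshold of 2 is only for illustration:

```python
import pandas as pd

# Hypothetical sample data mirroring the five rows shown in the question
df = pd.DataFrame(
    {"emaildominio": ["gmail.com", "hotmail.com", "yahoo.com",
                      "mydomain.com", "gmail.com"]},
    index=range(1, 6),
)

threshold = 2  # illustrative; the question asks for 50

# transform('size') broadcasts each group's size back onto its rows
counts = df.groupby("emaildominio")["emaildominio"].transform("size")

# True for rows whose domain occurs fewer than threshold times
m = counts.lt(threshold)

# overwrite only the masked rows
df.loc[m, "emaildominio"] = "other"
print(df)
```

Because `counts` has the same index as `df`, the boolean mask lines up row by row and no explicit loop is needed.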

Alternative with value_counts:

threshold = 2

# identify domains that occur fewer than threshold times
drop = df['emaildominio'].value_counts().loc[lambda x: x<threshold].index

# find rows and update
df.loc[df['emaildominio'].isin(drop), 'emaildominio'] = 'other'

Output (with threshold = 2):

  emaildominio
1    gmail.com
2        other
3        other
4        other
5    gmail.com
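
As for the loop in the question: `for z in df` iterates over the column labels of the DataFrame, not over the values in the column, so `z` is only ever a column name and `replace` finds nothing to change. In addition, `z not in x` tests membership in the index of `x`. A corrected loop might look like the sketch below. The sample data and the threshold of 2 are assumptions for illustration, and the vectorized versions above are likely to be faster on large data:

```python
import pandas as pd

# Hypothetical sample data mirroring the question
df = pd.DataFrame({"emaildominio": ["gmail.com", "hotmail.com", "yahoo.com",
                                    "mydomain.com", "gmail.com"]})

# Iterating over a DataFrame yields its column labels
labels = list(df)
print(labels)

threshold = 2  # illustrative; the question asks for 50
s = df["emaildominio"].value_counts()
frequent = s[s >= threshold]  # domains to keep

# Loop over the distinct domains instead of over the DataFrame
for domain in s.index:
    if domain not in frequent.index:
        df["emaildominio"] = df["emaildominio"].replace(domain, "other")

print(df)
```

Note that this sketch uses `>=` so that a domain with exactly `threshold` occurrences is kept, which matches "fewer than 50" in the question; the original `s > 50` would also replace domains with exactly 50 occurrences.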

CodePudding user response:

I would use collections.Counter to count the occurrences of each value in the emaildominio column, then use it in a small custom function that returns "other" if the count is less than 50.

import pandas as pd
from collections import Counter

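# each domain appears 30 times, except gmail.com (60 times), so only gmail.com reaches 50
df = pd.DataFrame({'emaildominio':['gmail.com', 'hotmail.com', 'yahoo.com', 'mydomain.com', 'gmail.com']*30})

# count how often each domain occurs
c = Counter(df.emaildominio.values)

def f(row):
    # replace domains seen fewer than 50 times
    if c[row.emaildominio] < 50:
        return 'other'
    return row.emaildominio

# apply the function row by row and store the result in a new column
df['new_col'] = df.apply(f, axis=1)
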
print(df)

     emaildominio    new_col
0       gmail.com  gmail.com
1     hotmail.com      other
2       yahoo.com      other
3    mydomain.com      other
4       gmail.com  gmail.com
..            ...        ...
145     gmail.com  gmail.com
146   hotmail.com      other
147     yahoo.com      other
148  mydomain.com      other
149     gmail.com  gmail.com
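
A row-wise `apply` calls the Python function once per row, which can be slow on large frames. As a possible alternative (a sketch that swaps `Counter` for `value_counts` with `Series.map` and `Series.where`, not the method used in this answer), the same `new_col` can be built without a per-row function:

```python
import pandas as pd

# Same hypothetical data as above: gmail.com occurs 60 times, the others 30
df = pd.DataFrame({"emaildominio": ["gmail.com", "hotmail.com", "yahoo.com",
                                    "mydomain.com", "gmail.com"] * 30})

# look up each row's domain count via the value_counts Series
counts = df["emaildominio"].map(df["emaildominio"].value_counts())

# keep the domain where its count is at least 50, otherwise use 'other'
df["new_col"] = df["emaildominio"].where(counts >= 50, "other")

print(df)
```

`where` keeps the original value wherever the condition is True and substitutes the second argument elsewhere, so the condition expresses what to keep rather than what to replace.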