I'm a novice with Python.
I have a data set with some rows and a column "emaildominio" like this:
1 gmail.com
2 hotmail.com
3 yahoo.com
4 mydomain.com
5 gmail.com
..
I would like to replace every email domain that has fewer than 50 occurrences with the value 'other'.
I'm trying to do this with a for loop, but I don't know where I'm going wrong.
s = df["emaildominio"].value_counts()
x = s[s>50]
for z in df:
    if z not in x:
        df["emaildominio"] = df["emaildominio"].replace([z],'other')
    else:
        continue
Where am I going wrong?
CodePudding user response:
You can use groupby.transform('size') and boolean indexing:
threshold = 2 # using 2 for the example, you want 50 here
# identify rows with less than threshold occurrences
m = df.groupby('emaildominio')['emaildominio'].transform('size').lt(threshold)
# update
df.loc[m, 'emaildominio'] = 'other'
Alternative with value_counts:
threshold = 2
# identify domains with less than threshold occurrences
drop = df['emaildominio'].value_counts().loc[lambda x: x<threshold].index
# find rows and update
df.loc[df['emaildominio'].isin(drop), 'emaildominio'] = 'other'
output:
emaildominio
1 gmail.com
2 other
3 other
4 other
5 gmail.com
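As for why the loop in the question has no effect: iterating over a DataFrame (`for z in df`) yields its column labels, not the values of a column, so `z` is only ever the string 'emaildominio' and never a domain. If you want to keep a loop, a minimal sketch could look like the following. It assumes the five-row sample from the question and the example threshold of 2; the variable names `labels`, `counts`, and `result` are only illustrative.

```python
import pandas as pd

# Sample data assumed from the question (index 1..5).
df = pd.DataFrame(
    {"emaildominio": ["gmail.com", "hotmail.com", "yahoo.com",
                      "mydomain.com", "gmail.com"]},
    index=range(1, 6),
)
threshold = 2  # use 50 for the real data

# Iterating over a DataFrame yields column labels, not cell values.
labels = list(df)  # ['emaildominio']

# Loop over the distinct domains and their counts instead.
counts = df["emaildominio"].value_counts()
for domain, n in counts.items():
    if n < threshold:
        df["emaildominio"] = df["emaildominio"].replace(domain, "other")

result = df["emaildominio"].tolist()
```

This still scans the column once per rare domain, so the boolean-indexing versions above are likely the better choice for large data.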
CodePudding user response:
I would use collections.Counter to count the occurrences of each value in the emaildominio
column, then use it in a small custom function that returns "other"
if the count is less than 50.
import pandas as pd
from collections import Counter
df = pd.DataFrame({'emaildominio':['gmail.com', 'hotmail.com', 'yahoo.com', 'mydomain.com', 'gmail.com']*30}) #only gmail.com will be >50
c = Counter(df.emaildominio.values)
def f(row):
    if c[row.emaildominio] < 50:
        return 'other'
    return row.emaildominio
df['new_col'] = df.apply(f, axis=1)
print(df)
emaildominio new_col
0 gmail.com gmail.com
1 hotmail.com other
2 yahoo.com other
3 mydomain.com other
4 gmail.com gmail.com
.. ... ...
145 gmail.com gmail.com
146 hotmail.com other
147 yahoo.com other
148 mydomain.com other
149 gmail.com gmail.com
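Note that `df.apply(f, axis=1)` calls the Python function once per row, which can become slow on large frames. A possible vectorized variant, sketched on the same sample data, maps each domain to its count and uses `Series.where` to keep only the values that meet the threshold. The names `counts` and `per_row` are illustrative, not part of the answer above.

```python
import pandas as pd

# Same sample data as above: gmail.com appears 60 times, the others 30 each.
df = pd.DataFrame({"emaildominio": ["gmail.com", "hotmail.com", "yahoo.com",
                                    "mydomain.com", "gmail.com"] * 30})

# Count each domain, then look up the count for every row.
counts = df["emaildominio"].value_counts()
per_row = df["emaildominio"].map(counts)

# Keep the domain where its count is at least 50, otherwise use 'other'.
df["new_col"] = df["emaildominio"].where(per_row >= 50, "other")
```

This should produce the same new_col as the apply-based version, without a Python-level call for each row.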