I have a dataframe from pandas like this.
ID email
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]
5 [email protected]
6 [email protected]
I want to learn company from email(after @).Sample output like this.
Sample output
ID email WorkGoogle WorkFacebook etc.....
1 [email protected] Yes No ..
2 [email protected] No Yes ..
3 [email protected] Yes No ..
4 [email protected] No No ..
5 [email protected] No No ..
6 [email protected] No Yes ..
Need to care Uppercase lowercase.
CodePudding user response:
workplace = df.email.rename("workplace").apply(lambda x: x.split("@")[1].lower())
pd.concat([df,
pd.DataFrame(workplace.apply(lambda x: {x: "Yes"}).to_list(), index=df.index)],
axis=1).fillna("No")
# ID email google.com facebook.com tesla.com hilton.com
# 0 1 [email protected] Yes No No No
# 1 2 [email protected] No Yes No No
# 2 3 [email protected] Yes No No No
# 3 4 [email protected] No No Yes No
# 4 5 [email protected] No No No Yes
# 5 6 [email protected] No Yes No No
But maybe you can just add a column instead of multiple
df["workplace"] = df.email.rename("workplace").str.lower().str.split("@").str[1]
# Then you could do
df.groupby("workplace").agg(list)
# ID email
# workplace
# facebook.com [2, 6] [[email protected], [email protected]]
# google.com [1, 3] [[email protected], [email protected]]
# hilton.com [5] [[email protected]]
# tesla.com [4] [[email protected]]
CodePudding user response:
here is the dynamic way without looping, using pivot_table
:
df['domain'] = df['email'].str.split('@').str[1].str.split('.').str[0].str.lower()
df = df.pivot_table(index=['ID','email'], columns='domain',aggfunc=lambda x: 'Yes' if len(x)> 0 else 'No', fill_value='No')
output:
>>
domain facebook google hilton tesla
ID email
1 [email protected] No Yes No No
2 [email protected] Yes No No No
3 [email protected] No Yes No No
4 [email protected] No No No Yes
5 [email protected] No No Yes No
6 [email protected] Yes No No No
CodePudding user response:
Assuming df is your dataframe, please try this:
import numpy as np
df['workplace'] = df['email'].str.split('@',1).apply(lambda x:x[1].split('.',1)[0])
for workplace in df['workplace'].unique():
df.loc[:,'Work' workplace] = 'No'
df['Work' workplace] = np.where(df['workplace']==workplace,'Yes','No')
df = df.drop(columns=['workplace'],axis=1)
CodePudding user response:
FYI: this solution is not performance efficient. I am sure in the comments on this answer, you may find a more efficient solution
I would first make a list of all companies by saying:
companies = set([email.split('@')[1].split('.')[0].lower() for email in df['email']])
Then simply iterate over this:
for company in companies:
df['Work' company.capitalize()] = df['email'].apply(lambda x: x.split("@")[1].lower()).str.contains(company)
CodePudding user response:
if you are OK with manually making the new column for each domain, you can use this
df['WorkGoogle'] = df['email'].str.lower().str.contains('google')
df['WorkFacebook'] = df['email'].str.lower().str.contains('facebook')
# etc etc
That will give you True/False rather than Yes/No. If you want Yes/No, you can map
df['WorkGoogle'] = df['email'].str.lower().str.contains('google').map({True:'Yes',False:'No'})