I am merging multiple excel sheets into a pandas DataFrame and extracting the email address from a Report_Name column. I have been able to execute what I need but the hardship is that there are over 5000 email addresses that I would have to run checks against (in this example only five). While I could type out all the conditions for these, it's not practical.
So my question is:
Can I pass a single list that includes these 5000 email addresses as a condition (somehow) in an if statement? Below is my code of how I am currently doing it.
| **Report_Name** |
|--------------------------------------------------- |
|SYN-Laptops-Nov10 (002) |
|something_offer [email protected] |
|another thing [email protected] |
|my offer is attached ooo 12-31 [email protected] |
|copy of offer [email protected] |
|private offering copy [email protected] |
#----------- extract the email address from the Report Name column -----------#
#blank list to collect and store the email addresses
collected_emails = []
#----------- For Loop to iterate through the values under the ['Report_Name'] column -----------#
for report_name_value in excel_df['Report_Name']:
if '[email protected]' in report_name_value:
collected_emails.append('[email protected]')
elif '[email protected]' in report_name_value:
collected_emails.append('[email protected]')
elif '[email protected]' in report_name_value:
collected_emails.append('[email protected]')
elif '[email protected]' in report_name_value:
collected_emails.append('[email protected]')
elif '[email protected]' in report_name_value:
collected_emails.append('[email protected]')
else:
collected_emails.append('No Email Address')
#create DataFrame for the collected emails
collected_emails_df = pd.DataFrame(collected_emails, columns = ['Email_Address'])
#create master_df to concat both the excel_df and collected_emails_df together
master_df = pd.concat([excel_df, collected_emails_df,], axis = 1)
#export master DataFrame to an excel file and save it on the SharePoint directory
master_df.to_excel(output_local 'Offers.xlsx', index=False)}
RESULT
| **Report_Name** | **Email_Address** |
| SYN-Laptops-Nov10 (002) | No Email Address |
| something_offer [email protected] | [email protected] |
| another thing [email protected] | [email protected] |
| my offer is attached ooo 12-31 [email protected] | [email protected] |
| copy of offer [email protected] | [email protected] |
| private offering copy [email protected] | [email protected] |
I am a beginner with python and was unable to pull up any references of how to tackle this problem specifically. Hence the post. Thanks for your time and I appreciate any advice you can offer.
CodePudding user response:
Sounds like a good case for lambda and filter!
list_to_check = ['[email protected]', ...] # Pass this in to your function
is_match_func = lambda x: x in excel_df['Report_Name']
all_matches = list(filter(is_match_func, list_to_check))
for email in all_matches:
collected_emails.append(email)
CodePudding user response:
Given:
# df
Report_Name
0 SYN-Laptops-Nov10 (002)
1 something_offer [email protected]
2 another thing [email protected]
3 my offer is attached ooo 12-31 rocksteps@domai...
4 copy of offer [email protected]
5 private offering copy [email protected]
Doing:
df['Emails'] = df.Report_Name.str.extract('(\w @\w \.\w )')
print(df)
- Regex Explanation - basically, we know that there aren't spaces in an email, and that they all contain [chars
@
chars.
chars]
Output:
Report_Name Emails
0 SYN-Laptops-Nov10 (002) NaN
1 something_offer [email protected] [email protected]
2 another thing [email protected] [email protected]
3 my offer is attached ooo 12-31 rocksteps@domai... [email protected]
4 copy of offer [email protected] [email protected]
5 private offering copy [email protected] [email protected]
It's not clear what you want to do after this point...
CodePudding user response:
You can simply use pandas.Series.str.extract
with a regular expression to extract all the emails.
Try this :
collected_emails = (
excel_df['Report_Name']
.str.extract(r'([a-zA-Z0-9.-_] @. \.com)')
.dropna()
.squeeze()
.tolist()
)
# Output :
print(collected_emails)
['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']