I am having issues with email addresses that are invalid as given, but that can be converted to valid email addresses with a small correction.
For example:
[email protected], --- Not valid
'[email protected], --- Not valid
([email protected]), --- Not valid
([email protected]), --- Not valid
:[email protected], --- Not valid
//[email protected] --- Not valid
[email protected] --- valid
...
I could write "if/else" checks, but whenever a new email address arrives with a new issue, I would need to add another branch and update the code every time.
What is the best way to clean up all these small issues: a Python package or a regex? Please suggest.
CodePudding user response:
You can do this (I keep each character of the email only if it is alphanumeric, a dot, or an "@", and remove it otherwise):
emails = [
    '[email protected]',
    '([email protected])',
    '([email protected])',
    ':[email protected]',
    '//[email protected]',
    '[email protected]'
]
def correct_email_format(email):
    # Keep only letters, digits, dots and the "@" sign; drop everything else.
    return ''.join(e for e in email if (e.isalnum() or e in ['.', '@']))

for email in emails:
    corrected_email = correct_email_format(email)
    print(corrected_email)
Output:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
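One limitation of the character filter above is that it also drops characters that are legal in addresses, such as "-", "_" and "+". Since the question also asks about regex, here is a minimal sketch of extracting the address instead of filtering characters. It assumes each string contains a single address surrounded by junk. The pattern is a simplified approximation rather than a full RFC 5322 validator, and the name extract_email and the sample addresses are hypothetical:

```python
import re

# Simplified pattern (an assumption, not a full RFC 5322 validator):
# a local part starting with a letter or digit, "@", then dot-separated domain labels.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._%+-]*@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
)

def extract_email(raw):
    """Return the first email-like substring in raw, or None if there is none."""
    match = EMAIL_RE.search(raw)
    return match.group(0) if match else None

# Hypothetical sample inputs, not taken from the question.
samples = [
    "'jane.doe@example.com,",
    "(john_smith@example.org),",
    "://a-b@example.co.uk",
    "no address here",
]
for s in samples:
    print(repr(s), "->", extract_email(s))
```

Returning None for strings without a match makes it easy to collect the entries that still need manual attention instead of silently passing them through.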
CodePudding user response:
Data clean-up is messy, but I found that defining a set of rules is an easy way to manage it (the order of the rules matters):
rules = [
    # Assumption: this rule normalized a non-breaking space; as extracted it was a no-op.
    lambda s: s.replace('\u00a0', ' '),
    lambda s: s.strip(" ,'"),
]
addresses = [
    ' [email protected],',
    '[email protected],'
]
for a in addresses:
    # Apply every rule in order to the current address.
    for r in rules:
        a = r(a)
    print(a)
Here is the resulting output:
[email protected]
[email protected]
Make sure you write a test suite that covers both invalid and valid data. The rules are easy to break, and you may be tweaking them often.
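Such a test suite could be sketched as a small table of input and expected-output pairs. This is only an illustration: the clean helper, the RULES list (including the non-breaking-space rule) and the sample addresses are assumptions, not part of the original data:

```python
# Rules loosely mirroring the ones above; the non-breaking-space rule is an assumption.
RULES = [
    lambda s: s.replace("\u00a0", " "),
    lambda s: s.strip(" ,'"),
]

def clean(address):
    """Apply every rule in order (hypothetical helper)."""
    for rule in RULES:
        address = rule(address)
    return address

# (raw input, expected result) pairs covering both invalid and valid data.
CASES = [
    ("\u00a0user@example.com,", "user@example.com"),  # non-breaking space and trailing comma
    ("'user@example.com,", "user@example.com"),       # leading quote and trailing comma
    ("user@example.com", "user@example.com"),         # already valid, must stay unchanged
]

def run_tests():
    for raw, expected in CASES:
        result = clean(raw)
        assert result == expected, f"{raw!r} -> {result!r}, expected {expected!r}"

run_tests()
print("all tests passed")
```

Whenever a new kind of broken address shows up, add it to CASES first and then adjust the rules until every case passes again.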