I am working on an assignment that requires us to extract phone numbers, emails, and websites from a text document. The lecturer wants the output as a list of tuples, each containing the start index, the length, and the matched text. For example: [(1,10,'0909900008'), (35,16,'[email protected]')], ... Since there are three different kinds of matches, how can I put all of them into one list of tuples? I have written three regular expressions, but I can't work out how to collect their matches in a single list. Should I create a new expression that describes all three? Thanks for your help.
import re

# `string` holds the contents of the text document
result = []
# Match with RE
# Emails
email_pattern = r'[\w\.-]+@[\w\.-]+(?:\.[\w]+)+'
email = re.findall(email_pattern, string)
for match in re.finditer(email_pattern, string):
    print(match.start(), match.end() - match.start(), match.group())
# Phone numbers
phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
phone = re.findall(phone_pattern, string)
for match in re.finditer(phone_pattern, string):
    print(match.start(), match.end() - match.start(), match.group())
# Websites
website_pattern = r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
web = re.findall(website_pattern, string)
for match in re.finditer(website_pattern, string):
    print(match.start(), match.end() - match.start(), match.group())
My output:
# Text document
should we use regex more often? let me know at [email protected] or [email protected]. To further notice, contact Khoi at 0957507468 or accessing
https://web.de or maybe www.google.com, or Mr.Q at 0912299922.
# Output
47 21 [email protected]
72 13 [email protected]
122 10 0957507468
197 10 0912299922
146 14 https://web.de
170 15 www.google.com,
CodePudding user response:
Rather than printing each match, append it to the result list and print the list afterwards, i.e. change
print(match.start(), match.end() - match.start(), match.group())
to
result.append((match.start(), match.end() - match.start(), match.group()))
and do the same in the other two loops, then at the end
print(result)
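Putting that together, here is a minimal sketch of how it might look with one loop over all three patterns. The patterns are the ones from the question, assuming the `+` quantifiers that appear to have been lost in the post; the names `PATTERNS`, `collect_matches` and `sample` are made up for illustration, and the final sort by start index is optional.

```python
import re

# Patterns from the question (the `+` quantifiers are assumed).
EMAIL = r'[\w\.-]+@[\w\.-]+(?:\.[\w]+)+'
PHONE = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
WEBSITE = (
    r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}'
    r'|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}'
    r'|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}'
    r'|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
)
PATTERNS = [EMAIL, PHONE, WEBSITE]  # hypothetical name


def collect_matches(text, patterns):
    """Return (start, length, match) tuples for every pattern, in text order."""
    result = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            # Same tuple as in the answer above: start index, length, matched text
            result.append((match.start(), match.end() - match.start(), match.group()))
    # Optional: order the tuples by where they start in the text
    result.sort(key=lambda item: item[0])
    return result


# Made-up sample text, not the document from the question
sample = "Call 0957507468 or see www.google.com now"
print(collect_matches(sample, PATTERNS))
```

This way there is no need to merge the three expressions into one; each pattern stays readable and all matches still end up in a single list.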