I am quite new to Python (3.9) but with everything available online I thought I might be able to solve a problem.
I am trying to extract a person's name from an invoice. The name may be 2-3 consecutive words of any length and may occasionally contain a hyphen.
Phone: (111) 311-1111
Desired Name: Friday twk-test Date of Birth: 01/01/1988
Here is what I have so far:
(?<=Desired Name:\s{3}[A-Za-z])[A-Za-z]+\s[A-Za-z]+
Match:
riday twk
The output needs to be:
Friday twk-test
CodePudding user response:
You can use
\bDesired Name:\s*([^\W\d_]+(?:[\s-]+[^\W\d_]+){1,2})
See the regex demo.
Details:
\b - a word boundary
Desired Name: - a literal string
\s* - zero or more whitespaces
([^\W\d_]+(?:[\s-]+[^\W\d_]+){1,2}) - Group 1: two or three words consisting of only Unicode letters, separated by one or more whitespaces or hyphens:
[^\W\d_]+ - one or more Unicode letters
(?:[\s-]+[^\W\d_]+){1,2} - one or two sequences of:
[\s-]+ - one or more whitespaces or - chars
[^\W\d_]+ - one or more Unicode letters.
If there can be only a single whitespace or hyphen between the words, remove the + after [\s-].
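The breakdown above can also be written directly into the pattern using Python's re.VERBOSE flag. This is only a sketch of the same regex with comments; the one change is that the space in the label has to be escaped, because verbose mode ignores unescaped whitespace in the pattern. The sample string is the one from the question.

```python
import re

pattern = re.compile(
    r"""
    \bDesired\ Name:      # literal label (space escaped for verbose mode)
    \s*                   # zero or more whitespaces
    (                     # Group 1: the name
      [^\W\d_]+           #   first word: one or more Unicode letters
      (?:                 #   then one or two further words, each made of:
        [\s-]+            #     one or more whitespaces or hyphens
        [^\W\d_]+         #     one or more Unicode letters
      ){1,2}
    )
    """,
    re.VERBOSE,
)

text = "Phone: (111) 311-1111\nDesired Name: Friday twk-test Date of Birth: 01/01/1988"
match = pattern.search(text)
name = match.group(1) if match else None
print(name)  # Friday twk-test
```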
See a Python demo:
import re
text = "Phone: (111) 311-1111\nDesired Name: Friday twk-test Date of Birth: 01/01/1988"
pattern = r"\bDesired Name:\s*([^\W\d_]+(?:[\s-]+[^\W\d_]+){1,2})"
match = re.search(pattern, text)
if match:
    print(match.group(1))
# => Friday twk-test
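One thing to watch for: this pattern counts words and does not look at what follows the name, so a two-word name directly followed by "Date of Birth" would pick up "Date" as a third word. The sketch below compares the pattern with a variant that adds a lookahead for the following label. The variant, the helper name extract_name, and the second sample line are assumptions for illustration, and the variant only works if "Date of Birth" always comes right after the name.

```python
import re

# Pattern from the answer above: two or three letter-only words.
WORDS_RE = re.compile(r"\bDesired Name:\s*([^\W\d_]+(?:[\s-]+[^\W\d_]+){1,2})")

# Hypothetical variant: the name must be followed by the "Date of Birth"
# label (an assumption about the invoice layout).
ANCHORED_RE = re.compile(
    r"\bDesired Name:\s*([^\W\d_]+(?:[\s-]+[^\W\d_]+){1,2})(?=\s+Date of Birth)"
)

def extract_name(pattern, text):
    """Return group 1 of the first match, or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None

three_parts = "Desired Name: Friday twk-test Date of Birth: 01/01/1988"
two_parts = "Desired Name: Anna Smith Date of Birth: 02/02/1990"  # made-up sample

print(extract_name(WORDS_RE, three_parts))     # Friday twk-test
print(extract_name(WORDS_RE, two_parts))       # Anna Smith Date
print(extract_name(ANCHORED_RE, three_parts))  # Friday twk-test
print(extract_name(ANCHORED_RE, two_parts))    # Anna Smith
```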
CodePudding user response:
Assuming that all of your invoices follow this same structure, then you can use this regex:
\bDesired Name:\s*([A-Za-z\s\-]+?(?=\s+Date of Birth))
A demo is here: regex101 demo
What this does is:
\b : word boundary
Desired Name: : the literal string that we know comes before the name
\s* : match zero or more whitespaces
([A-Za-z\s\-]+?(?=\s+Date of Birth)) : a capturing group to match the name
[A-Za-z\s\-]+? : lazily match one or more letters (either upper or lower case), whitespaces, or hyphens
(?=\s+Date of Birth) : positive lookahead, so the match runs up to, but does not include, the whitespace before this string.
What this means is that if someone's first name and last name both have a hyphen, and they also have another name, the entire name will be captured.
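A minimal Python sketch of this approach, assuming every invoice line has "Date of Birth" right after the name. The second sample line is made up to show a name with several hyphenated parts.

```python
import re

# Lazy character class that stops at the whitespace before "Date of Birth".
pattern = re.compile(r"\bDesired Name:\s*([A-Za-z\s\-]+?(?=\s+Date of Birth))")

lines = [
    "Desired Name: Friday twk-test Date of Birth: 01/01/1988",
    # Made-up sample: two hyphenated names plus one more name.
    "Desired Name: Mary-Ann Smith-Jones Lee Date of Birth: 03/03/1985",
]

# Keep group 1 of each line that matches; skip lines without a match.
names = [m.group(1) for m in map(pattern.search, lines) if m]
print(names)
# ['Friday twk-test', 'Mary-Ann Smith-Jones Lee']
```

Because the character class is [A-Za-z\s\-], a name containing a letter outside A-Z (for example an accented letter) would not match at all with this pattern.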