I'm trying to write a simple regular expression that retrieves names based on the presence of a character string at the end of a line.
I've been successful at isolating each of these patterns using pythex in my data set, but I have been unable to match them as a conditional group.
Can someone explain what I am doing wrong?
Data Example
Mark Samson: CA
Sam Smith: US
Dawn Watterton: CA
Neil Shughar: CA
Fennial Fontaine: US
I want to create a regular expression that uses the end of each line as the condition for the group match, i.e., I want a list of those who live in the US from this dataset. I have used each of these expressions in isolation, and each seems to match what I am looking for. What I need is help combining them into a single grouped search.
Does anyone have any suggestions?
([US]$)([A-Z][a-z]+)
CodePudding user response:
Something like the following?
(\w+[ \w]*): US
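A minimal sketch of how a pattern like this could be applied, assuming the quantified form (\w+[ \w]*): US and the sample data from the question held in a single multi-line string. The variable names data and us_names are made up for illustration.

```python
import re

# Sample data from the question, as one multi-line string (assumed layout).
data = """Mark Samson: CA
Sam Smith: US
Dawn Watterton: CA
Neil Shughar: CA
Fennial Fontaine: US"""

# Group 1 captures word characters and spaces that directly precede ": US".
# [ \w] does not include newlines, so a match cannot span two lines.
us_names = re.findall(r"(\w+[ \w]*): US", data)
print(us_names)
```

Note that this pattern is not anchored to the end of the line, so a value such as "USA" would also match, and names containing hyphens or apostrophes would only be captured partially.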
CodePudding user response:
You say "I have been unable to match them as a conditional group", but you are not using any conditional groups. ([US]$)([A-Z][a-z]+) is an example of a pattern that never matches any string: it matches U or S, then requires the end of the string, and only then tries to match an uppercase ASCII letter and one or more lowercase ASCII letters.
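The point about the original pattern can be checked with a quick sketch. It assumes the pattern was ([US]$)([A-Z][a-z]+); the sample strings are made up for illustration.

```python
import re

# The pattern from the question: the end-of-string anchor sits in the
# middle, so nothing can be consumed after it.
pattern = re.compile(r"([US]$)([A-Z][a-z]+)")

# Hypothetical inputs, including ones where letters follow U or S.
samples = ["Sam Smith: US", "US", "USam", "U\nSam"]

results = [pattern.search(s) for s in samples]
print(results)  # every entry is None
```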
You want to match any text from the start of the string up to a colon, optional whitespace, and the substring US at the end of the string.
Hence, use
.+?(?=:\s*US$)
or
^(.+?):\s*US$
See the regex demo. Details:
.+? - one or more chars other than line break chars, as few as possible
(?=:\s*US$) - a positive lookahead that matches a location immediately followed by a colon, zero or more whitespace chars, the US string, and the end of string.
See a Python demo:
import re

texts = ["Mark Samson: CA", "Sam Smith: US", "Dawn Watterton: CA", "Neil Shughar: CA", "Fennial Fontaine: US"]
for text in texts:
    # Lazily match up to the position followed by ":", optional whitespace, and "US" at the end
    match = re.search(r".+?(?=:\s*US$)", text)
    if match:
        print(match.group())  # With r"^(.+?):\s*US$" regex, use match.group(1) here
Output:
Sam Smith
Fennial Fontaine
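If the data arrives as one multi-line string rather than a list, the capturing-group variant could be used with re.MULTILINE so that ^ and $ anchor at each line. This is a sketch under that assumption; data and us_names are illustrative names.

```python
import re

# Same sample data, joined into a single string (assumed input shape).
data = "\n".join([
    "Mark Samson: CA",
    "Sam Smith: US",
    "Dawn Watterton: CA",
    "Neil Shughar: CA",
    "Fennial Fontaine: US",
])

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# findall returns the contents of group 1 for each matching line.
us_names = re.findall(r"^(.+?):\s*US$", data, flags=re.MULTILINE)
print(us_names)
```

One caveat: \s also matches newlines, so a line ending in a bare colon followed by a line reading US would be treated as a match. Replacing \s* with [ \t]* would keep the match on a single line.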