I have this text where I want to identify only certain three digit numbers using my depart city (NYC) as the positive lookbehind expression. I don't want to include it or anything else in the result, other than the desired three digit number. I can't simply use \d{3,} because there are other three digit numbers in this text I haven't included here which should not be in the output.
"Depart: NYC (etd 9/30), NJ (etd 10/4)
Arrive LAX
Rate: USD500, 700P"
With this code
(?<=NYC)(\D|\S)*\d{3,}
outputs
" (etd 9/30), NJ (etd 10/4) Arrive LAX Rate: USD500, 700"
but I want it to output "700" only. I also want to write a regex that will only output 500 without using USD as the positive lookbehind expression. Is this possible? I've also tried
(?<=NYC)(?<=(\D|\S)*)\d{3,}
but this doesn't output anything.
CodePudding user response:
You can use
(?s)NYC.*?\b(\d{3,})
See the regex demo. Details:
(?s) - the re.DOTALL inline modifier
NYC - the NYC word
.*? - any zero or more chars, as few as possible
\b - a word boundary
(\d{3,}) - Group 1: three or more digits.
See the Python demo:
import re
text = """Depart: NYC (etd 9/30), NJ (etd 10/4)
Arrive LAX
Rate: USD500, 700P"""
m = re.search(r'(?s)NYC.*?\b(\d{3,})', text)
if m:
    print(m.group(1))
# => 700
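To see why the original attempt returned too much, it can help to compare three variants side by side. This is only a sketch on the sample text from the question; the variable names greedy, lazy and bounded are made up for illustration.

```python
import re

text = """Depart: NYC (etd 9/30), NJ (etd 10/4)
Arrive LAX
Rate: USD500, 700P"""

# Greedy: (?:\D|\S)* runs to the end of the text and backtracks only as far
# as needed, so the overall match stretches up to the LAST run of 3+ digits.
greedy = re.search(r'(?<=NYC)(?:\D|\S)*\d{3,}', text)

# Lazy, no word boundary: stops at the FIRST run of 3+ digits after NYC.
lazy = re.search(r'(?s)NYC.*?(\d{3,})', text)

# Lazy with \b: the digits must start at a word boundary, so the 500 in
# USD500 is skipped (D and 5 are both word characters).
bounded = re.search(r'(?s)NYC.*?\b(\d{3,})', text)

print(repr(greedy.group(0)))  # everything from after NYC up to and including 700
print(lazy.group(1))          # 500
print(bounded.group(1))       # 700
```

In all three cases the surrounding text is still consumed by the match; it is the capturing group that isolates the number.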
CodePudding user response:
(\D|\S) matches any character that is not a digit, or any character that is not whitespace. Since no character is both a digit and whitespace, this alternation matches any character, newlines included. It can also be written as [\s\S], or you can let the dot match any character using the inline modifier (?s) or the re.DOTALL flag.
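A quick way to check that these spellings behave the same is to run them over a string that mixes digits, whitespace and a newline. This is a minimal sketch; the sample string is arbitrary.

```python
import re

sample = "a1 \n\t_"  # letter, digit, space, newline, tab, underscore

alternation = re.findall(r'(?:\D|\S)', sample)
char_class = re.findall(r'[\s\S]', sample)
dotall = re.findall(r'.', sample, re.DOTALL)
plain_dot = re.findall(r'.', sample)  # no DOTALL: the newline is not matched

# The first three match every single character of the sample.
print(alternation == char_class == dotall == list(sample))
# The plain dot finds one character fewer, because it skips the newline.
print(len(sample) - len(plain_dot))
```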
To match only the first occurrence of 3 digits without using USD as a positive lookbehind, you can capture 3 digits that are not surrounded by other digits:
^[\s\S]*?(?<!\d)(\d{3})(?!\d)
The pattern matches:
^ - Start of string
[\s\S]*? - Match any char including newlines, as few as possible
(?<!\d) - Assert not a digit to the left
(\d{3}) - Capture 3 digits
(?!\d) - Assert not a digit to the right
import re
pattern = r"^.*?(?<!\d)(\d{3})(?!\d)"
s = ("\"Depart: NYC (etd 9/30), NJ (etd 10/4)\n"
"Arrive LAX\n"
"Rate: USD500, 700P\"\n")
m = re.search(pattern, s, re.DOTALL)
if m:
    print(m.group(1))
Output
500
To match the 700 after NYC, you can capture 3 digits preceded by a word boundary and assert that no digit follows. The 500 in USD500 is skipped because there is no word boundary between D and 5:
^[\s\S]*?\b(\d{3})(?!\d)
import re
pattern = r"^.*?\b(\d{3})(?!\d)"
s = ("\"Depart: NYC (etd 9/30), NJ (etd 10/4)\n"
"Arrive LAX\n"
"Rate: USD500, 700P\"\n")
m = re.search(pattern, s, re.DOTALL)
if m:
    print(m.group(1))
Output
700
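Note that the last pattern never mentions NYC, so it finds the 700 wherever it sits in the text. If the number has to come after the departure city, one option is to put the city in front of the lazy part, much like the other answer does. The helper below is a hypothetical sketch (the name first_number_after and its parameters are not from the question); it assumes the wanted number starts at a word boundary and is exactly 3 digits long.

```python
import re

def first_number_after(text, anchor):
    """Return the first 3-digit number after `anchor` that starts at a word
    boundary and is not followed by another digit, or None if there is none."""
    # re.escape keeps anchors containing regex metacharacters literal.
    pattern = re.escape(anchor) + r'.*?\b(\d{3})(?!\d)'
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None

text = ("Depart: NYC (etd 9/30), NJ (etd 10/4)\n"
        "Arrive LAX\n"
        "Rate: USD500, 700P")

print(first_number_after(text, "NYC"))  # 700
print(first_number_after(text, "SFO"))  # None, SFO is not in the text
```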