I know similar questions have already been asked on the platform, but I checked them and did not find the help I needed.
I have some strings such as:
path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"
path = "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression"
I have a function:
def extract_id(path):
    pattern = re.compile(r"([0-9]+(_[0-9]+)+)", re.IGNORECASE)
    return pattern.match(path)
The expected results are 5438_133195_9917949_1218833 and 2356_15890_9397775. I tested the function online, and it seems to produce the expected result, but it's returning None in my app. What am I doing wrong? Thanks.
CodePudding user response:
You don't need any capture groups. You can get the match only and return .group() using re.search:
\b\d+(?:_\d+)+\b
\b A word boundary
\d+ Match 1+ digits
(?:_\d+)+ Repeat 1+ times: _ and 1+ digits
\b A word boundary
import re
path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"
pattern = re.compile(r"\b\d+(?:_\d+)+\b")
def extract_id(path):
    return pattern.search(path).group()
print(extract_id(path))
Output
5438_133195_9917949_1218833
CodePudding user response:
match only matches at the beginning of the string. What you want is search. You have to use group to retrieve the matched text from a search. You don't need re.IGNORECASE if you are looking for characters that don't have a case. You should compile your regex only once. Compiling a pattern that never changes, every time a function is called, is not optimal.
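To make the difference between match and search concrete, here is a small sketch using a shortened version of one of the paths from the question. The variable names (at_start, anywhere, whole) are only illustrative, and fullmatch is included for comparison because it is the method that requires the whole string to match.

```python
import re

# Shortened version of one of the question's sample paths.
path = "activewear/2356_15890_9397775? povid=ApparelNav"
pattern = re.compile(r"[0-9]+(?:_[0-9]+)+")

# match() only tries at position 0, where the text is "activewear", so it fails.
at_start = pattern.match(path)
# search() scans forward until it finds the first position that matches.
anywhere = pattern.search(path)
# fullmatch() requires the entire string to match the pattern.
whole = pattern.fullmatch(path)

print(at_start)          # None
print(anywhere.group())  # 2356_15890_9397775
print(whole)             # None
```

This is why the original function returns None: the path does not start with digits, so match has nothing to match at position 0.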
You could simplify your expression to ((\d+_?)+)\?, which will find a repeating sequence of one or more digits (\d), each optionally followed by an underscore, and ultimately ended with a question mark.
example:
import re
#do this once
pathid = re.compile(r'((\d+_?)+)\?')
def extract_id(path:str) -> str:
    if m := pathid.search(path): #make sure there is a match
        return m.group(1) #return match from group 1 `((\d+_?)+)`
    return None #no match
#use
path = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)
#proof
print(result) #5438_133195_9917949_1218833
Your id comes after the last / and before the ?. The solution below will likely be much faster. It doesn't search by pattern; it prunes by position.
def extract_id(path:str) -> str:
    #right of the last / to left of the ?
    return path.split('/')[-1].split('?')[0]
#use
path = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)
#proof
print(result) #5438_133195_9917949_1218833
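If you want to check the speed claim on your own machine rather than take it on faith, a rough sketch with timeit could look like the following. The names extract_by_regex, extract_by_split, PATH and PATH_ID are made up for this comparison, and the timings will vary by machine and Python version, so no numbers are shown.

```python
import re
import timeit

PATH = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
PATH_ID = re.compile(r"((\d+_?)+)\?")  # compiled once, as suggested above

def extract_by_regex(path):
    # Pattern-based: returns None when nothing that looks like an id is found.
    m = PATH_ID.search(path)
    return m.group(1) if m else None

def extract_by_split(path):
    # Position-based: right of the last / to left of the first ?
    return path.split("/")[-1].split("?")[0]

# Time 100,000 calls of each approach on the same input.
regex_time = timeit.timeit(lambda: extract_by_regex(PATH), number=100_000)
split_time = timeit.timeit(lambda: extract_by_split(PATH), number=100_000)
print(f"regex: {regex_time:.3f}s  split: {split_time:.3f}s")
```

One trade-off to keep in mind: the split version does no validation. For an input with no id, such as "no id here", it returns the whole string, while the regex version returns None.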