I'm trying to extract the first two numbers from links like these:
https://primer.text.com/sdfg/8406758680-345386743-DSS1-S JasdOdsfrIwetds-Osdgf/
https://primer.text.com/sdfg/8945879094-849328844-DPE-S JsdfeOIert-Isdfu/
https://primer.text.com/sdfg/8493093053-292494834-QW23#Wsdfg#IprfdUiojn2Asdfg-Werts/
The output should be like this:
id1 = ['8406758680', '8945879094','8493093053']
id2 = ['345386743', '849328844', '292494834']
I'm trying to do this using the re module. Please tell me how to do it.
This is the code snippet I have so far:
def GetUrlClassId(UrlInPut):
    # Collect the first run of digits in the URL
    ClassID = ''
    for i in UrlInPut:
        if i.isdigit():
            ClassID += i
        elif ClassID != '':
            return int(ClassID)
    return ""

def GetUrlInstanceID(UrlInPut):
    # Collect the first run of digits that follows the first '-'
    InstanceId = ''
    ClassID = 0
    for i in UrlInPut:
        if i.isdigit() and ClassID == 1:
            InstanceId += i
        elif InstanceId != '':
            return int(InstanceId)
        if i == '-':
            ClassID = 1
    return ""
I don't want to use something like this. I would like to use regular expressions.
CodePudding user response:
The regex pattern: /(\d{10})-(\d{9})
The parentheses capture the two groups of digits, and {n} matches exactly n repetitions of the preceding token (see the re module documentation).
import re
# the example URLs as a list (splitting one string on whitespace would break URLs that contain a space)
urls = [
    'https://primer.text.com/sdfg/8406758680-345386743-DSS1-S JasdOdsfrIwetds-Osdgf/',
    'https://primer.text.com/sdfg/8945879094-849328844-DPE-S JsdfeOIert-Isdfu/',
    'https://primer.text.com/sdfg/8493093053-292494834-QW23#Wsdfg#IprfdUiojn2Asdfg-Werts/',
]
# one (id1, id2) tuple per URL
ids = [re.search(r'/(\d{10})-(\d{9})', url).groups() for url in urls]
# transpose the pairs into one tuple per id
print(list(zip(*ids)))
Output
[('8406758680', '8945879094', '8493093053'), ('345386743', '849328844', '292494834')]
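If the result has to be two lists named id1 and id2, as in the question, the tuples produced by zip can be unpacked and converted. The following is only a sketch: the helper name extract_ids is made up for illustration, and it assumes that URLs without a match should be skipped rather than raise an error (re.search returns None in that case).

```python
import re

# Same pattern as above: a slash, 10 digits, a hyphen, 9 digits.
ID_PATTERN = re.compile(r'/(\d{10})-(\d{9})')

def extract_ids(urls):
    """Return two lists (id1, id2); URLs without a match are skipped.

    extract_ids is a hypothetical helper, not part of the re module.
    """
    pairs = []
    for url in urls:
        match = ID_PATTERN.search(url)
        if match is not None:  # re.search returns None when nothing matches
            pairs.append(match.groups())
    if not pairs:
        return [], []
    # zip(*pairs) transposes [(a1, b1), (a2, b2)] into (a1, a2), (b1, b2)
    id1, id2 = (list(column) for column in zip(*pairs))
    return id1, id2

urls = [
    'https://primer.text.com/sdfg/8406758680-345386743-DSS1-S JasdOdsfrIwetds-Osdgf/',
    'https://primer.text.com/sdfg/8945879094-849328844-DPE-S JsdfeOIert-Isdfu/',
    'https://primer.text.com/sdfg/8493093053-292494834-QW23#Wsdfg#IprfdUiojn2Asdfg-Werts/',
]
id1, id2 = extract_ids(urls)
print(id1)
print(id2)
```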
CodePudding user response:
With regex, you can do a literal match on the base URL and then capture two groups of digits using \d+ (\d matches 0-9, and + matches one or more of the preceding token). re.findall returns a list of the matching groups.
import re
l1 = "https://primer.text.com/sdfg/8406758680-345386743-DSS1-S JasdOdsfrIwetds-Osdgf/"
l2 = "https://primer.text.com/sdfg/8945879094-849328844-DPE-S JsdfeOIert-Isdfu/"
l3 = "https://primer.text.com/sdfg/8493093053-292494834-QW23#Wsdfg#IprfdUiojn2Asdfg-Werts/"
for l in [l1, l2, l3]:
    # the dots in the host name are escaped so they match literally
    result = re.findall(r'https://primer\.text\.com/sdfg/(\d+)-(\d+)', l)
    print(result)
Output:
[('8406758680', '345386743')]
[('8945879094', '849328844')]
[('8493093053', '292494834')]
From here, reformatting into your desired data structure should be simple enough (use zip or something).
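One possible way to do that reformatting, sketched under the assumption that at least one link matches: gather the tuples returned by re.findall into a single list, then transpose it with zip. The names links, pairs, id1 and id2 are chosen here to mirror the question and are not required by anything.

```python
import re

links = [
    "https://primer.text.com/sdfg/8406758680-345386743-DSS1-S JasdOdsfrIwetds-Osdgf/",
    "https://primer.text.com/sdfg/8945879094-849328844-DPE-S JsdfeOIert-Isdfu/",
    "https://primer.text.com/sdfg/8493093053-292494834-QW23#Wsdfg#IprfdUiojn2Asdfg-Werts/",
]

# \d+ captures one or more digits; the dots are escaped to match literally.
pattern = re.compile(r'https://primer\.text\.com/sdfg/(\d+)-(\d+)')

# findall returns a list of (first, second) tuples for each link.
pairs = []
for link in links:
    pairs.extend(pattern.findall(link))

# Transpose the pairs; this unpacking fails if pairs is empty.
id1, id2 = map(list, zip(*pairs))
print(id1)
print(id2)
```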