Regex find URL with the first extension

Time:03-19

I have a string like the one below:

string = 'https://somewebsite.com/itesr0824/products/YYA-002/fdrop-tQ?position=6'

How can I find

'https://somewebsite.com/itesr0824'

with regex?

I've tried

re.sub('[^https://somewebsite.com/[a-zA-Z0-9].+$','',string)

but it only finds

'https://somewebsite.com/itesr0824/products/YYA'

CodePudding user response:

Why on Earth would you use regular expressions for this when Python has a built-in URL parser? Don't reinvent the wheel and take on all the weird edge cases URLs can present; use urllib.parse.urlparse() and urllib.parse.urljoin() instead:

import urllib.parse
string = "https://somewebsite.com/itesr0824/products/YYA-002/fdrop-tQ?position=6"
parsedURL = urllib.parse.urlparse(string)
trimmedURL = urllib.parse.urljoin(parsedURL.scheme + "://" + parsedURL.netloc, parsedURL.path.split("/")[1]) # 'https://somewebsite.com/itesr0824'
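
If this is needed in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name first_segment_url is made up for illustration, and it assumes that a URL with no path segment should come back as just the scheme and host.

```python
from urllib.parse import urlsplit, urlunsplit

def first_segment_url(url):
    """Return scheme://host/first-path-segment, dropping the rest of the path, the query, and the fragment."""
    parts = urlsplit(url)
    # Skip empty pieces so a leading or doubled slash does not yield an empty segment.
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: with no path segment at all, return only scheme and host.
    first = "/" + segments[0] if segments else ""
    return urlunsplit((parts.scheme, parts.netloc, first, "", ""))

print(first_segment_url("https://somewebsite.com/itesr0824/products/YYA-002/fdrop-tQ?position=6"))
```

Unlike indexing split("/")[1] directly, this version does not raise an IndexError when the URL has an empty path.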

CodePudding user response:

import re

string = 'https://somewebsite.com/itesr0824/products/YYA-002/fdrop-tQ?position=6'

x = re.sub(r'https://somewebsite\.com/\w+', '', string)

# where:
# \w - matches any letter, digit or underscore. Equivalent to [a-zA-Z0-9_]
# +  - one or more

print(x)

Prints

/products/YYA-002/fdrop-tQ?position=6
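
Note that re.sub here removes the prefix and leaves the remainder, while the question asks for the prefix itself. One possible way to get it with a regex is to match instead of substitute. This is a sketch that assumes the host is always somewebsite.com and that the first path segment ends at the next slash, question mark, or hash:

```python
import re

string = 'https://somewebsite.com/itesr0824/products/YYA-002/fdrop-tQ?position=6'

# [^/?#]+ matches one or more characters up to the next slash, query, or fragment.
match = re.match(r'https://somewebsite\.com/[^/?#]+', string)

# re.match returns None when the string does not start with the pattern.
prefix = match.group(0) if match else None

print(prefix)  # https://somewebsite.com/itesr0824
```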