I want to regex a list of URLs.
The links format looks like this:
https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043_130222322842_553881409427_kwd-1434230410787_m__g_&gclid=Cj0KCQiAxc6PBhCEARIsAH8Hff2k3IHDPpViVTzUfxx4NRD-fSsfWkCDT-ywLPY2C6OrdTP36x431QsaAt2dEALw_wcB
The part I need:
https://sales-office.ae/axcapital/damaclagoons/
I used to use this:
re.findall('://([\w\-\.] )', URL)
However, it gets me this:
sales-office.ae
Can you help, please?
CodePudding user response:
Instead of looking for \w
etc. which would only match the domain, you're effectively looking for anything up to where the URL arguments start (the first ?
):
re.search(r'[^?]*', URL)
This means: from the beginning of the string (search
), all characters that are not ?
. A character class beginning with ^
negates the class, i.e. not matching instead of matching.
This gives you a match object, where [0]
will be the URL you're looking for.
CodePudding user response:
You can do that wihtout using regex by leveraging urllib.parse.urlparse
from urllib.parse import urlparse
url = "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043_130222322842_553881409427_kwd-1434230410787_m__g_&gclid=Cj0KCQiAxc6PBhCEARIsAH8Hff2k3IHDPpViVTzUfxx4NRD-fSsfWkCDT-ywLPY2C6OrdTP36x431QsaAt2dEALw_wcB"
parsed_url = urlparse(url)
print(f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}")
Outputs
CodePudding user response:
Based on your example, this looks like it would work:
\w ://\S \.\w \/\S \/
CodePudding user response:
Based on: How to match "anything up until this sequence of characters" in a regular expression?
. ?(?=\?)
so:
re.findall(". ?(?=\?)", URL)