How to regex this link?

I want to regex a list of URLs.
The links format looks like this:
https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043_130222322842_553881409427_kwd-1434230410787_m__g_&gclid=Cj0KCQiAxc6PBhCEARIsAH8Hff2k3IHDPpViVTzUfxx4NRD-fSsfWkCDT-ywLPY2C6OrdTP36x431QsaAt2dEALw_wcB

The part I need:
https://sales-office.ae/axcapital/damaclagoons/

I used to use this:
re.findall('://([\w\-\.]+)', URL)

However, it gets me this:
sales-office.ae

Can you help, please?

CodePudding user response：

Instead of looking for \w etc. which would only match the domain, you're effectively looking for anything up to where the URL arguments start (the first ?):

re.search(r'[^?]*', URL)

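This means: starting at the beginning of the string, match all characters that are not ?. (re.search takes the first position where the pattern can match, and this pattern can already match at position 0.) A character class beginning with ^ is negated, i.e. it matches any character that is not listed in it.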

This gives you a match object, where [0] will be the URL you're looking for.
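
As a minimal sketch of the above (the query string is shortened for readability, and the helper name strip_query is only an illustration, not part of the question):

```python
import re

def strip_query(url: str) -> str:
    """Return everything before the first '?' (hypothetical helper name)."""
    # [^?]* can match an empty string, so re.search never returns None here.
    match = re.search(r"[^?]*", url)
    return match[0]

# Shortened version of the URL from the question (assumption: the real query is longer).
url = "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043&gclid=abc123"
print(strip_query(url))  # https://sales-office.ae/axcapital/damaclagoons/
```

A URL without a ? comes back unchanged, since the class then matches the whole string.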

CodePudding user response：

You can do that without using regex by leveraging urllib.parse.urlparse:

from urllib.parse import urlparse

url = "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043_130222322842_553881409427_kwd-1434230410787_m__g_&gclid=Cj0KCQiAxc6PBhCEARIsAH8Hff2k3IHDPpViVTzUfxx4NRD-fSsfWkCDT-ywLPY2C6OrdTP36x431QsaAt2dEALw_wcB"

parsed_url = urlparse(url)
print(f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}")

Outputs

https://sales-office.ae/axcapital/damaclagoons/
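
Since the question mentions a list of URLs, the same idea can be wrapped in a small function and applied to each entry. This is only a sketch: the name drop_query and the example.com entries are made up for illustration.

```python
from urllib.parse import urlparse

def drop_query(url: str) -> str:
    """Rebuild the URL from scheme, host and path only (hypothetical helper name)."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"

# Illustrative input list; only the first entry resembles the URL from the question.
urls = [
    "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043&gclid=abc123",
    "https://example.com/page#top",
    "https://example.com/plain/",
]

cleaned = [drop_query(u) for u in urls]
for c in cleaned:
    print(c)
```

Unlike a pattern that stops at the first ?, this also drops a #fragment, because urlparse stores it separately from the path.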

CodePudding user response：

Based on your example, this looks like it would work:

\w+://\S+\.\w+\/\S+\/
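
A quick check of that pattern, assuming the quantifiers are + as written above (the query string is shortened, and the example.com URL is made up to show a limitation):

```python
import re

# Scheme, host with at least one dot, then at least one path segment ending in '/'.
PATTERN = re.compile(r"\w+://\S+\.\w+/\S+/")

url = "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043&gclid=abc123"
match = PATTERN.search(url)
print(match.group(0) if match else None)

# A URL with no path segment after the host does not match at all.
no_path = PATTERN.search("https://example.com/")
print(no_path)
```

Because \S+ is greedy, the match runs up to the last / in the string, so this relies on the query string not containing a slash of its own.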

CodePudding user response：

Based on: How to match "anything up until this sequence of characters" in a regular expression?

.+?(?=\?)

so:

re.findall(r".+?(?=\?)", URL)
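
One thing to watch: findall returns a list, and that list is empty when the URL has no ? at all, because the lookahead can never succeed. A small sketch that falls back to the original URL in that case (the name before_query is just for illustration, and the query string is shortened):

```python
import re

def before_query(url: str) -> str:
    """Text before the first '?', or the whole URL if there is none (hypothetical helper)."""
    # .+? is lazy, so it stops at the first position that is followed by '?'.
    found = re.findall(r".+?(?=\?)", url)
    return found[0] if found else url

url = "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043&gclid=abc123"
print(before_query(url))
print(before_query("https://example.com/no-query/"))
```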