I would like to extract the urls in a string such as
"Here are the urls https://user:[email protected]:8080http://142.42.1.1/"
Notice that there are multiple urls right next to each other, with literally no gap between them. So far I have tried
PATTERN_URL = r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'
re.findall(PATTERN_URL, value)
It results in ["https://user:[email protected]:8080http://142.42.1.1/"], but the regex is close to what I want: if value
had a space between the urls, it would match both of them: ["https://user:[email protected]:8080", "http://142.42.1.1/"]
I was thinking of a potentially horrific solution: append to the pattern a negative lookahead for the pattern itself. That is probably overkill, and I suspect there is a better way.
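For reference, the lookahead idea does not need the whole pattern repeated. A minimal sketch, assuming every url starts with http:// or https:// (the sample string, TEMPERED_URL and extract_urls are hypothetical names for illustration, and the credentials and host are placeholders):

```python
import re

# Hypothetical sample input; the credentials and host are placeholders.
value = "Here are the urls https://user:pass@example.com:8080http://142.42.1.1/"

# Match a scheme, then consume non-space characters one at a time,
# refusing any character that begins a new "http://" or "https://".
TEMPERED_URL = re.compile(r"https?://(?:(?!https?://)\S)+")


def extract_urls(text):
    """Return every url in text, splitting urls that touch each other."""
    return TEMPERED_URL.findall(text)


print(extract_urls(value))
# ['https://user:pass@example.com:8080', 'http://142.42.1.1/']
```

This does not validate the urls; it only decides where one ends and the next begins.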
CodePudding user response:
This should do the trick. It matches everything after http, https or www until it encounters one of those again, a whitespace character, or the end of the line.
((?:https?|www).*?(?=http|www|\s|$))
Note:
This regex will also match invalid urls, but will not be able to match urls that do not start with http, https or www.
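A minimal sketch of how this pattern could be used from Python, assuming the input looks like the one in the question (the sample string, PATTERN and find_urls are hypothetical names for illustration, and the credentials and host are placeholders):

```python
import re

# Hypothetical sample input; the credentials and host are placeholders.
value = "Here are the urls https://user:pass@example.com:8080http://142.42.1.1/"

# The pattern from this answer: lazily consume characters until the next
# "http", "www", whitespace character, or end of the string.
PATTERN = re.compile(r"((?:https?|www).*?(?=http|www|\s|$))")


def find_urls(text):
    """Return the text captured by the single group for each match."""
    return PATTERN.findall(text)


print(find_urls(value))
# ['https://user:pass@example.com:8080', 'http://142.42.1.1/']
```

One caveat: because "www" also ends a match, a url such as https://www.example.com is cut into 'https://' and 'www.example.com'.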
CodePudding user response:
I have taken a different approach but it works:
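exstr = "Here are the urls https://user:[email protected]:8080http://142.42.1.1/"
# Splitting on 'http' removes the prefix, so it has to be added back.
throw_list = exstr.split('http')
result = []
x = 0
while x < len(throw_list):
    # Skip the first piece: it is the text before the first url.
    if x != 0:
        thr_var = 'http' + throw_list[x]
        result.append(thr_var)
    x = x + 1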
The result:
['https://user:[email protected]:8080', 'http://142.42.1.1/']
CodePudding user response:
If all of the urls start with "http", you can do:
import re
test = "Here are the urls https://user:[email protected]:8080http://142.42.1.1/"
urls_joined = re.search(r"http.+", test).group()  # extract only the part containing the urls
urls = ["http" + u for u in urls_joined.split("http") if u]
Contents of urls:
['https://user:[email protected]:8080', 'http://142.42.1.1/']
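Both split-based answers cut on the bare text "http", so they would also cut a url that contains "http" in its path. A possible variation, sketched here under the assumption that every url starts with http:// or https://, is to split on a zero-width lookahead instead (this needs Python 3.7 or later, where re.split accepts patterns that match an empty string; the sample strings and split_urls are hypothetical names for illustration):

```python
import re

# Hypothetical sample input; the credentials and host are placeholders.
test = "Here are the urls https://user:pass@example.com:8080http://142.42.1.1/"


def split_urls(text):
    """Split text in front of every scheme and keep the url-looking pieces."""
    # The lookahead matches the empty string right before each scheme,
    # so nothing is removed from the text and no prefix must be re-added.
    parts = re.split(r"(?=https?://)", text)
    # Keep only pieces that start with a scheme, and drop any trailing
    # words by taking the first whitespace-delimited token.
    return [p.split()[0] for p in parts if p.startswith("http")]


print(split_urls(test))
# ['https://user:pass@example.com:8080', 'http://142.42.1.1/']
```

With this version, a string like "see http://a.com/http-guide and https://b.org" yields ['http://a.com/http-guide', 'https://b.org'] rather than being cut at "http-guide".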