Find all urls, handling ones with no space between

Time:01-02

I would like to extract the urls in a string such as

"Here are the urls https://user:[email protected]:8080http://142.42.1.1/"

Notice that there are multiple URLs right next to each other (with literally no gap between them). So far I have tried:

PATTERN_URL = r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'

re.findall(PATTERN_URL, value)

It results in ["https://user:[email protected]:8080http://142.42.1.1/"], but the regex is close to what I want: if value had a space between the URLs, it would match both of them: ["https://user:[email protected]:8080", "http://142.42.1.1/"]

I was thinking of a potentially horrific solution: appending a negative lookahead for the pattern itself to the end of the pattern. That is probably overkill, and I suspect there is a better way.

CodePudding user response:

This should do the trick. It matches everything after http, https or www until it encounters any of them again, a whitespace character, or the end of the line.

((?:https?|www).*?(?=http|www|\s|$))


Note:

This regex will also match invalid urls, but will not be able to match urls that do not start with http, https or www.
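As a rough sketch, the pattern above could be used from Python like this. The sample string and the names PATTERN, text and urls are made up for illustration and are not from the question.

```python
import re

# Pattern from the answer above (outer capture group dropped so that
# findall returns plain strings): lazily match from a URL prefix up to
# the next prefix, a whitespace character, or the end of the string.
PATTERN = re.compile(r"(?:https?|www).*?(?=http|www|\s|$)")

# Hypothetical sample input, not the string from the question.
text = "Here are the urls https://example.com:8080http://142.42.1.1/ and www.example.org"

urls = PATTERN.findall(text)
print(urls)
# ['https://example.com:8080', 'http://142.42.1.1/', 'www.example.org']
```

One thing to watch for: because the lookahead also stops at www, a URL such as https://www.example.com is cut into "https://" and "www.example.com". If that matters for your data, the www alternative would need to be removed or tightened.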

CodePudding user response:

I have taken a different approach but it works:

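exstr = "Here are the urls https://user:[email protected]:8080http://142.42.1.1/"
throw_list = exstr.split('http')  # first piece is the text before the first URL
result = []
x = 0

while x < len(throw_list):
    if x != 0:  # skip the leading non-URL text
        thr_var = 'http' + throw_list[x]  # restore the prefix removed by split()
        result.append(thr_var)
    x = x + 1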

The result:

['https://user:[email protected]:8080', 'http://142.42.1.1/']

CodePudding user response:

If all of the urls start with "http", you can do:

import re

test = "Here are the urls https://user:[email protected]:8080http://142.42.1.1/"
urls_joined = re.search(r"http.+", test).group()  # extract only the URL part

# split() drops the "http" separator, so add it back to every non-empty piece
urls = ["http" + u for u in urls_joined.split("http") if u]

Contents of the urls:

['https://user:[email protected]:8080', 'http://142.42.1.1/']
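
A possible variation on the split-based answers above, offered only as a sketch: split on a zero-width lookahead for "http://" or "https://" instead of the bare string "http", so the prefix does not have to be glued back on and an "http" that appears inside a path is left alone. The function name split_joined_urls and the sample input are hypothetical, and this assumes every URL starts with one of those two schemes.

```python
import re

def split_joined_urls(text):
    """Return URLs found in text, splitting where a new scheme begins."""
    urls = []
    # Work on whitespace-separated tokens so surrounding prose is dropped.
    for token in text.split():
        # Zero-width split before each "http://" or "https://"
        # (splitting on empty matches needs Python 3.7 or newer).
        for part in re.split(r"(?=https?://)", token):
            if part.startswith(("http://", "https://")):
                urls.append(part)
    return urls

# Hypothetical sample input, not the string from the question.
sample = "Here are the urls https://www.example.com:8080http://142.42.1.1/"
print(split_joined_urls(sample))
```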