Find all the url links in a html text using regex Arguments. below text assigned to html vaiable.
html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""
output should be like that:
["/relative/path","//other.host/same-protocol","https://example.com"]
The function should ignore fragment identifiers (link targets that begin with #). I.e., if the url points to a specific fragment/section using the hash symbol, the fragment part (the part starting with #) of the url should be stripped before it is returned by the function
//I have tried this bellow one but not working its only give output: ["https://example.com"]
urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2})) ', html)
print(urls)
CodePudding user response:
You could try using positive lookbehind to find the quoted strings in front of href=
in html
pattern = re.compile(r'(?<=href=\")(?!#)(. ?)(?=#|\")')
urls = re.findall(pattern, html)
See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall
CodePudding user response:
from typing import List
html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""
href_prefix = "href=\""
def get_links_from_html(html: str, result: List[str] = None) -> List[str]:
if result == None:
result = []
is_splitted, _, rest = html.partition(href_prefix)
if not is_splitted:
return result
link = rest[:rest.find("\"")].partition("#")[0]
if link:
result.append(link)
return get_links_from_html(rest, result)
print(get_links_from_html(html))