Finding href path values with a Python regex

Find all the URL links in an HTML text using a regex. The text below is assigned to the html variable.

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""

The output should look like this:

["/relative/path","//other.host/same-protocol","https://example.com"]

The function should ignore fragment-only links (link targets that begin with #). If a URL points to a specific fragment/section using the hash symbol, the fragment part (everything from the # onward) should be stripped before the URL is returned by the function.

I tried the code below, but it does not work; it only outputs ["https://example.com"]:

import re
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
print(urls)

CodePudding user response：

You could use a positive lookbehind to match the quoted string that follows href=" in the HTML, together with a negative lookahead that skips values starting with #:

import re
pattern = re.compile(r'(?<=href=")(?!#)(.+?)(?=#|")')
urls = pattern.findall(html)
print(urls)

See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall
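As a rough breakdown of how the lookaround pattern is meant to work, here is a commented sketch using re.VERBOSE. It assumes the quantifier in the capture group is the lazy `.+?` and that attribute values are always double-quoted; the name HREF_PATTERN is only illustrative. In verbose mode a bare # starts a comment, so the literal hash is written as \#.

```python
import re

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""

# HREF_PATTERN is a hypothetical name used for this sketch only.
HREF_PATTERN = re.compile(
    r'''
    (?<=href=")   # lookbehind: the match must directly follow href="
    (?!\#)        # negative lookahead: reject values that start with a hash
    (.+?)         # capture as few characters as possible ...
    (?=\#|")      # ... stopping before the first hash or the closing quote
    ''',
    re.VERBOSE,
)

urls = HREF_PATTERN.findall(html)
print(urls)
# ['/relative/path', '//other.host/same-protocol', 'https://example.com']
```

Note that this sketch does not handle single-quoted or unquoted attribute values, and it would also match attributes whose names merely end in href (for example data-href).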

CodePudding user response：

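from typing import List, Optional

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""

href_prefix = 'href="'


def get_links_from_html(html: str, result: Optional[List[str]] = None) -> List[str]:
    # Create the accumulator on the first call (avoids a mutable default argument).
    if result is None:
        result = []

    # partition() returns an empty separator when href_prefix is not found.
    _, separator, rest = html.partition(href_prefix)

    if not separator:
        return result

    # Take the text up to the closing quote, then drop any "#fragment" suffix.
    link = rest.partition('"')[0].partition("#")[0]

    # Fragment-only links become empty strings and are skipped.
    if link:
        result.append(link)
    # Recurse on the remainder to collect the following links.
    return get_links_from_html(rest, result)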


print(get_links_from_html(html))
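
The function above recurses once per link, so a page with more links than Python's recursion limit (1000 frames by default) would raise RecursionError. A minimal iterative sketch of the same partition-based idea is shown below; the name iter_links and its prefix parameter are assumptions made for illustration, not part of the original answer.

```python
from typing import List

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""


def iter_links(html: str, prefix: str = 'href="') -> List[str]:
    """Collect href values with a loop instead of recursion (illustrative sketch)."""
    links: List[str] = []
    rest = html
    while True:
        # Skip ahead to the text that follows the next href=" occurrence.
        _, separator, rest = rest.partition(prefix)
        if not separator:
            # No further href=" in the remaining text.
            return links
        # Split off the attribute value at the closing double quote.
        value, _, rest = rest.partition('"')
        # Drop the fragment; fragment-only links become empty and are skipped.
        link = value.partition("#")[0]
        if link:
            links.append(link)


links = iter_links(html)
print(links)
# ['/relative/path', '//other.host/same-protocol', 'https://example.com']
```

Like the recursive version, this assumes double-quoted attribute values and does plain string matching rather than real HTML parsing.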