Python - Converting HTML hyperlinks to formatted plain text

How do I use Python to convert HTML containing hyperlinks, like this, into formatted plain text:

<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>

My current code looks like this, but the package doesn't seem to do the job: it just converts the basic HTML text elements to plain text, without the link:

from html2text import html2text

text = html2text("<p>Hello world, it's <a href='https://google.com'>foo bar time</a></p>")
print(text)

# Result I wanted: "Hello world, it's foo bar time - https://google.com/"
# Result I got: "Hello world, it's foo bar time"

Would really help out if a solution is found.

CodePudding user response：

You can take a look at html.parser; this standard-library module should be enough for your needs.

Example from the documentation:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

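parser = MyHTMLParser()
# Feed the parser some HTML to trigger the handlers above
parser.feed("<p>Hello world, it's <a href='https://google.com'>foo bar time</a></p>")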

CodePudding user response：

You can use Beautiful Soup (bs4 package)

from bs4 import BeautifulSoup

spam = """<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>
<p>Hello world, it's <a href="https://stackoverflow.com">spam eggs</a></p>"""

soup = BeautifulSoup(spam, 'html.parser')

for a_tag in soup.find_all('a'):
    a_tag.replace_with(f"{a_tag.text} - {a_tag.get('href')}")

print(soup.text)

Output

Hello world, it's foo bar time - https://google.com
Hello world, it's spam eggs - https://stackoverflow.com

Note, you can work from here. Look at tag.replace_with() and tag.unwrap() in the Beautiful Soup docs.

CodePudding user response：

You can use the BeautifulSoup module.

from bs4 import BeautifulSoup

html = "<p>Hello world, it's <a href='https://google.com'>foo bar time</a></p>"
soup = BeautifulSoup(html, features="html.parser")

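text = soup.get_text()       # all visible text, with the tags stripped
url_part = soup.find('a')    # first <a> tag only
url_str = url_part['href']

# Note: this appends a single URL at the end, so it only suits HTML with one link
print(f"{text} - {url_str}")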

To import the module, you need to install it first:

pip install beautifulsoup4