How do I convert HTML containing hyperlinks, like the snippet below, into plain text with Python while keeping the link URLs?
<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>
My current code looks like this, but the package doesn't seem to do the job: it only converts the main HTML text elements to plain text and drops the link:
from html2text import html2text
text = html2text('<p>Hello world, it\'s <a href="https://google.com">foo bar time</a></p>')
print(text)
# Result I wanted: "Hello world, it's foo bar time - https://google.com/"
# Result I got: "Hello world, it's foo bar time"
Any solution would be a great help.
CodePudding user response:
You can take a look at html.parser; this standard-library module should cover your needs.
Example from the documentation:
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

    def handle_comment(self, data):
        print("Comment :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent :", c)

    def handle_decl(self, data):
        print("Decl :", data)

parser = MyHTMLParser()
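Applied to the question, the same handler methods can build the desired string instead of printing each event. This is only a minimal sketch, not part of the documentation example: the names LinkTextParser and links_to_text are made up for illustration, and it assumes every link has an href attribute worth showing.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Hypothetical helper: collects text and appends ' - <href>' after each link."""

    def __init__(self):
        super().__init__()
        self.parts = []        # pieces of the output text, in document order
        self.href_stack = []   # href of each <a> tag that is currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(f" - {href}")

    def handle_data(self, data):
        self.parts.append(data)


def links_to_text(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(links_to_text('<p>Hello world, it\'s <a href="https://google.com">foo bar time</a></p>'))
# Hello world, it's foo bar time - https://google.com
```

Because the URL is appended when the closing </a> tag is seen, it always lands directly after the link text, and the input can contain any number of links.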
CodePudding user response:
You can use Beautiful Soup (the bs4 package):
from bs4 import BeautifulSoup
spam = """<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>
<p>Hello world, it's <a href="https://stackoverflow.com">spam eggs</a></p>"""
soup = BeautifulSoup(spam, 'html.parser')
for a_tag in soup.find_all('a'):
    a_tag.replace_with(f"{a_tag.text} - {a_tag.get('href')}")
print(soup.text)
Output:
Hello world, it's foo bar time - https://google.com
Hello world, it's spam eggs - https://stackoverflow.com
You can adapt this to your needs from here. See tag.replace_with() and tag.unwrap() in the Beautiful Soup documentation.
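To make the difference between those two methods concrete, here is a small sketch, assuming beautifulsoup4 is installed. unwrap() removes the tag but keeps its contents, so the URL is lost, while replace_with() swaps the whole tag for whatever string you build. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = '<p>Hello world, it\'s <a href="https://google.com">foo bar time</a></p>'

# unwrap(): drop the <a> tag itself but keep the text inside it
soup = BeautifulSoup(html, "html.parser")
soup.a.unwrap()
unwrapped = soup.get_text()

# replace_with(): substitute the whole <a> tag with a string of our choosing
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.a
a_tag.replace_with(f"{a_tag.get_text()} - {a_tag.get('href')}")
replaced = soup.get_text()

print(unwrapped)  # Hello world, it's foo bar time
print(replaced)   # Hello world, it's foo bar time - https://google.com
```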
CodePudding user response:
You can use the BeautifulSoup module. Note that this version only picks up the first link in the HTML:
from bs4 import BeautifulSoup
html = "<p>Hello world, it's <a href='https://google.com'>foo bar time</a></p>"
soup = BeautifulSoup(html, features="html.parser")
text = soup.get_text()
url_part = soup.find('a')
url_str = url_part['href']
print(f"{text} - {url_str}")
To import the module, you first need to install it:
pip install beautifulsoup4