I am fetching some HTML, so I will have an entire page of HTML as a single string.
I want to extract multiple substrings from this HTML, each wrapped in a common delimiter, say --text--,
so I would want the "text" part.
What is the quickest and most efficient way to do this?
CodePudding user response:
You can use HTMLParser from the html.parser module in Python's standard library:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called once for each run of text found between tags.
    def handle_data(self, data):
        print("data:", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Page Title</title></head>'
            '<body><h1>some text</h1></body></html>')
The output will be:
data: Page Title
data: some text
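If you need the text as data rather than printed output, one option is to collect it in a list. This is only a minimal sketch: TextCollector and extract_text are hypothetical names, not part of the standard library, and skipping whitespace-only chunks is an assumption about what you want.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every non-blank text node into a list."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.chunks.append(data)


def extract_text(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return collector.chunks


print(extract_text("<html><body><h1>some text</h1><p>--text--</p></body></html>"))
# ['some text', '--text--']
```

The returned strings can then be searched for the --text-- delimiter without the tags getting in the way.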
CodePudding user response:
The re module is a good match for this if your text looks like the following:
import re

html_text = """ a bunch of text i dont care about
-- something important -- more stuff i dont care about -- something else
important --"""
# Non-greedy match between each pair of "--"; re.DOTALL lets "." span newlines.
matches = re.findall(r"--(.*?)--", html_text, re.DOTALL)
print(matches)
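Note that findall returns each chunk with its surrounding whitespace and any embedded newlines intact. If that is not what you want, a sketch along these lines may help. DELIMITED and extract_delimited are hypothetical names, and collapsing whitespace is an assumption about the desired result.

```python
import re

# Assumed delimiter from the question; compiled once so it can be reused.
DELIMITED = re.compile(r"--(.*?)--", re.DOTALL)


def extract_delimited(text):
    """Hypothetical helper: return each delimited chunk with whitespace collapsed."""
    return [" ".join(m.split()) for m in DELIMITED.findall(text)]


sample = """ a bunch of text i dont care about
-- something important -- more stuff i dont care about -- something else
important --"""
print(extract_delimited(sample))
# ['something important', 'something else important']
```

The text between the second and third "--" is skipped because findall consumes delimiters in pairs.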
If instead you want to grab all the text inside some HTML tag,
then you should use something like BeautifulSoup (the third-party bs4 package):
import bs4
html_text = "<a class='interesting'>cool</a><div class='boring'>asd</div><div class='interesting'>beans</div>"
soup = bs4.BeautifulSoup(html_text, features="html.parser")
# class_ matches any tag whose class attribute includes "interesting".
for e in soup.find_all(class_="interesting"):
    print(e.text)
Lastly, if it is more like a--list--of--cool--stuff--with--delimiter,
then you should probably just use my_string.split("--").
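As a quick illustration of the split approach (the wrapped string is an assumed example based on the question's --text-- format):

```python
my_string = "a--list--of--cool--stuff--with--delimiter"
parts = my_string.split("--")
print(parts)  # ['a', 'list', 'of', 'cool', 'stuff', 'with', 'delimiter']

# A leading or trailing delimiter produces empty strings, so filter them out.
wrapped = "--text--"
non_empty = [p for p in wrapped.split("--") if p]
print(non_empty)  # ['text']
```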