Best way to grab multiple strings from text?

I am fetching some HTML, so I will have a string containing an entire page of HTML.

I want to grab multiple substrings from this HTML, each wrapped in a common delimiter, say --text--

so I would want the "text" part.

What is the quickest and most efficient way to do this?

CodePudding user response：

You can use HTMLParser from Python's standard library:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # called once for each run of text found between tags
    def handle_data(self, data):
        print("data:", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Page Title</title></head>'
            '<body><h1>some text</h1></body></html>')

The output will be:
data: Page Title
data: some text
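If the text is needed for further processing rather than printing, the same idea can collect it into a list. This is only a minimal sketch: the names TextCollector and collect_text are made up for illustration, and it assumes whitespace-only text nodes are not wanted.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every non-empty text node (hypothetical helper, not part of the stdlib)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.chunks.append(text)


def collect_text(html):
    """Feed an HTML string to the parser and return the collected text nodes."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return parser.chunks


print(collect_text('<html><head><title>Page Title</title></head>'
                   '<body><h1>some text</h1></body></html>'))
# ['Page Title', 'some text']
```

The resulting list can then be searched for the delimited strings, which keeps tag names and attributes out of the search.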

CodePudding user response：

The re module is a good match for this if the text looks like the following:

import re

html_text = """ a bunch of text i dont care about
-- something important -- more stuff i dont care about -- something else 
 important --"""
# non-greedy match between each pair of "--"; re.DOTALL lets a match span newlines
matches = re.findall(r"--(.*?)--", html_text, re.DOTALL)
print(matches)
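Applied to the --text-- example from the question, the same pattern could be wrapped up roughly like this. It is a sketch, not a definitive solution: extract_delimited is a made-up name, the sample string is invented, and stripping the surrounding whitespace is an assumption about what is wanted.

```python
import re

# Non-greedy match between a pair of "--" markers; DOTALL lets a match span lines.
DELIMITED = re.compile(r"--(.*?)--", re.DOTALL)


def extract_delimited(text):
    """Return the stripped text found between each pair of "--" markers."""
    return [m.strip() for m in DELIMITED.findall(text)]


sample = "<p>junk --first-- more junk</p><div>--second item--</div>"
print(extract_delimited(sample))
# ['first', 'second item']
```

One thing to watch for: HTML comments (<!-- ... -->) also contain "--", so on a real page they would be matched too unless they are removed first.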

If instead you want to grab all the text inside some HTML tag, then you should use something like BeautifulSoup (bs4):

# bs4 is third-party: pip install beautifulsoup4
import bs4
html_text = "<a class='interesting'>cool</a><div class='boring'>asd</div><div class='interesting'>beans</div>"
soup = bs4.BeautifulSoup(html_text, features="html.parser")
# find_all is the current name; findAll is the older alias
for e in soup.find_all(None, {"class": "interesting"}):
    print(e.text)

Lastly, if it's more like a--list--of--cool--stuff--with--delimiter

then you should probably just use my_string.split("--")
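For example, a quick sketch of the split approach (split_fields is a made-up helper name; dropping empty pieces is an assumption, since a leading or trailing "--" produces empty strings):

```python
def split_fields(text, delimiter="--"):
    """Split on the delimiter and drop the empty strings left by edge delimiters."""
    return [part for part in text.split(delimiter) if part]


my_string = "a--list--of--cool--stuff--with--delimiter"
print(my_string.split("--"))
# ['a', 'list', 'of', 'cool', 'stuff', 'with', 'delimiter']

print("--a--b--".split("--"))    # ['', 'a', 'b', '']
print(split_fields("--a--b--"))  # ['a', 'b']
```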