HTML parser find tag info

I have a project that uses HTMLParser(). I have never worked with this parser, so I read the documentation and found two useful methods I can override to extract information from the site: handle_starttag and handle_data. But I don't understand how to find the tags I need and pass their info to handle_data to print it.

I need to get the price from all span tags on the page

<span itemprop="price" content="590">590 dollars</span>

How do I get this?

CodePudding user response：

If every <span> price tag has an itemprop attribute of "price" and the dollar amount is in the content attribute, then you can do it all in handle_starttag like this:

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        attrsDict = dict(attrs)
        # Use .get() so that spans without these attributes don't raise KeyError
        if tag == 'span' and attrsDict.get('itemprop') == 'price':
            price = attrsDict.get('content')
            print(price)
            # do something else with `price` here


# Example test cases
parser = MyHTMLParser()
parser.feed("""
<span itemprop="price" content="590">590 dollars</span>
<span itemprop="price" content="430">430 dollars</span>
<span itemprop="price" content="684">684 dollars</span>
            """)

CodePudding user response：

This example initializes a custom HTMLParser and collects the text between the <span> tags (using handle_data):

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._price_tag = None  # name of the price tag we are currently inside, if any
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ('itemprop', 'price') in attrs:
            self._price_tag = tag

    def handle_endtag(self, tag):
        if tag == self._price_tag:
            self._price_tag = None

    def handle_data(self, data):
        if self._price_tag:
            self.prices.append(data)



parser = MyHTMLParser()
parser.feed("""\
<html>
    <span itemprop="price" content="570">570 dollars</span>
    <span itemprop="price" content="590">590 dollars</span>
</html>
""")

print(parser.prices)

Prints:

['570 dollars', '590 dollars']
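One caveat with collecting text in handle_data: if the price span contains other tags (for example <b>590</b> dollars), handle_data is called once per text fragment, so each fragment would end up as its own list entry. Below is a minimal sketch of one way to deal with that, assuming you want one string per price span. The names PriceTextParser, _depth and _buffer are made up for this illustration.

```python
from html.parser import HTMLParser


class PriceTextParser(HTMLParser):
    """Collects the text of each <span itemprop="price">, joining fragments."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._depth = 0    # > 0 while we are inside a price span
        self._buffer = []  # text fragments of the current price span

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a price span: count nested spans so that
            # their end tags do not close the outer one too early.
            if tag == "span":
                self._depth += 1
        elif tag == "span" and ("itemprop", "price") in attrs:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                # The price span is closed: store its joined text.
                self.prices.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


parser = PriceTextParser()
parser.feed(
    '<span itemprop="price" content="570"><b>570</b> dollars</span>'
    '<span itemprop="price" content="590">590 dollars</span>'
)
parser.close()
print(parser.prices)
```

This prints ['570 dollars', '590 dollars'] even though the first span's text arrives in two pieces.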

CodePudding user response：

To extract the price from the span tags on a page, you can override the handle_starttag method of the HTMLParser class to check for span tags with the itemprop attribute set to "price" and save the value of their content attribute, and override the handle_data method to print the saved value once the text inside the tag is reached.

Here's an example of how you could implement this:

from html.parser import HTMLParser

class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.price = None

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and ('itemprop', 'price') in attrs:
            # We've found a span tag with the itemprop attribute set to "price"
            for attr in attrs:
                if attr[0] == 'content':
                    # Extract the price value from the content attribute
                    self.price = attr[1]

    def handle_data(self, data):
        if self.price is not None:
            # We've found the price, so print it and reset the price value
            print(self.price)
            self.price = None

To use this parser, you would first need to retrieve the HTML of the page you want to extract the prices from, and then pass it to the parser like this:

html = '<html><body><span itemprop="price" content="590">590 dollars</span></body></html>'
parser = PriceParser()
parser.feed(html)

Michael M.'s approach, which does everything in the handle_starttag method, would also work to extract the price values from the span tags.

In that implementation, the handle_starttag method checks whether the tag is a span and whether its itemprop attribute is set to "price". If these conditions are met, it extracts the price value from the content attribute and prints it.

To use this parser, you would still need to retrieve the HTML of the page you want to extract the prices from, and then pass it to the parser using the feed method, as in the previous example.
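For the "retrieve the HTML" step, here is a rough sketch using only the standard library. It assumes the page is served as plain HTML that does not need JavaScript to render the prices. The names PriceCollector, prices_from_html and prices_from_url are hypothetical helpers written for this example, not part of html.parser.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PriceCollector(HTMLParser):
    """Stores the content attribute of every <span itemprop="price">."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("itemprop") == "price":
            content = attrs.get("content")
            if content is not None:
                self.prices.append(content)


def prices_from_html(html):
    """Parse an HTML string and return the list of price values."""
    parser = PriceCollector()
    parser.feed(html)
    parser.close()
    return parser.prices


def prices_from_url(url):
    """Download a page and return the price values found in it."""
    with urlopen(url) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return prices_from_html(html)
```

You would then call something like prices_from_url("https://example.com/products"), where the URL is only a placeholder for the page you actually want to parse.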

CodePudding user response：

The html.parser module provides the HTMLParser class for parsing HTML documents. When you subclass it, you can override its handler methods to customize its behavior. In your case, you can override the handle_starttag method to look for span tags with the itemprop attribute set to price, and the handle_data method to print the value of the content attribute.

Here's an example of how you could do this:

from html.parser import HTMLParser

class PriceParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Look for span tags with the itemprop attribute set to price
        if tag == 'span' and ('itemprop', 'price') in attrs:
            # Save the value of the content attribute, if the tag has one
            content = dict(attrs).get('content')
            if content is not None:
                self.price = content

    def handle_data(self, data):
        # Print the price when we encounter data inside the span tag
        if hasattr(self, 'price'):
            print(self.price)
            del self.price

# Create a parser object and feed it some HTML
parser = PriceParser()
parser.feed('<span itemprop="price" content="590">590 dollars</span>')

This will print "590" to the console. You can pass an entire HTML document to the feed method in one call, or call feed repeatedly with successive chunks of the document and then call close when you are done.
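feed can also be called several times with consecutive pieces of the same document, which is handy if you read the page in chunks. Here is a small sketch of that, assuming the price is always in the content attribute. ContentPriceParser and the chunk boundaries are made up for the example; the second boundary deliberately falls in the middle of a start tag.

```python
from html.parser import HTMLParser


class ContentPriceParser(HTMLParser):
    """Hypothetical helper: stores the content attribute of each price span."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("itemprop") == "price" and "content" in attrs:
            self.prices.append(attrs["content"])


# The second chunk ends in the middle of a start tag on purpose.
chunks = [
    '<span itemprop="price" content="590">590 dollars</span>',
    '<span itemprop="price" con',
    'tent="430">430 dollars</span>',
]

parser = ContentPriceParser()
for chunk in chunks:
    parser.feed(chunk)  # an incomplete tag is buffered until the rest arrives
parser.close()          # process anything still buffered

print(parser.prices)
```

This prints ['590', '430'], because the parser keeps the unfinished tag in its buffer until the next call to feed completes it.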

I hope this helps!