Selector error: Exception has occurred: AttributeError 'str' object has no attribute 'attrib'

I'm trying to scrape the score of the third tennis match in the table on this site.

I need to parse a match score into individual set scores from this html element:

html = (
    '''
    <td >
        <a data-override-transition="" data-ga-category="" data-ga-action="Click" data-ga-label="" data-use-ga="true" >
            <!-- Determine set tie break score --> 76 <sup>8</sup>
            <!-- Determine set tie break score --> 61
        </a>
    </td>
    '''
)

To simulate scraping an HtmlResponse object, I've used the following lines for this post:

from scrapy.http.response.html import HtmlResponse

response = HtmlResponse("xyz", body=html, encoding="utf-8")

I can get the raw text string of the match score using this xpath:

response.xpath("normalize-space()").get()

However, it would be quite an extensive job to build a parser that accounts for all the different versions of tiebreaks that exist in tennis. It would be much easier to identify a tiebreak score on the basis that it is always located in a superscript element. As sequence matters, the best solution I could come up with was to loop through each line of text, determine whether the line is plain text or a sup tag, and then assign it to a set score container. For this example let's assume it's a dictionary:

{"set_1_score": 76, "set_1_tiebreak_score": 8, "set_2_score": 61, "set_2_tiebreak_score": None}

As there will be a bit of additional complexity in this process to keep track of set numbers etc., I've simplified the code down to just printing whether a line of text is in fact a sup tag. From looking at the objects returned by the HtmlResponse, I can see there's an attrib attribute, which I think is what I need. However, I can't seem to access it. The following code:

text_lines = response.xpath("//text()")
for tl in text_lines:
    print(tl.attrib)

Gives the following error:

Exception has occurred: AttributeError
'str' object has no attribute 'attrib'

This is especially weird as when I run type(tl) it is actually a Selector object not a str.

Below is also a print of the details of the tl object:

special variables:
function variables:
class variables:
attrib: 'Traceback (most recent call last):\n  File "/Users/philipjoss/.vscode/extensions/ms-python.python-2022.18.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_resolver.py", line 162, in _get_py_dictionary\n    def _get_py_dictionary(self, var, names=None, used___dict__=False):\n  File "/Users/<me>/opt/miniconda3/envs/capra/lib/python3.9/site-packages/parsel/selector.py", line 387, in attrib\n    @property\nAttributeError: \'str\' object has no attribute \'attrib\'\n'
namespaces: {'re': 'http://exslt.org/reg...xpressions', 'set': 'http://exslt.org/sets'}
response: None
root: '\n        '
text: 'Traceback (most recent call last):\n  File "/Users/<me>/.vscode/extensions/ms-python.python-2022.18.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_resolver.py", line 162, in _get_py_dictionary\n    def _get_py_dictionary(self, var, names=None, used___dict__=False):\nAttributeError: text\n'
type: 'html'
_css2xpath: <bound method Selector._css2xpath of <Selector xpath='//text()' data='\n        '>>
_csstranslator: <parsel.csstranslator.HTMLTranslator object at 0x7fbfca5cf850>
_default_namespaces: {'re': 'http://exslt.org/reg...xpressions', 'set': 'http://exslt.org/sets'}
_default_type: None
_expr: '//text()'
_get_root: <bound method Selector._get_root of <Selector xpath='//text()' data='\n        '>>
_lxml_smart_strings: False
_parser: <class 'lxml.html.HTMLParser'>
_tostring_method: 'html'

I guess this is a two part question:

Why am I getting the error that I am? It looks like there might be an issue in the parsel package?
Is this the best way to cycle through the different lines of the text element to build the dictionary?

CodePudding user response：

The solution below is based on the HTML document above and the third score column:

import re

import scrapy
from scrapy.crawler import CrawlerProcess


class Sp1Spider(scrapy.Spider):
    name = 'sp1'
    start_urls = ['https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results']
    custom_settings = {'USER_AGENT': 'Mozilla/5.0'}

    def parse(self, response):
        # Each score cell is assumed to carry the class "day-table-score".
        for cell in response.xpath('//td[contains(@class, "day-table-score")]'):
            # Join every text node inside the score link (set scores and
            # <sup> tiebreaks), then strip all whitespace.
            text = ''.join(cell.xpath('./a//text()').getall())
            yield {'score': re.sub(r'\s+', '', text)}


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Sp1Spider)
    process.start()

Output:

{'score': '7562'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '466264'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '76861'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6375'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6161'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6264'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '76163'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6364'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6262'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6367263'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6464'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6464'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6462'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6464'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6726275'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '3676561'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6463'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6162'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '467610764'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6364'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6463'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6462'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '603664'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '7561'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6262'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6162'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6164'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '466364'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '67576461'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6030(RET)'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6164'}
2022-11-05 22:02:24 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-05 22:02:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 235,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 41729,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 2.673797,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 11, 5, 16, 2, 24, 775020),
 'httpcompression/response_bytes': 258687,
 'httpcompression/response_count': 1,
 'item_scraped_count': 31,

CodePudding user response：

You are getting the error because, when you use the .../text() xpath selector, each returned value is a Selector, but that Selector is just a wrapper around the string extracted by your query. Plain text strings do not have any attributes, so accessing attrib on such a Selector raises the AttributeError.

I think a similar but slightly better solution would be to iterate through the child nodes of the link inside each td.day-table-score and simply test whether each one is a <sup> element. If it is not a tag, you can assume that it is plain text and assign it as the score of a new set; otherwise you can assign it as the tiebreaker for the previous set in a dictionary.
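The bookkeeping described above can be tried out without Scrapy. The following is only a sketch under stated assumptions: build_sets is a hypothetical helper, and the (kind, value) token list stands in for the nodes a selector would return, with "text" marking a set score and "sup" marking a tiebreak.

```python
def build_sets(tokens):
    """Pair each set score with the tiebreak that follows it, if any.

    tokens is an ordered list of (kind, value) pairs, where kind is
    "text" for a set score and "sup" for a tiebreak score.
    """
    sets = []
    for kind, value in tokens:
        value = value.strip()
        if not value:
            continue  # ignore whitespace-only nodes
        if kind == "sup":
            if not sets:
                raise ValueError("tiebreak found before any set score")
            # A tiebreak always belongs to the most recently seen set.
            sets[-1]["tiebreaker"] = value
        else:
            sets.append({"score": value, "tiebreaker": None})
    return sets


# Tokens mirroring the third match in the question: 76 (8), then 61.
tokens = [("text", " 76 "), ("sup", "8"), ("text", "\n  "), ("text", " 61")]
result = build_sets(tokens)
print(result)
```

Keeping the sets in a list makes "the previous set" simply the last element, so no set counter is needed.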

For example:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results"]

    def parse(self, response):
        for tag in response.css("td.day-table-score a"):
            score = {}  # keeps track of score for this table row
            # Direct text nodes and <sup> children, in document order
            for elem in tag.xpath('./text()|sup'):
                if elem.re('<sup>'):
                    # A <sup> holds the tiebreak of the most recent set
                    val = elem.xpath('./text()').get().strip()
                    s = len(score)
                    score[f"set_{s}"]["tiebreaker"] = val
                else:
                    # Plain text starts a new set; skip whitespace-only nodes
                    s = len(score) + 1
                    val = elem.get().strip()
                    if val:
                        score[f"set_{s}"] = {"score": val, "tiebreaker": None}
            yield score  # the final score collection

OUTPUT

{'set_1': {'score': '75', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '46', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}, 'set_3': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '76', 'tiebreaker': '8'}, 'set_2': {'score': '61', 'tiebreaker': None}}
{'set_1': {'score': '63', 'tiebreaker': None}, 'set_2': {'score': '75', 'tiebreaker': None}}
{'set_1': {'score': '61', 'tiebreaker': None}, 'set_2': {'score': '61', 'tiebreaker': None}}
{'set_1': {'score': '62', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '76', 'tiebreaker': '1'}, 'set_2': {'score': '63', 'tiebreaker': None}}
{'set_1': {'score': '63', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '62', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '63', 'tiebreaker': None}, 'set_2': {'score': '67', 'tiebreaker': '2'}, 'set_3': {'score': '63', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '67', 'tiebreaker': '2'}, 'set_2': {'score': '62', 'tiebreaker': None}, 'set_3': {'score': '75', 'tiebreaker': None}}
{'set_1': {'score': '36', 'tiebreaker': None}, 'set_2': {'score': '76', 'tiebreaker': '5'}, 'set_3': {'score': '61', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '63', 'tiebreaker': None}}
{'set_1': {'score': '61', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '46', 'tiebreaker': None}, 'set_2': {'score': '76', 'tiebreaker': '10'}, 'set_3': {'score': '76', 'tiebreaker': '4'}}
{'set_1': {'score': '63', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '63', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '60', 'tiebreaker': None}, 'set_2': {'score': '36', 'tiebreaker': None}, 'set_3': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '75', 'tiebreaker': None}, 'set_2': {'score': '61', 'tiebreaker': None}}
{'set_1': {'score': '62', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '61', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
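If you want the flat layout from the question rather than the nested one, a small conversion step should be enough. This is a sketch, not part of the spider: flatten_score is a hypothetical helper, and it assumes every score and tiebreak is purely numeric, so an entry such as a retirement marker would need separate handling.

```python
def flatten_score(nested):
    """Convert {'set_1': {'score': '76', 'tiebreaker': '8'}, ...} to flat keys."""
    flat = {}
    for set_name, parts in nested.items():
        tiebreak = parts["tiebreaker"]
        # Assumes numeric strings; int() raises ValueError otherwise.
        flat[f"{set_name}_score"] = int(parts["score"])
        flat[f"{set_name}_tiebreak_score"] = None if tiebreak is None else int(tiebreak)
    return flat


# Third row of the output above.
row = {'set_1': {'score': '76', 'tiebreaker': '8'}, 'set_2': {'score': '61', 'tiebreaker': None}}
flat = flatten_score(row)
print(flat)
```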