I'm trying to scrape the score of the third tennis match in the table on this site.
I need to parse a match score into individual set scores from this HTML element:
html = (
'''
<td >
<a data-override-transition="" data-ga-category="" data-ga-action="Click" data-ga-label="" data-use-ga="true" >
<!-- Determine set tie break score --> 76 <sup>8</sup>
<!-- Determine set tie break score --> 61
</a>
</td>
'''
)
To simulate scraping an HtmlResponse object, I've used the following lines for this post:
from scrapy.http.response.html import HtmlResponse
response = HtmlResponse("xyz", body=html, encoding="utf-8")
I can get the raw text string of the match score using this xpath:
response.xpath("normalize-space()").get()
However, it would be quite an extensive job to build a parser that accounts for all the different versions of tiebreaks that exist in tennis. It would be much easier to identify a tiebreak score on the basis that it is always located in a superscript element. As sequence matters, the best solution I could come up with was to loop through each line of text, determine whether the line was text or a sup tag, and then assign it to a set score container. For this example let's assume it's a dictionary:
{"set_1_score": 76, "set_1_tiebreak_score": 8, "set_2_score": 61, "set_2_tiebreak_score": None}
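To make the intended loop concrete, here is a minimal sketch that builds that dictionary using only the standard library's html.parser instead of Scrapy. The names ScoreParser and parse_score are made up for this illustration, and it assumes every score token is purely numeric (a marker such as "(RET)" would need extra handling).

```python
from html.parser import HTMLParser


class ScoreParser(HTMLParser):
    """Hypothetical helper: collects set and tiebreak scores in document order."""

    def __init__(self):
        super().__init__()
        self.in_sup = False  # True while the parser is inside a <sup> element
        self.sets = []       # one [score, tiebreak] pair per set

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.in_sup = True

    def handle_endtag(self, tag):
        if tag == "sup":
            self.in_sup = False

    def handle_data(self, data):
        # HTML comments never reach this method, only real text does.
        for token in data.split():
            if self.in_sup:
                if self.sets:
                    # Superscript text is the tiebreak of the latest set.
                    self.sets[-1][1] = int(token)
            else:
                # Plain text starts a new set.
                self.sets.append([int(token), None])


def parse_score(html):
    parser = ScoreParser()
    parser.feed(html)
    parser.close()
    result = {}
    for n, (score, tiebreak) in enumerate(parser.sets, start=1):
        result[f"set_{n}_score"] = score
        result[f"set_{n}_tiebreak_score"] = tiebreak
    return result


html = '''
<td >
<a data-ga-action="Click" data-use-ga="true" >
<!-- Determine set tie break score --> 76 <sup>8</sup>
<!-- Determine set tie break score --> 61
</a>
</td>
'''
result = parse_score(html)
print(result)
```

Because the parser reports text and tags in document order, a tiebreak is always attached to the set that was read just before it.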
As there will be a bit of additional complexity in this process to keep track of set numbers etc., I've simplified the code down to just printing whether a line of text is in fact a sup tag. From looking at the HtmlResponse object I can see there's an attrib attribute, which I think is what I need. However, I can't seem to access it. The following code:
text_lines = response.xpath("//text()")
for tl in text_lines:
    print(tl.attrib)
Gives the following error:
Exception has occurred: AttributeError
'str' object has no attribute 'attrib'
This is especially weird because when I run type(tl), it is actually a Selector object, not a str.
Below is also a printout of the details of the tl object:
special variables:
function variables:
class variables:
attrib: 'Traceback (most recent call last):\n File "/Users/philipjoss/.vscode/extensions/ms-python.python-2022.18.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_resolver.py", line 162, in _get_py_dictionary\n def _get_py_dictionary(self, var, names=None, used___dict__=False):\n File "/Users/<me>/opt/miniconda3/envs/capra/lib/python3.9/site-packages/parsel/selector.py", line 387, in attrib\n @property\nAttributeError: \'str\' object has no attribute \'attrib\'\n'
namespaces: {'re': 'http://exslt.org/reg...xpressions', 'set': 'http://exslt.org/sets'}
response: None
root: '\n '
text: 'Traceback (most recent call last):\n File "/Users/<me>/.vscode/extensions/ms-python.python-2022.18.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_resolver.py", line 162, in _get_py_dictionary\n def _get_py_dictionary(self, var, names=None, used___dict__=False):\nAttributeError: text\n'
type: 'html'
_css2xpath: <bound method Selector._css2xpath of <Selector xpath='//text()' data='\n '>>
_csstranslator: <parsel.csstranslator.HTMLTranslator object at 0x7fbfca5cf850>
_default_namespaces: {'re': 'http://exslt.org/reg...xpressions', 'set': 'http://exslt.org/sets'}
_default_type: None
_expr: '//text()'
_get_root: <bound method Selector._get_root of <Selector xpath='//text()' data='\n '>>
_lxml_smart_strings: False
_parser: <class 'lxml.html.HTMLParser'>
_tostring_method: 'html'
I guess this is a two-part question:
- Why am I getting this error? It looks like there might be an issue in the parsel package?
- Is this the best way to cycle through the different lines of the text element to build the dictionary?
CodePudding user response:
The solution below is based on the HTML document above and the 3rd score column:
from scrapy.crawler import CrawlerProcess
import scrapy
import re


class Sp1Spider(scrapy.Spider):
    name = 'sp1'
    start_urls = ['https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results']
    custom_settings = {'USER_AGENT': 'Mozilla/5.0'}

    def parse(self, response):
        for row in response.xpath('//tbody//tr'):
            # Join every text node of the score link (the td.day-table-score cell)
            text = ''.join(row.xpath('.//td[contains(@class, "day-table-score")]/a//text()').getall())
            yield {
                # Remove all whitespace so only the digits remain
                'score': re.sub(r'\s+', '', text.strip().replace('\r\n', ''))
            }


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Sp1Spider)
    process.start()
Output:
{'score': '7562'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '466264'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '76861'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6375'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6161'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6264'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '76163'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6364'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6262'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6367263'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6464'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6464'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6462'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6464'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6726275'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '3676561'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6463'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6162'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '467610764'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6364'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6463'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6462'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '603664'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '7561'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6262'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6162'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6164'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '466364'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '67576461'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6030(RET)'}
2022-11-05 22:02:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results>
{'score': '6164'}
2022-11-05 22:02:24 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-05 22:02:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 235,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 41729,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.673797,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 5, 16, 2, 24, 775020),
'httpcompression/response_bytes': 258687,
'httpcompression/response_count': 1,
'item_scraped_count': 31,
CodePudding user response:
You are getting the error because when you use the .../text() xpath selector, the return value is a Selector, but that selector is just a wrapper around the string extracted by your query. Plain text strings do not have any attributes, so when the selector's attrib property tries to read the attributes of the wrapped string, it raises the AttributeError you see.
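The mismatch between type(tl) and the error message can be reproduced without Scrapy. The class below is a simplified, hypothetical stand-in for a selector that wraps a text node; it is not parsel's actual implementation, and it only assumes that the property delegates to the wrapped root object, which is what the traceback in the question suggests.

```python
class TextNodeSelector:
    """Hypothetical stand-in for a selector wrapping a text node."""

    def __init__(self, root):
        # For a text() match the wrapped root is a plain str, e.g. '\n '.
        self.root = root

    @property
    def attrib(self):
        # Delegates to the wrapped object; a str has no .attrib,
        # so the AttributeError is raised from inside this property.
        return dict(self.root.attrib)


tl = TextNodeSelector("\n ")
print(type(tl).__name__)  # the wrapper class, not str

message = None
try:
    tl.attrib
except AttributeError as exc:
    message = str(exc)
    print(message)
```

The object you hold is the wrapper, but the exception comes from the str inside it, which is why the message names 'str' even though type(tl) reports a selector.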
I think a similar but slightly better solution would be to simply test for the existence of a <sup> element as you iterate through each selector that is a direct child of the a element inside td.day-table-score. If it is not a tag, you can assume that it is plain text and assign it as the score of a new set; otherwise you can assign it as the tiebreaker for the previous set in a dictionary.
For example:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.atptour.com/en/scores/archive/waikoloa/574/2000/results"]

    def parse(self, response):
        for tag in response.css("td.day-table-score a"):
            score = {}  # keeps track of score for this table row
            # Text nodes and <sup> elements, in document order
            for elem in tag.xpath('./text()|sup'):
                if elem.re('<sup>'):
                    # A <sup> holds the tiebreaker of the most recent set
                    val = elem.xpath('./text()').get().strip()
                    s = len(score)
                    score[f"set_{s}"]["tiebreaker"] = val
                else:
                    # Non-empty plain text starts a new set
                    s = len(score) + 1
                    val = elem.get().strip()
                    if val:
                        score[f"set_{s}"] = {"score": val, "tiebreaker": None}
            yield score  # the final score collection
OUTPUT
{'set_1': {'score': '75', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '46', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}, 'set_3': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '76', 'tiebreaker': '8'}, 'set_2': {'score': '61', 'tiebreaker': None}}
{'set_1': {'score': '63', 'tiebreaker': None}, 'set_2': {'score': '75', 'tiebreaker': None}}
{'set_1': {'score': '61', 'tiebreaker': None}, 'set_2': {'score': '61', 'tiebreaker': None}}
{'set_1': {'score': '62', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '76', 'tiebreaker': '1'}, 'set_2': {'score': '63', 'tiebreaker': None}}
{'set_1': {'score': '63', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '62', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '63', 'tiebreaker': None}, 'set_2': {'score': '67', 'tiebreaker': '2'}, 'set_3': {'score': '63', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '67', 'tiebreaker': '2'}, 'set_2': {'score': '62', 'tiebreaker': None}, 'set_3': {'score': '75', 'tiebreaker': None}}
{'set_1': {'score': '36', 'tiebreaker': None}, 'set_2': {'score': '76', 'tiebreaker': '5'}, 'set_3': {'score': '61', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '63', 'tiebreaker': None}}
{'set_1': {'score': '61', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '46', 'tiebreaker': None}, 'set_2': {'score': '76', 'tiebreaker': '10'}, 'set_3': {'score': '76', 'tiebreaker': '4'}}
{'set_1': {'score': '63', 'tiebreaker': None}, 'set_2': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '63', 'tiebreaker': None}}
{'set_1': {'score': '64', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '60', 'tiebreaker': None}, 'set_2': {'score': '36', 'tiebreaker': None}, 'set_3': {'score': '64', 'tiebreaker': None}}
{'set_1': {'score': '75', 'tiebreaker': None}, 'set_2': {'score': '61', 'tiebreaker': None}}
{'set_1': {'score': '62', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
{'set_1': {'score': '61', 'tiebreaker': None}, 'set_2': {'score': '62', 'tiebreaker': None}}
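If you want the flat dictionary from the question instead of the nested one, a small conversion step could look like the sketch below. The function name flatten_score is made up for this example, and it assumes every score and tiebreaker string is numeric.

```python
def flatten_score(nested):
    """Convert {'set_1': {'score': '76', 'tiebreaker': '8'}, ...} to flat set_N_* keys."""
    flat = {}
    for set_key, values in nested.items():
        tiebreak = values["tiebreaker"]
        flat[f"{set_key}_score"] = int(values["score"])
        # Keep None when the set had no tiebreak
        flat[f"{set_key}_tiebreak_score"] = None if tiebreak is None else int(tiebreak)
    return flat


row = {
    "set_1": {"score": "76", "tiebreaker": "8"},
    "set_2": {"score": "61", "tiebreaker": None},
}
flat_row = flatten_score(row)
print(flat_row)
```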