I'm working on a sports betting scraper, but I've run into a complicated table. The markup below shows what most of the rows look like. My goal is to extract all the text from it (the participants' names, the date and time, the odds, etc.):
<tr data-qa="pre-event" ><th scope="row" ><div ><div >
20:05
</div> <div >
24/07
</div></div> <a href="/cote/sara-errani-paula-ormaechea/27034463/" data-testid="TENN" title="WTA - Varșovia - Calificări (F)"><div ><div ><div ><span ><!---->
Sara Errani
<!----></span> <!----></div><div ><span ><!---->
Paula Ormaechea
<!----></span> <!----></div> <!----></div> <div ><span ><!----> <!----> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" data-original-title="null"><path d="M18.545 6H5.455C4.655 6 4 6.668 4 7.5v9c0 .825.655 1.5 1.455 1.5h13.09c.8 0 1.455-.675 1.455-1.5v-9c0-.832-.655-1.5-1.455-1.5zm0 10.5H5.455v-9h13.09v9zM9.818 9v6l5.091-3-5.09-3z"></path></svg> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" data-original-title="null"><path d="M7.833 19.5H9.5V8.03H7.833V19.5zm3.334 0h1.666v-15h-1.666v15zm-6.667 0h1.667v-7.941H4.5V19.5zm10 0h1.667V8.03H14.5V19.5zm3.333-7.941V19.5H19.5v-7.941h-1.667z"></path></svg> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" data-original-title="null"><path d="M14.2 4.534a.532.532 0 00-.344-.504.503.503 0 00-.572.17l-6.07 7.862a.96.96 0 00-.131.996c.147.33.466.542.817.542h1.928c.142 0 .258.12.258.267v5.6c0 .226.138.428.344.503a.503.503 0 00.572-.17l6.07-7.862a.96.96 0 00.13-.996.899.899 0 00-.817-.542h-1.928a.262.262 0 01-.257-.267v-5.6z"></path></svg> <!----> <!----></span> <!----></div></div> <!----></a></th> <td ><div><section><div ><div >
Câştigător
</div> <div ><a href="/cote/sara-errani-paula-ormaechea/27034463/" >
4
</a></div></div> <div ><button aria-label="Bet on Sara Errani with odds 1.17." data-selnid="2685084631" data-qa="pre-event-selection" mc-data="[object Object]" event-url=""><!----> <!----> <!----> <!----> <span ><!--fragment#15ac200c85#head-->
1.17
<!--fragment#15ac200c85#tail--></span></button><button aria-label="Bet on Paula Ormaechea with odds 4.6." data-selnid="2685084632" data-qa="pre-event-selection" mc-data="[object Object]" event-url=""><!----> <!----> <!----> <!----> <span ><!--fragment#80111e10a3#head-->
4.60
<!--fragment#80111e10a3#tail--></span></button> <!----></div></section></div></td><td ></td><td ></td> <td >
4
</td></tr>
In this case, what I need is: '20:05; 24/07; Sara Errani; Paula Ormaechea; 4; 1.17; 4.6' and the link above 'Sara Errani'.
How can I loop through all the tr elements and extract the relevant data?
CodePudding user response:
With html_doc containing the HTML from your question, the approach is:
- inspect the soup and decide which pieces of data you want to extract
- find the classes/ids/names of the tags that hold that data (in this case only classes are needed)
- for each piece, note the tag name and how many occurrences to extract
- put that information into a mappings dictionary so the extraction can be driven by a loop
- iterate through the mappings
- look up the matching elements for each mapping entry
- collect the results
Regards...
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# key: [tag name, class attribute value, number of occurrences to extract]
mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "date": ["div", "events-list__grid__info__datetime__date", 1],
    "href": ["a", "GTM-event-link events-list__grid__info__main", 1],
    "name": ["span", "events-list__grid__info__main__participants__participant-name", 2],
    "link": ["a", "table__markets__market__title__markets__link", 1],
    "odd": ["span", "selections__selection__odd", 2]
}

results = {}
for k, lst in mappings.items():
    # all tags of the mapped type that carry the mapped class
    elems = soup.find_all(lst[0], attrs={'class': lst[1]})
    for i in range(lst[2]):
        if k != 'href':
            # text content, e.g. 'name_1': 'Sara Errani'
            results[k + '_' + str(i + 1)] = elems[i].text.strip()
        else:
            # for the event link, keep the URL instead of the text
            results[k + '_' + str(i + 1)] = elems[i]['href']

print(results)
#
# R e s u l t :
#
# {
# 'time_1': '20:05',
# 'date_1': '24/07',
# 'href_1': '/cote/sara-errani-paula-ormaechea/27034463/',
# 'name_1': 'Sara Errani',
# 'name_2': 'Paula Ormaechea',
# 'link_1': '4',
# 'odd_1': '1.17',
# 'odd_2': '4.60'
# }
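The markup in the question also carries data-qa attributes (data-qa="pre-event" on the row, data-qa="pre-event-selection" on each odds button). As an alternative to the class-based lookups above, a sketch along these lines could select on those attributes instead. It assumes every event row and odds button carries them, which a single sample row cannot confirm; the sample_html string and the function name events_by_data_qa are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modelled on the row in the question (an assumption:
# the real page has more nesting and extra attributes).
sample_html = """
<table><tr data-qa="pre-event"><th scope="row">
<a href="/cote/sara-errani-paula-ormaechea/27034463/" data-testid="TENN">event</a></th>
<td><button aria-label="Bet on Sara Errani with odds 1.17." data-qa="pre-event-selection"><span>
1.17
</span></button><button aria-label="Bet on Paula Ormaechea with odds 4.6." data-qa="pre-event-selection"><span>
4.60
</span></button></td></tr></table>
"""


def events_by_data_qa(html):
    """Return one dict per event row, located via data-qa attributes."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for row in soup.select('tr[data-qa="pre-event"]'):
        link = row.select_one("a[href]")  # first link in the row
        odds = [
            button.get_text(strip=True)
            for button in row.select('button[data-qa="pre-event-selection"]')
        ]
        events.append({"href": link["href"] if link else None, "odds": odds})
    return events


events_by_attr = events_by_data_qa(sample_html)
print(events_by_attr)
```

Whether attribute selectors are more stable than the class names depends on the site; it is worth checking a few rows in the browser before relying on either.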
ADDITION:
With your latest data as html_doc (https://pastebin.com/nx6x00NX), I added iteration over the table rows and event numbering.
The pretty() function is by STH (user:56338), from "How to pretty print nested dictionaries?".
As long as you can get the soup for the table, this row iteration will work; the rest of the code is the same as before.
from bs4 import BeautifulSoup


def pretty(dct, indent=0):  # function by ---> STH user:56338
    # print a nested dict, one tab of indentation per nesting level
    for key, value in dct.items():
        print('\t' * indent + str(key))
        if isinstance(value, dict):
            pretty(value, indent + 1)
        else:
            print('\t' * (indent + 1) + str(value))


soup = BeautifulSoup(html_doc, 'html.parser')

# key: [tag name, class attribute value, number of occurrences to extract]
mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "date": ["div", "events-list__grid__info__datetime__date", 1],
    "href": ["a", "GTM-event-link events-list__grid__info__main", 1],
    "name": ["span", "events-list__grid__info__main__participants__participant-name", 2],
    "link": ["a", "table__markets__market__title__markets__link", 1],
    "odd": ["span", "selections__selection__odd", 2]
}

events = {}
# one <tr> per event
rows = soup.find_all("tr", attrs={'class': "events-list__grid__event"})
nr = 0
for row_soup in rows:
    results = {}  # fresh dict for every row
    for k, lst in mappings.items():
        # search only inside the current row
        elems = row_soup.find_all(lst[0], attrs={'class': lst[1]})
        for i in range(lst[2]):
            if k != 'href':
                results[k + '_' + str(i + 1)] = elems[i].text.strip()
            else:
                results[k + '_' + str(i + 1)] = elems[i]['href']
    nr += 1
    events['event_' + str(nr)] = results

pretty(events)
#
''' R e s u l t
event_1
    time_1
        22:47
    date_1
        24/07
    href_1
        https://ro.betano.com/cote/sophia-yang-tatum-burger/27018714/
    name_1
        Sophia Yang
    name_2
        Tatum Burger
    link_1
        4
    odd_1
        1.87
    odd_2
        1.87
event_2
    time_1
        23:30
    date_1
        24/07
    href_1
        https://ro.betano.com/cote/cleo-hutchinson-seha-yu/27018746/
    name_1
        Cleo Hutchinson
    name_2
        Seha YU
    link_1
        4
    odd_1
        1.87
    odd_2
        1.87
event_3
    time_1
        23:30
    date_1
        24/07
    href_1
        https://ro.betano.com/cote/laura-bente-josie-frazier/27018754/
    name_1
        Laura Bente
    name_2
        Josie Frazier
    link_1
        4
    odd_1
        1.87
    odd_2
        1.87
event_4
    time_1
        00:00
    date_1
        25/07
    href_1
        https://ro.betano.com/cote/kelly-keller-emma-sun/27018749/
    name_1
        Kelly Keller
    name_2
        Emma Sun
    link_1
        4
    odd_1
        1.45
    odd_2
        2.60
event_5
    time_1
        00:00
    date_1
        25/07
    href_1
        https://ro.betano.com/cote/nadia-kojonroj-tanvi-narendran/27018750/
    name_1
        Nadia Kojonroj
    name_2
        Tanvi Narendran
    link_1
        4
    odd_1
        1.87
    odd_2
        1.87
'''
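One thing to watch: the loops above index elems[i] directly, so a row with fewer matching elements than the mapping expects would raise an IndexError. A minimal sketch of a more tolerant variant follows, reusing two of the class names from the mappings. The helper name extract_row, the shortened mappings and the sample HTML (whose second row deliberately has only one odd) are hypothetical and only there to make the example self-contained.

```python
from bs4 import BeautifulSoup

# Shortened copy of the mappings: key -> [tag, class, expected count]
mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "odd": ["span", "selections__selection__odd", 2],
}

# Hypothetical sample: the second row is missing its second odd.
sample_html = """
<table>
<tr class="events-list__grid__event">
  <td><div class="events-list__grid__info__datetime__time"> 22:47 </div>
  <span class="selections__selection__odd">1.87</span>
  <span class="selections__selection__odd">1.87</span></td>
</tr>
<tr class="events-list__grid__event">
  <td><div class="events-list__grid__info__datetime__time"> 23:30 </div>
  <span class="selections__selection__odd">1.45</span></td>
</tr>
</table>
"""


def extract_row(row_soup, mappings):
    """Extract the mapped fields from one row; missing elements become None."""
    results = {}
    for key, (tag, css_class, count) in mappings.items():
        elems = row_soup.find_all(tag, class_=css_class)
        for i in range(count):
            found = i < len(elems)
            results[f"{key}_{i + 1}"] = elems[i].get_text(strip=True) if found else None
    return results


soup = BeautifulSoup(sample_html, "html.parser")
rows = soup.find_all("tr", class_="events-list__grid__event")
safe_events = {
    f"event_{nr}": extract_row(row, mappings)
    for nr, row in enumerate(rows, start=1)
}
print(safe_events)
```

Storing None keeps every event dict the same shape, which makes it easier to spot incomplete rows later; skipping such rows entirely would be an equally reasonable choice.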