Scraping complicated tables with BeautifulSoup

Time:07-26

I'm working on a sports betting scraper, but I've run into a complicated table. The code below shows what most of the elements look like. My main goal is to extract all the text from it (the names of the participants, the date and time, the odds, etc.).

<tr data-qa="pre-event" ><th scope="row" ><div ><div >
            20:05
          </div> <div >
            24/07
          </div></div> <a href="/cote/sara-errani-paula-ormaechea/27034463/"  data-testid="TENN" title="WTA - Varșovia - Calificări (F)"><div ><div ><div ><span ><!---->
                  Sara Errani
                  <!----></span> <!----></div><div ><span ><!---->
                  Paula Ormaechea
                  <!----></span> <!----></div> <!----></div> <div ><span ><!----> <!----> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1"  data-original-title="null"><path d="M18.545 6H5.455C4.655 6 4 6.668 4 7.5v9c0 .825.655 1.5 1.455 1.5h13.09c.8 0 1.455-.675 1.455-1.5v-9c0-.832-.655-1.5-1.455-1.5zm0 10.5H5.455v-9h13.09v9zM9.818 9v6l5.091-3-5.09-3z"></path></svg> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1"  data-original-title="null"><path d="M7.833 19.5H9.5V8.03H7.833V19.5zm3.334 0h1.666v-15h-1.666v15zm-6.667 0h1.667v-7.941H4.5V19.5zm10 0h1.667V8.03H14.5V19.5zm3.333-7.941V19.5H19.5v-7.941h-1.667z"></path></svg> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1"  data-original-title="null"><path d="M14.2 4.534a.532.532 0 00-.344-.504.503.503 0 00-.572.17l-6.07 7.862a.96.96 0 00-.131.996c.147.33.466.542.817.542h1.928c.142 0 .258.12.258.267v5.6c0 .226.138.428.344.503a.503.503 0 00.572-.17l6.07-7.862a.96.96 0 00.13-.996.899.899 0 00-.817-.542h-1.928a.262.262 0 01-.257-.267v-5.6z"></path></svg> <!----> <!----></span> <!----></div></div> <!----></a></th> <td ><div><section><div ><div >
      Câştigător
    </div> <div ><a href="/cote/sara-errani-paula-ormaechea/27034463/" >
         4
      </a></div></div> <div ><button aria-label="Bet on Sara Errani with odds 1.17." data-selnid="2685084631" data-qa="pre-event-selection"  mc-data="[object Object]" event-url=""><!----> <!----> <!----> <!----> <span ><!--fragment#15ac200c85#head-->
    1.17
  <!--fragment#15ac200c85#tail--></span></button><button aria-label="Bet on Paula Ormaechea with odds 4.6." data-selnid="2685084632" data-qa="pre-event-selection"  mc-data="[object Object]" event-url=""><!----> <!----> <!----> <!----> <span ><!--fragment#80111e10a3#head-->
    4.60
  <!--fragment#80111e10a3#tail--></span></button> <!----></div></section></div></td><td ></td><td ></td> <td >
         4
      </td></tr>

In this case, what I need is: '20:05; 24/07; Sara Errani; Paula Ormaechea; 4; 1.17; 4.6', plus the link above 'Sara Errani'.

How can I loop through all the tr elements and extract the relevant data?

CodePudding user response:

With html_doc containing your data from the question:

  1. analyze the soup and create mappings for the data you want to extract
    • find the classes/ids/names of the tags you want to extract (in this case only classes)
    • define the tag and how many of them to extract
    • build your own mappings so that you can iterate over them
  2. iterate through the mappings
    • do the extraction using your mappings
  3. collect the results

Regards...

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "date": ["div", "events-list__grid__info__datetime__date", 1],
    "href": ["a", "GTM-event-link events-list__grid__info__main", 1],
    "name": ["span", "events-list__grid__info__main__participants__participant-name", 2],
    "link": ["a", "table__markets__market__title__markets__link", 1],
    "odd": ["span", "selections__selection__odd", 2]
    }
results = {}

for k, lst in mappings.items():
    # lst = [tag name, class attribute, number of elements to extract]
    elems = soup.find_all(lst[0], attrs={'class': lst[1]})
    for i in range(lst[2]):
        if k != 'href':
            results[k + '_' + str(i + 1)] = elems[i].text.strip()
        else:
            results[k + '_' + str(i + 1)] = elems[i]['href']

print(results)
#
#   R e s u l t :
#
#   { 
#     'time_1': '20:05', 
#     'date_1': '24/07', 
#     'href_1': '/cote/sara-errani-paula-ormaechea/27034463/', 
#     'name_1': 'Sara Errani', 
#     'name_2': 'Paula Ormaechea', 
#     'link_1': ' 4', 
#     'odd_1': '1.17', 
#     'odd_2': '4.60'
#   }
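
The HTML as pasted in the question shows no class attributes, but it does show data-qa="pre-event" on the row and data-qa="pre-event-selection" on the odds buttons. As an alternative to class-based mappings, here is a minimal sketch that relies only on those visible attributes. It assumes they are stable on the live page, which the question does not confirm; sample_html is a trimmed, hypothetical copy of the row, and parse_row and extract_events are illustrative names.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the row from the question (SVG icons removed).
sample_html = """
<table>
<tr data-qa="pre-event"><th scope="row"><div><div> 20:05 </div> <div> 24/07 </div></div>
<a href="/cote/sara-errani-paula-ormaechea/27034463/" data-testid="TENN">
<div><span><!----> Sara Errani <!----></span></div>
<div><span><!----> Paula Ormaechea <!----></span></div></a></th>
<td><button aria-label="Bet on Sara Errani with odds 1.17." data-qa="pre-event-selection"><span> 1.17 </span></button>
<button aria-label="Bet on Paula Ormaechea with odds 4.6." data-qa="pre-event-selection"><span> 4.60 </span></button></td></tr>
</table>
"""

def parse_row(row):
    """Extract time, date, link, names and odds from one event row."""
    header = row.find("th")
    link = header.find("a", href=True)
    # stripped_strings skips HTML comments and whitespace-only strings.
    # Assumption: time and date are the first two text fragments in the header.
    header_texts = list(header.stripped_strings)
    buttons = row.find_all("button", attrs={"data-qa": "pre-event-selection"})
    return {
        "time": header_texts[0],
        "date": header_texts[1],
        "href": link["href"],
        "names": list(link.stripped_strings),
        "odds": [b.get_text(strip=True) for b in buttons],
    }

def extract_events(html):
    """Parse every row marked data-qa="pre-event" in the document."""
    soup = BeautifulSoup(html, "html.parser")
    return [parse_row(r) for r in soup.find_all("tr", attrs={"data-qa": "pre-event"})]

for event in extract_events(sample_html):
    print(event)
```

Matching on data-qa attributes avoids depending on long generated class names, but either approach breaks if the site changes its markup.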

ADDITION:
With your latest data as html_doc (https://pastebin.com/nx6x00NX), I added row iteration and event numbering.
The pretty() function is by STH (user:56338), from "How to pretty print nested dictionaries?".
As long as you can get the soup for the table, this row iteration will work; the rest of the code is the same as before.
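
One caveat when iterating over many rows: indexing elems[i] directly raises IndexError if a row has fewer matches than the mapping expects (for example, an event with only one odd listed). A minimal sketch of a guard follows; nth_value is a hypothetical helper name and the one-row html string is made up for illustration.

```python
from bs4 import BeautifulSoup

def nth_value(row, tag, css_class, index, attr=None):
    """Return the text (or an attribute) of the index-th match, or None if it is missing."""
    elems = row.find_all(tag, attrs={'class': css_class})
    if index >= len(elems):
        return None  # fewer matches than expected in this row
    elem = elems[index]
    return elem.get(attr) if attr else elem.get_text(strip=True)

# Hypothetical row containing a single odd instead of the usual two.
html = ('<table><tr class="events-list__grid__event"><td>'
        '<span class="selections__selection__odd"> 1.45 </span>'
        '</td></tr></table>')
row = BeautifulSoup(html, 'html.parser').find('tr')

first_odd = nth_value(row, 'span', 'selections__selection__odd', 0)
second_odd = nth_value(row, 'span', 'selections__selection__odd', 1)
print(first_odd, second_odd)
```

Returning None keeps the loop running and leaves it to the caller to decide whether to skip or log incomplete rows.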

from bs4 import BeautifulSoup

def pretty(dct, indent=0):      # function by ---> STH user:56338
    for key, value in dct.items():
        print('\t' * indent + str(key))
        if isinstance(value, dict):
            pretty(value, indent + 1)
        else:
            print('\t' * (indent + 1) + str(value))
         
soup = BeautifulSoup(html_doc, 'html.parser')

mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "date": ["div", "events-list__grid__info__datetime__date", 1],
    "href": ["a", "GTM-event-link events-list__grid__info__main", 1],
    "name": ["span", "events-list__grid__info__main__participants__participant-name", 2],
    "link": ["a", "table__markets__market__title__markets__link", 1],
    "odd": ["span", "selections__selection__odd", 2]
    }
    
events = {}
results = {}
rows = soup.find_all("tr", attrs={'class': "events-list__grid__event"})
nr = 0
for row_soup in rows:
    for k, lst in mappings.items():
        # search only inside the current row, not the whole document
        elems = row_soup.find_all(lst[0], attrs={'class': lst[1]})
        for i in range(lst[2]):
            if k != 'href':
                results[k + '_' + str(i + 1)] = elems[i].text.strip()
            else:
                results[k + '_' + str(i + 1)] = elems[i]['href']
    nr += 1
    events['event_' + str(nr)] = results
    results = {}
    
pretty(events)
#
'''     R e s u l t
event_1
        time_1
                22:47
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/sophia-yang-tatum-burger/27018714/
        name_1
                Sophia Yang
        name_2
                Tatum Burger
        link_1
                 4
        odd_1
                1.87
        odd_2
                1.87
event_2
        time_1
                23:30
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/cleo-hutchinson-seha-yu/27018746/
        name_1
                Cleo Hutchinson
        name_2
                Seha YU
        link_1
                 4
        odd_1
                1.87
        odd_2
                1.87
event_3
        time_1
                23:30
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/laura-bente-josie-frazier/27018754/
        name_1
                Laura Bente
        name_2
                Josie Frazier
        link_1
                 4
        odd_1
                1.87
        odd_2
                1.87
event_4
        time_1
                00:00
        date_1
                25/07
        href_1
                https://ro.betano.com/cote/kelly-keller-emma-sun/27018749/
        name_1
                Kelly Keller
        name_2
                Emma Sun
        link_1
                 4
        odd_1
                1.45
        odd_2
                2.60
event_5
        time_1
                00:00
        date_1
                25/07
        href_1
                https://ro.betano.com/cote/nadia-kojonroj-tanvi-narendran/27018750/
        name_1
                Nadia Kojonroj
        name_2
                Tanvi Narendran
        link_1
                 4
        odd_1
                1.87
        odd_2
                1.87
'''
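
If you would rather not carry a custom printing function, the standard library's json.dumps with an indent can display a nested dictionary in a similar way. This is only a sketch: the events dictionary here is a shortened, hand-written sample shaped like the one built above.

```python
import json

# Shortened sample shaped like the `events` dictionary built above.
events = {
    "event_1": {
        "time_1": "22:47",
        "date_1": "24/07",
        "name_1": "Sophia Yang",
        "name_2": "Tatum Burger",
        "odd_1": "1.87",
        "odd_2": "1.87",
    }
}

# ensure_ascii=False keeps non-ASCII characters (e.g. Romanian names) readable.
formatted = json.dumps(events, indent=4, ensure_ascii=False)
print(formatted)
```

The same string can be written to a file and loaded back later with json.loads.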