Is there a way to scrape HTML popup tables/charts with python?

Time:04-09

I'm currently looking to scrape https://www.bestfightodds.com/ for an MMA machine learning project. Specifically, I want the DraftKings opening odds for each fighter, which you find by clicking on a fighter's odds under the DraftKings column. You are then presented with a popup table showing how the betting odds have changed over time, including the opening odds and the latest (current) odds.

I have no issue scraping the fighter names, but I can't figure out how to scrape the opening odds in the popup table. The popup table's HTML only appears in the browser inspector after you click on the odds, which is why I get None when I try to find it in the page's HTML.

This is my code so far:

# Importing packages
from bs4 import BeautifulSoup
import requests

# Specifying website URL
html_text = requests.get('https://www.bestfightodds.com/events/ufc-273-2411').text
soup = BeautifulSoup(html_text, 'lxml')

# Finding values
fighter_names = soup.find_all('span', class_='t-b-fcc')
opening_odds = soup.find_all('span', style='margin-left: 4px; margin-right: 4px;')

for fighter_name in fighter_names:
    print(fighter_name.get_text())

Here is a photo of where and how to locate the opening odds. The blue box is where you click to find the red one, which is the one I need to scrape for all fighters.

CodePudding user response:

The popups are triggered by JavaScript, so your scraper needs to be able to execute JavaScript on the page. I know apify.com uses what is called headless Chrome/Chromium automation. You can check out the Python library pyppeteer (an unofficial port of puppeteer) on GitHub.

CodePudding user response:

Fun little project. The data the server sends is encoded by a custom JavaScript function, so you either need to use Selenium or rewrite the decoding function in Python.

I used js2py to execute the JavaScript function directly in Python instead of using Selenium (js2py translates the JavaScript function to Python automatically), but you can rewrite it by hand if you wish:

import json
import js2py
import requests
from bs4 import BeautifulSoup


js_decode_func = r"""function $(e) {
    var t,
        a,
        r,
        s,
        o,
        i,
        l = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=',
        n = '',
        d = 0;
    for (e = e.replace(/[^A-Za-z0-9\+\/\=]/g, ''); d < e.length;) t = l.indexOf(e.charAt(d++)) << 2 | (s = l.indexOf(e.charAt(d++))) >> 4,
        a = (15 & s) << 4 | (o = l.indexOf(e.charAt(d++))) >> 2,
        r = (3 & o) << 6 | (i = l.indexOf(e.charAt(d++))),
        n += String.fromCharCode(t),
        64 != o && (n += String.fromCharCode(a)),
        64 != i && (n += String.fromCharCode(r));
    for (var c = '', h = 0, p = c1 = c2 = 0; h < n.length;)(p = n.charCodeAt(h)) < 128 ? (c += String.fromCharCode(p), h++) : 191 < p && p < 224 ? (c2 = n.charCodeAt(h + 1), c += String.fromCharCode((31 & p) << 6 | 63 & c2), h += 2) : (c2 = n.charCodeAt(h + 1), c3 = n.charCodeAt(h + 2), c += String.fromCharCode((15 & p) << 12 | (63 & c2) << 6 | 63 & c3), h += 3);
    var u,
        f,
        m,
        g = '!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~',
        y = new String,
        $ = g.length;
    for (u = 0; u < c.length; u++) m = c.charAt(u),
        0 <= (f = g.indexOf(m)) && (m = g.charAt((f + $ / 2) % $)),
        y += m;
    return y
}"""


js_get_value_func = r"""function $(e) {
  return 2 <= e ? '+' + Math.round(100 * (e - 1)) : e < 2 ? '' + Math.round(-100 / (e - 1)) : 'error'
}"""

decode = js2py.eval_js(js_decode_func)
get_value = js2py.eval_js(js_get_value_func)

url = "https://www.bestfightodds.com/"
api_url = "https://www.bestfightodds.com/api/ggd"

params = {"b": "22", "m": "25728", "p": "1"}

soup = BeautifulSoup(requests.get(url).content, "html.parser")

for td in soup.select("td[data-li]"):
    vals = json.loads(td["data-li"])
    if len(vals) != 3 or vals[0] != 22:  # 22 - DraftKings
        continue
    params["b"], params["p"], params["m"] = vals
    name = td.find_previous(class_="t-b-fcc").text
    encoded_text = requests.get(api_url, params=params).text
    data = json.loads(decode(encoded_text))
    first_value = get_value(data[0]["data"][0]["y"])
    print(name, first_value)

Prints:

Alexander Volkanovski -450
Chan Sung Jung +340
Aljamain Sterling +320
Petr Yan -425
Mackenzie Dern +120
Tecia Torres -140
Mark Madsen +130
Vinc Pichel -150
Darian Weeks +190
Ian Garry -235
Mickey Gall +145
Mike Malott -165
Aspen Ladd +155
Raquel Pennington -180
Anthony Hernandez -180
Josh Fremd +155
Aleksei Oleinik -105
Jared Vanderaa -115
Kay Hansen -150
Piera Rodriguez +130
Daniel Santos +175
Julio Arce -210
Belal Muhammad +150
Vicente Luque -170
Devin Clark -160
William Knight +140
Jordan Leavitt +110
Trey Ogden -130
Elizeu Zaleski Dos Santos -195
Mounir Lazzez +165
Pat Sabatini -305
T.J. Laramie +240
Mayra Bueno Silva -365
Yanan Wu +280
Lina Akhtar Lansberg +245
Pannie Kianzad -310
Chris Barnett +165
Martin Buday -195
Andre Fialho +150
Miguel Baeza -170
Brandon Jenkins +320
Drakkar Klose -425
Jesse Ronson +110
Rafa Garcia -130
Caio Borralho +115
Gadzhi Omargadzhiev -135
Istela Nunes -190
Sam Hughes +160
Heili Alateng -180
Kevin Croom +155
Carla Esparza +150
Rose Namajunas -170
Glover Teixeira +155
Jiri Prochazka -180
Dustin Poirier -435
Nate Diaz +330
Charles Oliveira -160
Justin Gaethje +140
Gilbert Burns +280
Khamzat Chimaev -365
Arman Tsarukyan -335
Joel Alvarez +260
Calvin Cattar +170
Giga Chikadze -200
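Incidentally, the js2py dependency isn't strictly necessary: the site's custom function is just standard base64 decoding followed by a ROT47-style substitution over the 94 printable ASCII characters, and the value function converts decimal odds into American odds. A pure-Python sketch of both helpers (`js_round` is a stand-in for JavaScript's `Math.round`, which rounds halves toward positive infinity):

```python
import base64
import math
import re

# '!' .. '~' - the 94 printable ASCII characters the substitution rotates over
PRINTABLE = ''.join(chr(c) for c in range(33, 127))


def js_round(x: float) -> int:
    # JavaScript's Math.round: halves round toward +infinity
    return math.floor(x + 0.5)


def decode(encoded: str) -> str:
    # 1) keep only base64 characters, 2) base64-decode to UTF-8 text
    cleaned = re.sub(r'[^A-Za-z0-9+/=]', '', encoded)
    text = base64.b64decode(cleaned).decode('utf-8')
    # 3) rotate every printable character by half the alphabet (ROT47)
    half = len(PRINTABLE) // 2  # 47
    return ''.join(
        PRINTABLE[(PRINTABLE.index(ch) + half) % len(PRINTABLE)]
        if ch in PRINTABLE else ch
        for ch in text
    )


def get_value(e: float) -> str:
    # decimal odds -> American odds, mirroring the JS helper above
    if e >= 2:
        return '+' + str(js_round(100 * (e - 1)))
    return str(js_round(-100 / (e - 1)))
```

Swapping these in for the js2py versions in the script above should give the same results: the decoded payload is JSON, `data[0]["data"][0]["y"]` is the opening line, and the last point of that series should be the latest one.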