Scraping a dynamic JavaScript table with Python

Time:12-06

I'm trying to scrape this website: https://madduxsports.com/college-basketball-lines.php
I'm very new to Python and scraping. I believe this website has a table generated with JavaScript.
I'm looking to scrape just the first 7 columns. Here is what I've tried:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://madduxsports.com/college-basketball-lines.php")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
script_tags = soup.find_all("script")
print(script_tags)

This returns every <script> tag, one of which contains the table data, but I don't know how to get the first 7 columns out of it.

Thanks for the help

CodePudding user response:

You could get it through the request directly, but you'll need to do a bit of manipulation of the HTML escape characters and so on. This gets you the same data as if we pulled it from the <script> tag. I can show you how to get it that way as well if you'd like, but this is the better way in my opinion.

import requests
import pandas as pd

url = 'https://madduxsports.com/newodds/v2/scheduler-ajax.php'
payload = {
'timezone': 'America/New_York',
'is_first_request': '0',
'league_id': '4',
'sport_id': '2',
'period_id': '1'}


jsonData = requests.post(url, data=payload).json()

# Everything above is just to get the data
# jsonData is the JSON you see in the <script> tag


odds = jsonData['odds']
schedulers = jsonData['schedulers']

odds_df = pd.json_normalize(odds)
schedulers_df = pd.json_normalize(schedulers)

names_dict = {}
for each in odds:
    names_dict[each['id']] = each['name']

cols = []
for col in schedulers_df:
    for k, v in names_dict.items():
        col = col.replace(str(k),v)
        
    cols.append(col)

schedulers_df.columns = cols

cols = ['date', 'team_ids', 'team_names',
        'score.away_score', 'score.home_score',
        'score.description', 'opener.1.away', 'opener.1.home']

odds_cols = [x for x in schedulers_df.columns if ('1.away' in x or '1.home' in x) and ('class' not in x)]

df = schedulers_df[cols + odds_cols]

Output:

print(df)
                    date          team_ids  ... odds.SIA.1.away odds.SIA.1.home
0    2021-12-03 00:00:00  306123<br>306124  ...     143&frac12;      -1&frac12;
1    2021-12-03 00:00:00  306127<br>306128  ...     142&frac12;              11
2    2021-12-03 00:00:00  306129<br>306130  ...  126&frac12;u12      -5&frac12;
3    2021-12-03 00:00:00  306131<br>306132  ...              17     146&frac12;
4    2021-12-03 01:00:00  306133<br>306134  ...      -2&frac12;     135&frac12;
..                   ...               ...  ...             ...             ...
107  2021-12-04 07:50:00  396155<br>396156  ...                                
108  2021-12-04 07:50:00  396157<br>396158  ...                                
109  2021-12-04 07:50:00  396159<br>396160  ...                                
110  2021-12-04 07:50:00      9875<br>9876  ...                                
111  2021-12-04 07:50:00      9877<br>9878  ...                                

[112 rows x 22 columns]
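
The output still contains the HTML escapes mentioned at the start of the answer (`&frac12;`, `<br>`). One possible way to tidy them up is sketched below. It uses only the standard library, the helper name `clean_cell` and the sample rows are made up for illustration, and replacing `<br>` with " / " is just one choice of separator.

```python
import html


def clean_cell(value):
    """Decode HTML entities and replace <br> separators in one cell.

    Non-string values (numbers, None) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    # html.unescape turns entities such as &frac12; into their characters
    return html.unescape(value).replace("<br>", " / ")


# Hypothetical rows shaped like the output above (not real scraped data)
sample_rows = [
    {"team_ids": "306123<br>306124", "odds.SIA.1.away": "143&frac12;"},
    {"team_ids": "306129<br>306130", "odds.SIA.1.away": "126&frac12;u12"},
]

cleaned_rows = [
    {key: clean_cell(value) for key, value in row.items()}
    for row in sample_rows
]

for row in cleaned_rows:
    print(row)
```

With the DataFrame from the answer, the same helper could probably be applied column by column, for example `df[col] = df[col].map(clean_cell)`. If only the leading columns are wanted, `df.iloc[:, :7]` slices by position, though it is worth checking that those line up with the seven columns shown on the page.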