How to scrape JavaScript table from website to dataframe?

Time:10-06

I am trying to scrape a JavaScript-rendered table from a website (https://iborrowdesk.com) into a pandas DataFrame. The soup contains only a reference to the script, not the table itself. The MWE and the soup output are given below. Is this possible, and if so, how?

MWE

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/72.0.3626.28 Safari/537.36'}
session = requests.Session()
website = session.get('https://iborrowdesk.com', headers=headers, timeout=10)
website.raise_for_status()
soup = BeautifulSoup(website.text, 'lxml')
table = soup.find('table', class_='table table-condensed table-hover')
data = pd.read_html(str(table))[0]

Soup output

<html><head><link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/site.webmanifest" rel="manifest"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileColor"/>
<meta content="#ffffff" name="theme-color"/>
<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.6/flatly/bootstrap.min.css" rel="stylesheet"/>
<meta charset="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/>
<title>IBorrowDesk</title><script src="//cdn.thisiswaldo.com/static/js/9754.js"></script>
</head><body><div ></div><script src="/static/main.bundle.js?39ed89dd02e44899ebb4">
</script></body></html>

CodePudding user response:

You can use requests directly, since the site exposes an API.

import pandas as pd
import requests


def get_data() -> pd.DataFrame:
    url = "https://iborrowdesk.com/api/most_expensive"

    with requests.Session() as session:
        response = session.get(url, timeout=10)
    # Raise an HTTPError for 4xx/5xx responses instead of continuing silently.
    response.raise_for_status()

    # The rows are stored under the "results" key of the JSON payload.
    data = response.json()

    return pd.json_normalize(data=data["results"])


df = get_data()
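
To see what the last step does without hitting the network, here is a minimal sketch that runs pd.json_normalize on a hand-written payload. The field names are taken from the answers in this thread, but the values and the helper name payload_to_frame are made up for illustration, so treat this as an assumption about the payload shape rather than the real API response.

```python
import pandas as pd

# Hypothetical payload mimicking the {"results": [...]} shape used above.
# Field names come from this thread; the values are invented.
sample_payload = {
    "results": [
        {"symbol": "AAA", "name": "Example A", "latest_fee": 12.5, "latest_available": 1000},
        {"symbol": "BBB", "name": "Example B", "latest_fee": 3.1, "latest_available": 250},
    ]
}


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Turn a parsed JSON payload into a DataFrame, one row per result."""
    if "results" not in payload:
        raise KeyError("expected a top-level 'results' key in the payload")
    return pd.json_normalize(payload["results"])


df_sample = payload_to_frame(sample_payload)
print(df_sample)
```

Checking for the "results" key up front gives a clearer error than a bare KeyError deep inside the function if the response shape ever differs from what is assumed here.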

CodePudding user response:

As Jason Baker mentioned in his post, you can use the API that's provided. Alternatively, you can use Selenium to scrape the data. The question "Python webscraping: BeautifulSoup not showing all html source content" is relevant here. It explains why requests.Session().get(url) cannot retrieve all of the elements in the DOM: the elements are created by JavaScript, so the initial page source does not contain them, and they are only inserted once the script runs in a browser. That question also has a code snippet in its answers, which I've updated to match your case:

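from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Firefox()
browser.get('https://iborrowdesk.com/')
# The table is inserted by JavaScript, so wait (up to 10 s) for it to appear.
table = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
).get_attribute("outerHTML")
browser.quit()
# StringIO avoids the pandas deprecation of passing literal HTML strings.
data = pd.read_html(StringIO(table))[0]
print(data)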
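
Before reaching for a browser, it can help to confirm that the table really is missing from the raw HTML. The sketch below is one way to do that with only the standard library (html.parser rather than BeautifulSoup); the HTML string is a trimmed copy of the soup output from the question, and the class name TagCounter is just an illustrative helper.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag name is opened in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


# Trimmed from the soup output in the question: a script bundle, no table.
raw_html = (
    '<html><head><title>IBorrowDesk</title></head>'
    '<body><div></div>'
    '<script src="/static/main.bundle.js?39ed89dd02e44899ebb4"></script>'
    '</body></html>'
)

counter = TagCounter()
counter.feed(raw_html)
has_table = counter.counts.get("table", 0) > 0
has_scripts = counter.counts.get("script", 0) > 0
print(f"table in raw HTML: {has_table}, scripts present: {has_scripts}")
```

If there is no table tag but there are script tags, the content is most likely rendered client-side, which is the situation where either the API or Selenium is needed.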

CodePudding user response:

The table is populated dynamically by JavaScript from an API, so you can fetch the data directly with requests and pandas, using the right API URL:

import requests
import pandas as pd

api_url = 'https://iborrowdesk.com/api/most_expensive'
req = requests.get(api_url).json()['results']
df = pd.DataFrame(req)[['country', 'cusip', 'latest_available', 'latest_fee',
                        'latest_market_cap', 'name', 'symbol', 'updated']]
print(df)

Output:

   country      cusip  ...  symbol              updated
0      usa   42427963  ...   ZIONP  2022-10-05T16:45:03
1      usa  560751549  ...    EVEX  2022-10-05T16:45:03
2      usa  326089294  ...     PCF  2022-10-05T16:45:03
3      usa  568407570  ...    GROV  2022-10-05T16:45:03
4      usa  543625224  ...    VIVK  2022-10-05T16:45:03
5      usa  563316591  ...    CMRA  2022-10-05T16:45:03
6      usa  443914905  ...    KSPN  2022-10-05T16:45:03
7      usa  530965695  ...    BBAI  2022-10-05T16:45:03
8      usa  576125128  ...    MGAM  2022-10-05T16:45:03
9      usa  550356389  ...    ALLG  2022-10-05T16:45:03
10     usa  566361445  ...    SNTI  2022-10-03T16:45:03
11     usa  337888499  ...    LOGC  2022-10-05T16:45:03
12     usa  569569731  ...     PGY  2022-10-04T16:45:03
13     usa  582325897  ...    WEST  2022-10-05T16:45:03
14     usa  575436677  ...    GETY  2022-10-05T16:45:03
15     usa  469692616  ...    EVAX  2022-10-05T16:45:03
16     usa  578663230  ...     TGL  2022-10-05T16:45:03
17     usa  545665918  ...     BWV  2022-10-05T16:45:03
18     usa  478807158  ...    LVTX  2022-10-05T16:45:03
19     usa  211079981  ...   WINSF  2022-10-05T16:45:03
20     usa   16201977  ...     NBH  2022-10-05T16:45:03
21     usa  564701487  ...    BHAT  2022-10-05T16:45:03
22     usa   42511636  ...    AXTG  2022-10-05T16:45:03
23     usa  484429340  ...    NUWE  2022-10-05T16:45:03
24     usa  564931628  ...    BCAN  2022-10-05T16:45:03

[25 rows x 8 columns]
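
The "..." in the output above is only pandas eliding columns for display; all 8 columns are still in the DataFrame. As a small sketch of one way to see them all, DataFrame.to_string() renders every column. The column names are the ones selected in the answer; the single row holds placeholder values, not real API data.

```python
import pandas as pd

# Column names are those selected in the answer above; the row contains
# placeholder values (hypothetical), not data returned by the API.
columns = ['country', 'cusip', 'latest_available', 'latest_fee',
           'latest_market_cap', 'name', 'symbol', 'updated']
demo = pd.DataFrame(
    [['usa', 0, 0, 0.0, 0.0, 'Example Corp', 'EXMP', '2022-10-05T16:45:03']],
    columns=columns,
)

# to_string() renders the whole frame, so no column is replaced by "...".
full_text = demo.to_string()
print(full_text)
```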