I am trying to scrape a JavaScript-rendered table from a website (https://iborrowdesk.com) into a dataframe. The soup contains only the script location, not the table itself. The MWE and the soup output are given below. Is it possible to get this table into a dataframe, and if so, how?
MWE
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/72.0.3626.28 Safari/537.36'}
session = requests.Session()
website = session.get('https://iborrowdesk.com', headers=headers, timeout=10)
website.raise_for_status()
soup = BeautifulSoup(website.text, 'lxml')
table = soup.find('table', class_='table table-condensed table-hover')
data = pd.read_html(str(table))[0]
Soup output
<html><head><link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/site.webmanifest" rel="manifest"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileColor"/>
<meta content="#ffffff" name="theme-color"/>
<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.6/flatly/bootstrap.min.css" rel="stylesheet"/>
<meta charset="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/>
<title>IBorrowDesk</title><script src="//cdn.thisiswaldo.com/static/js/9754.js"></script>
</head><body><div ></div><script src="/static/main.bundle.js?39ed89dd02e44899ebb4">
</script></body></html>
CodePudding user response:
You can use requests on its own, since the site exposes an API.
import pandas as pd
import requests


def get_data() -> pd.DataFrame:
    url = "https://iborrowdesk.com/api/most_expensive"
    with requests.Session() as session:
        response = session.get(url, timeout=10)
        # Raise an exception on 4xx/5xx responses instead of printing None.
        response.raise_for_status()
        # The table rows are stored under the "results" key of the JSON payload.
        return pd.json_normalize(data=response.json()["results"])


df = get_data()
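The code above reads the "results" key of the JSON response. If the endpoint ever returns something else (an error object, for example), that lookup fails with a bare KeyError. Below is a minimal sketch of a shape check that could run before building the dataframe. The helper name extract_results and the sample payload are hypothetical; the sample is hand-written from values shown in the output on this page, not a captured API response.

```python
import json


def extract_results(payload_text: str) -> list:
    """Return the list stored under "results", or raise ValueError."""
    payload = json.loads(payload_text)
    if not isinstance(payload, dict) or "results" not in payload:
        raise ValueError('expected a JSON object with a "results" key')
    results = payload["results"]
    if not isinstance(results, list):
        raise ValueError('"results" should be a list of row objects')
    return results


# Hand-written sample, NOT a real API response; the values are copied from
# the output printed elsewhere on this page.
sample = (
    '{"results": [{"country": "usa", "symbol": "ZIONP", '
    '"updated": "2022-10-05T16:45:03"}]}'
)
rows = extract_results(sample)
print(rows[0]["symbol"])
```

The returned list of dicts can then be passed to pd.DataFrame(rows) or pd.json_normalize(rows).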
CodePudding user response:
As Jason Baker mentioned in his post, you can use the API that the site provides. Alternatively, you can use Selenium to scrape the data. The question "Python webscraping: BeautifulSoup not showing all html source content" is relevant here: it explains why requests.Session().get(url) cannot retrieve all of the elements in the DOM. The elements are created by JavaScript, so the HTML that the server returns does not contain them; they are inserted only after a browser runs the page's scripts. The answers to that question also contain a code snippet, which I've updated to match your question:
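from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
browser = webdriver.Firefox()
try:
    browser.get('https://iborrowdesk.com/')
    # The table is inserted by JavaScript, so wait up to 10 s for it to appear.
    locator = (By.TAG_NAME, 'table')
    table = WebDriverWait(browser, 10).until(EC.presence_of_element_located(locator)).get_attribute("outerHTML")
finally:
    browser.quit()
data = pd.read_html(StringIO(table))[0]  # StringIO: newer pandas deprecates raw HTML strings
print(data)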
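To see why the plain requests approach finds nothing, you can inspect the static HTML directly. The sketch below uses only the standard library and a trimmed copy of the soup output from the question. The TagCounter class and count_tags helper are names made up for this illustration.

```python
from html.parser import HTMLParser

# Trimmed copy of the static HTML from the question's soup output.
STATIC_HTML = """<html><head><title>IBorrowDesk</title>
<script src="//cdn.thisiswaldo.com/static/js/9754.js"></script>
</head><body><div></div>
<script src="/static/main.bundle.js?39ed89dd02e44899ebb4"></script>
</body></html>"""


class TagCounter(HTMLParser):
    """Count how often each opening tag occurs in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html: str) -> dict:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


counts = count_tags(STATIC_HTML)
print("table tags:", counts.get("table", 0))
print("script tags:", counts.get("script", 0))
```

The static source has script tags but no table tag, which is why soup.find('table', ...) returns None in the question's code. The table only exists after a browser runs main.bundle.js.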
CodePudding user response:
You can use Python, pandas, and requests to grab the table data, which JavaScript populates dynamically from an API. With the right api_url, it looks like this:
import requests
import pandas as pd
api_url = 'https://iborrowdesk.com/api/most_expensive'
req = requests.get(api_url).json()['results']
columns = ['country', 'cusip', 'latest_available', 'latest_fee',
           'latest_market_cap', 'name', 'symbol', 'updated']
df = pd.DataFrame(req)[columns]
print(df)
Output:
    country      cusip  ...  symbol              updated
0       usa   42427963  ...   ZIONP  2022-10-05T16:45:03
1       usa  560751549  ...    EVEX  2022-10-05T16:45:03
2       usa  326089294  ...     PCF  2022-10-05T16:45:03
3       usa  568407570  ...    GROV  2022-10-05T16:45:03
4       usa  543625224  ...    VIVK  2022-10-05T16:45:03
5       usa  563316591  ...    CMRA  2022-10-05T16:45:03
6       usa  443914905  ...    KSPN  2022-10-05T16:45:03
7       usa  530965695  ...    BBAI  2022-10-05T16:45:03
8       usa  576125128  ...    MGAM  2022-10-05T16:45:03
9       usa  550356389  ...    ALLG  2022-10-05T16:45:03
10      usa  566361445  ...    SNTI  2022-10-03T16:45:03
11      usa  337888499  ...    LOGC  2022-10-05T16:45:03
12      usa  569569731  ...     PGY  2022-10-04T16:45:03
13      usa  582325897  ...    WEST  2022-10-05T16:45:03
14      usa  575436677  ...    GETY  2022-10-05T16:45:03
15      usa  469692616  ...    EVAX  2022-10-05T16:45:03
16      usa  578663230  ...     TGL  2022-10-05T16:45:03
17      usa  545665918  ...     BWV  2022-10-05T16:45:03
18      usa  478807158  ...    LVTX  2022-10-05T16:45:03
19      usa  211079981  ...   WINSF  2022-10-05T16:45:03
20      usa   16201977  ...     NBH  2022-10-05T16:45:03
21      usa  564701487  ...    BHAT  2022-10-05T16:45:03
22      usa   42511636  ...    AXTG  2022-10-05T16:45:03
23      usa  484429340  ...    NUWE  2022-10-05T16:45:03
24      usa  564931628  ...    BCAN  2022-10-05T16:45:03

[25 rows x 8 columns]