So I am trying to scrape the following webpage: https://www.omscentral.com/
The main table there is what I am interested in: I want to scrape it along with all of its content. When I inspect the page, the table sits inside a <table> tag, so I figured it would be easy to access with the code below.
import requests
from bs4 import BeautifulSoup

url = 'https://www.omscentral.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('table')
However, that code only returns the table header. I saw a similar example here, but the solution of switching the parser did not work.
When I look at the soup object itself, it seems that requests does not expand the table and only captures the header. I am not sure what to do here, so any advice would be much appreciated!
Answer:
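The table content is stored as JSON in a <script> tag with the id __NEXT_DATA__ and is rendered dynamically in the browser by JavaScript. requests only fetches the raw HTML and does not run that JavaScript, so you have to extract the data from the script tag yourself.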
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
To display it as a DataFrame (with pandas imported as pd), simply use:
pd.DataFrame(data)
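The chained lookup above raises an exception as soon as the script tag or one of the keys is missing, for example if the site changes its layout. The following is only a minimal sketch of a more defensive variant, not part of the original answer. It assumes the JSON keeps the props -> pageProps -> courses path, the helper name extract_courses is hypothetical, and it swaps BeautifulSoup for a regular expression so that it runs with the standard library alone.

```python
import json
import re

# Matches the contents of <script id="__NEXT_DATA__" ...>...</script>.
# A regex stands in for BeautifulSoup here only to keep the sketch stdlib-only.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_courses(html):
    """Return the list at props.pageProps.courses, or [] if anything is missing."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return []
    try:
        node = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    # Walk the assumed key path one level at a time instead of chaining [] lookups.
    for key in ("props", "pageProps", "courses"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []
```

Called as extract_courses(requests.get(url, headers=headers).text), it would return an empty list instead of raising when the page no longer matches these assumptions.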
Example
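import json
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.omscentral.com/'
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
# The page data is embedded as JSON in the <script id="__NEXT_DATA__"> tag
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
for item in data:
    print(item['name'], item.get('officialURL'))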
Output
Introduction to Information Security https://omscs.gatech.edu/cs-6035-introduction-to-information-security
Computing for Good https://omscs.gatech.edu/cs-6150-computing-good
Introduction to Operating Systems https://omscs.gatech.edu/cs-6200-introduction-operating-systems
Advanced Operating Systems https://omscs.gatech.edu/cs-6210-advanced-operating-systems
Secure Computer Systems https://omscs.gatech.edu/cs-6238-secure-computer-systems
Computer Networks https://omscs.gatech.edu/cs-6250-computer-networks
...
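If the goal is to keep the scraped table rather than print it, the course dicts can also be written out as CSV. This is a rough sketch under assumptions: the helper name courses_to_csv is hypothetical, the only fields used are the two that appear in the example above (name and officialURL), and a missing key simply becomes an empty cell.

```python
import csv
import io


def courses_to_csv(courses, fields=("name", "officialURL")):
    """Serialize the selected fields of each course dict to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    for course in courses:
        # .get() leaves the cell blank when a course lacks the field
        writer.writerow({field: course.get(field, "") for field in fields})
    return buffer.getvalue()
```

Passing the data list from the example to courses_to_csv(data) would give text that can be written to a file or read back by any spreadsheet tool.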