I want to scrape data from the https://ksanahealth.com/mental-health-blog/ website.
I am trying to go through each blog post in the listing, follow its link, and scrape the details from that post's page.
I tried BeautifulSoup first, but it returned no data, and I realized the content is loaded dynamically with JavaScript.
Then I tried Selenium, and this is the code I came up with:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
service = Service('/usr/bin/chromedrivers')
service.start()
driver = webdriver.Remote(service.service_url)
driver.get('https://ksanahealth.com/mental-health-blog/');
driver.quit()
Unfortunately, my code returns no results.
How can I improve it so that I get the desired results from the blog?
CodePudding user response:
You don't need Selenium for this. When a page loads its content dynamically, you can open the Network tab in your browser's developer tools and see which URLs are being requested. The following code will get you started: it returns a dataframe with each blog's title and URL. You can then request those URLs to scrape the detail pages. Let me know if you need further guidance.
The code is below:
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'accept': 'application/json'
}

df_list = []
# Request pages 1 to 4 of the AJAX endpoint seen in the Network tab.
for x in range(1, 5):
    r = requests.get(f'https://ksanahealth.com/wp-admin/admin-ajax.php?id=&post_id=107&slug=mental-health-blog&canonical_url=https://ksanahealth.com/mental-health-blog/&posts_per_page=10&page={x}&offset=0&post_type=post&repeater=default&seo_start_page=1&preloaded=false&preloaded_amount=0&order=DESC&orderby=date&action=alm_get_posts&query_type=standard', headers=headers)
    # The JSON response carries the rendered posts as an HTML string under 'html'.
    soup = BeautifulSoup(r.json()['html'], 'html.parser')
    for y in soup.select('div.post-item'):
        # Collect (title, link) for each post on this page.
        df_list.append((y.select_one('h4').text.strip(), y.select_one('a.more-link').get('href')))

df = pd.DataFrame(df_list, columns = ['Title', 'URL'])
print(df)
This returns:
                                               Title                                                URL
0  Addressing the Youth Mental Health Crisis Requ...  https://www.hmpgloballearningnetwork.com/site/...
1  Remote work: What does it mean for local offic...  https://www.klcc.org/2022-02-23/remote-work-wh...
2                                     Second Nature?  https://www.oregonbusiness.com/article/tech/it...
3  6 Benefits of Continuous Behavioral Health Mea...  https://ksanahealth.com/post/6-benefits-of-con...
4              A New Level of Measurement-Based Care  https://ksanahealth.com/post/a-new-level-of-me...
5  4 Ways Continuous Behavioral Health Measuremen...  https://ksanahealth.com/post/4-ways-continuous...
[....]
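The long request URL above is easier to read and adjust when its query string is built from a dictionary. The following is only a sketch: the endpoint and parameter values are copied from the URL in the code above, while the names AJAX_URL, BASE_PARAMS and build_page_url are hypothetical helpers introduced here for illustration. It assumes the server does not care about parameter order and accepts a percent-encoded canonical_url, which is the usual behaviour but has not been checked against this site.

```python
from urllib.parse import urlencode

# Endpoint and query parameters copied from the request URL used above.
AJAX_URL = "https://ksanahealth.com/wp-admin/admin-ajax.php"
BASE_PARAMS = {
    "id": "",
    "post_id": "107",
    "slug": "mental-health-blog",
    "canonical_url": "https://ksanahealth.com/mental-health-blog/",
    "posts_per_page": "10",
    "offset": "0",
    "post_type": "post",
    "repeater": "default",
    "seo_start_page": "1",
    "preloaded": "false",
    "preloaded_amount": "0",
    "order": "DESC",
    "orderby": "date",
    "action": "alm_get_posts",
    "query_type": "standard",
}


def build_page_url(page):
    """Return the AJAX URL for one page of posts (hypothetical helper)."""
    # Copy the shared parameters and add the page number for this request.
    params = {**BASE_PARAMS, "page": str(page)}
    # urlencode percent-encodes the values, e.g. the ':' and '/' in canonical_url.
    return f"{AJAX_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URLs for the same four pages requested in the loop above.
    for page in range(1, 5):
        print(build_page_url(page))
```

Alternatively, requests can do the encoding itself if you pass the dictionary through its params argument, for example requests.get(AJAX_URL, params={**BASE_PARAMS, "page": x}, headers=headers) inside the same loop.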