How to efficiently scrape data from dynamic websites using Selenium?

I want to scrape data from the https://ksanahealth.com/mental-health-blog/ website.
I am trying to go through each blog post on that page, follow its link, and scrape the content of the post's detail page.

I tried BeautifulSoup first, but it returned no data, and I realized the posts are loaded dynamically with JavaScript.
Then I tried Selenium, and this is the code I came up with:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service('/usr/bin/chromedrivers')
service.start()

driver = webdriver.Remote(service.service_url)
driver.get('https://ksanahealth.com/mental-health-blog/')
driver.quit()

Unfortunately, my code returns no results.

How can I improve it so that I get the desired data from the blog?

CodePudding user response:

You don't need Selenium for this. When a page loads its content dynamically, you can open the Network tab in your browser's developer tools and see which URLs it requests. The following code will get you started: it returns a DataFrame with each blog post's title and URL. You can then request those URLs to scrape the details. Let me know if you need further guidance.

The code is below:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Ask for JSON; the response carries the rendered posts under the 'html' key.
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
           'accept': 'application/json'
          }

df_list = []
# Request pages 1 to 4 of the listing, 10 posts per page.
for x in range(1, 5):
    r = requests.get(f'https://ksanahealth.com/wp-admin/admin-ajax.php?id=&post_id=107&slug=mental-health-blog&canonical_url=https://ksanahealth.com/mental-health-blog/&posts_per_page=10&page={x}&offset=0&post_type=post&repeater=default&seo_start_page=1&preloaded=false&preloaded_amount=0&order=DESC&orderby=date&action=alm_get_posts&query_type=standard', headers=headers)
    # Parse the HTML fragment embedded in the JSON response.
    soup = BeautifulSoup(r.json()['html'], 'html.parser')
    for y in soup.select('div.post-item'):
        # Collect each post's title and the href of its a.more-link anchor.
        df_list.append((y.select_one('h4').text.strip(), y.select_one('a.more-link').get('href')))

df = pd.DataFrame(df_list, columns = ['Title', 'URL'])
print(df)

This returns:

   Title                                              URL
0  Addressing the Youth Mental Health Crisis Requ...  https://www.hmpgloballearningnetwork.com/site/...
1  Remote work: What does it mean for local offic...  https://www.klcc.org/2022-02-23/remote-work-wh...
2  Second Nature?                                     https://www.oregonbusiness.com/article/tech/it...
3  6 Benefits of Continuous Behavioral Health Mea...  https://ksanahealth.com/post/6-benefits-of-con...
4  A New Level of Measurement-Based Care              https://ksanahealth.com/post/a-new-level-of-me...
5  4 Ways Continuous Behavioral Health Measuremen...  https://ksanahealth.com/post/4-ways-continuous.
[....]
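
To follow up on accessing those URLs, here is a minimal sketch of how each collected link might be fetched and parsed. The helper names parse_post and scrape_posts, the delay and timeout values, and the selectors are all assumptions of mine. I have not inspected the detail pages, and several links point to external sites with their own layouts, so the sketch uses generic tags only (the first h1, then the p tags inside article if there is one). Adjust the selectors after looking at the real pages.

```python
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}


def parse_post(html):
    """Extract a heading and paragraph text from one post page.

    The selectors are generic guesses, not taken from the actual site.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Prefer the first <h1>; fall back to the <title> tag if there is none.
    heading = soup.find('h1') or soup.find('title')
    # Look inside <article> when present, otherwise search the whole document.
    container = soup.find('article') or soup
    paragraphs = [p.get_text(' ', strip=True) for p in container.find_all('p')]
    return {
        'heading': heading.get_text(strip=True) if heading else None,
        'text': '\n'.join(p for p in paragraphs if p),
    }


def scrape_posts(urls, delay=1.0):
    """Fetch each URL in turn and return one parsed record per page that loads."""
    records = []
    for url in urls:
        r = requests.get(url, headers=HEADERS, timeout=30)
        if r.ok:
            records.append({'URL': url, **parse_post(r.text)})
        time.sleep(delay)  # pause between requests to go easy on the servers
    return records
```

With the DataFrame from above, something like details = pd.DataFrame(scrape_posts(df['URL'])) should give one row per post that loaded. Keeping the parsing in its own function lets you try selectors on a saved HTML file without hitting the network.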