How to use selenium and bs4 to scrape html loaded through AJAX


I am trying to scrape the job listings from the Asda jobs page, but the scrape never returns the elements I am looking for.

import time
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless')

url = "https://www.asda.jobs/vacancy/find/results/"
browser = webdriver.Chrome(options=options)  # chrome_options= is deprecated

browser.get(url)

# Scroll to the bottom in case the listings only load on scroll
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Give the AJAX content time to load
time.sleep(30)

soup = BeautifulSoup(browser.page_source, 'html.parser')
data = soup.find("div", {"class": "ListGridContainer"})

print(soup.prettify())

I checked whether the website uses infinite scrolling, but I did not know how to retrieve elements loaded that way. The page source that comes back contains part of the page, but the rest is just the JavaScript used to load the job listings rather than the listings themselves.
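For reference, the usual pattern for infinite-scroll pages is to keep scrolling until the document height stops growing. A minimal sketch, reusing the browser and time objects from above, on the assumption that the grid does lazy-load on scroll:

# Scroll until the document height stops growing, i.e. no more
# content is being lazy-loaded.
last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the AJAX request time to complete
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height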

CodePudding user response:

Try this:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless')

url = "https://www.asda.jobs/vacancy/find/results/"
browser = webdriver.Chrome(options=options)  # chrome_options= is deprecated

browser.get(url)

browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# A short fixed wait is usually enough for the AJAX content to arrive
time.sleep(5)

# find_element_by_tag_name was removed in Selenium 4
html = browser.find_element(By.TAG_NAME, 'html').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'html.parser')
data = soup.find("div", {"class": "ListGridContainer"})

print(soup.prettify())
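As a side note, a fixed time.sleep is fragile: too short and the grid has not loaded yet, too long and every run wastes time. An explicit wait that blocks until the grid element actually appears is usually more robust. A minimal sketch, reusing browser and By from above, assuming the ListGridContainer div is what signals that the listings have loaded:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 30 s until the listings grid is present in the DOM;
# raises TimeoutException if it never appears.
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, "ListGridContainer"))
)
soup = BeautifulSoup(browser.page_source, 'html.parser')
data = soup.find("div", {"class": "ListGridContainer"})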

CodePudding user response:

I think using the https://www.asda.jobs/vacancy/find/results/ajaxaction/posbrowser_gridhandler/?movejump=[required page number - 1]&movejump_page=[required page number] link will make your work easier.

This link takes two parameters that determine which page is shown: movejump, whose value is one less than the required page number, and movejump_page, whose value is the required page number itself. For example, page 2 uses &movejump=1&movejump_page=2.

There is also a cookie problem, but I have handled it in the code below:

import requests
from bs4 import BeautifulSoup

url = "https://www.asda.jobs/vacancy/find/results/ajaxaction/posbrowser_gridhandler/?"

# An initial request sets the session cookies the grid handler expects
s = requests.Session()
s.get("https://www.asda.jobs/vacancy/find/results/")

# The pagestamp parameter is embedded in the earcusession cookie
pagestamp = s.cookies['earcusession'][5:-8]
url = url + f"pagestamp={pagestamp}"

# Fetch the first three pages of results
for i in range(3):
    page = s.get(url + f"&movejump={i}&movejump_page={i + 1}")
    soup = BeautifulSoup(page.content, "lxml")
    print(soup.find("div", {"class": "ListGridContainer"}).prettify())
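To pull individual listings out of each page rather than printing the whole grid, something like the following inside the loop should work; note that treating each listing as an anchor tag is an assumption about the grid's markup:

grid = soup.find("div", {"class": "ListGridContainer"})
# Assumes each job listing is rendered as an <a> inside the grid
for a in grid.find_all("a", href=True):
    print(a.get_text(strip=True), a["href"])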