Scrapy: sending an AJAX request to get the generated HTML


I want to crawl data from sites like this.

To show all offers manually, the "Show More Results" button at the bottom of the page has to be clicked repeatedly until everything is displayed. Each click sends an AJAX request to the server, which responds with more HTML (which I want to scrape).

The request URL, copied from the browser's network tools, looks like this:

https://www.cardmarket.com/en/Magic/AjaxAction

but I don't want to leave the starting URL, just load more results on it. Also, the response is neither JSON nor plain HTML; it is XML that always looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<ajaxResponse><rows>PGRpdiBpZ...</rows><newPage>1</newPage></ajaxResponse>
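As far as I can tell, the value inside <rows> is Base64-encoded HTML. A minimal decode sketch (the payload below pads the truncated value from above out to a valid Base64 string, for illustration only):

import base64
import xml.etree.ElementTree as ET

response_text = ('<?xml version="1.0" encoding="UTF-8"?>'
                 '<ajaxResponse><rows>PGRpdiBpZD0i</rows><newPage>1</newPage></ajaxResponse>')

root = ET.fromstring(response_text)
# <rows> holds the Base64-encoded HTML fragment
html = base64.b64decode(root.findtext('rows')).decode('utf-8')
print(html)  # -> '<div id="'

So the HTML is in there; the question is how to request and assemble it efficiently.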

Other answers to similar questions usually dealt with a JSON response or plain HTML, or recommended Beautiful Soup, but I am also concerned about crawling speed.

How could I load the missing HTML and get the data in an efficient way?

CodePudding user response:

The example below, using Selenium, BeautifulSoup, and pandas, works smoothly. I use JavaScript execution to scroll to the "Show More Results" button and click it until all offers are loaded.

Example:

import time
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

url = 'https://www.cardmarket.com/en/Magic/Products/Singles/Exodus/Survival-of-the-Fittest'
driver.get(url)
time.sleep(5)

lst = []
while True:
    # Re-parse the page on every pass and rebuild the list, so the
    # final pass (after the last click) contains every offer exactly once
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # Attribute selector reconstructed; adjust it to the page's current markup
    lst = [{'name': card.text} for card in soup.select('[class*="seller-name"] > a')]

    try:
        # Scroll the "Show More Results" button into view, then click it
        button = driver.find_element(By.XPATH, '//*[@id="loadMore"]/button/span[2]')
        driver.execute_script("arguments[0].scrollIntoView();", button)
        button.click()
        time.sleep(2)
    except Exception:
        # Button is gone: all offers are loaded
        break

df = pd.DataFrame(lst)
print(df)

Output:

                 name
0               Lugones
1              odaJoana
2        Arcana-Trieste
3        Arcana-Trieste
4              Impavido
..                  ...
145              yoeril
146  JacobMartinNielsen
147               Artia
148               Nanau
149           magictuga

[150 rows x 1 columns]
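If Selenium turns out to be too slow, the AjaxAction endpoint could in principle be called directly from Scrapy and the <rows> payload decoded, avoiding the browser entirely. This is only a sketch: the form data (here a single placeholder field args) is hypothetical and has to be copied from the browser's network tab, and the CSS selector is a guess:

import base64
import xml.etree.ElementTree as ET

import scrapy


class SellersSpider(scrapy.Spider):
    name = 'sellers'
    start_urls = ['https://www.cardmarket.com/en/Magic/Products/Singles/Exodus/Survival-of-the-Fittest']

    def parse(self, response):
        # Hypothetical form data: copy the real fields from the network
        # tab while clicking "Show More Results" in the browser
        yield scrapy.FormRequest(
            'https://www.cardmarket.com/en/Magic/AjaxAction',
            formdata={'args': '...'},  # placeholder, not the real payload
            callback=self.parse_ajax,
        )

    def parse_ajax(self, response):
        # The endpoint answers with XML; <rows> is Base64-encoded HTML
        root = ET.fromstring(response.body)
        html = base64.b64decode(root.findtext('rows')).decode('utf-8')
        fragment = scrapy.Selector(text=html)
        # Selector is a guess; adjust it to the decoded markup
        for name in fragment.css('[class*="seller-name"] > a::text').getall():
            yield {'name': name}

The <newPage> value in the response presumably identifies the next page, so the spider could keep issuing requests until no new page is reported.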