I want to crawl data from sites like this.
To show all offers manually, the "Show More Results" button at the bottom of the page has to be clicked until every offer is displayed. Each click sends an AJAX request to the server, and the response adds more HTML to the page (which is what I want to scrape).
The URL copied from that request looks like this:
https://www.cardmarket.com/en/Magic/AjaxAction
I don't want to navigate away from the starting URL, though; I just want to load more results. The response is also neither JSON nor plain HTML, and always looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<ajaxResponse><rows>PGRpdiBpZ...</rows><newPage>1</newPage></ajaxResponse>
Answers to similar questions usually dealt with a JSON or plain HTML response, or recommended Beautiful Soup, but I am also concerned about crawling speed.
How can I load the missing HTML and extract the data efficiently?
CodePudding user response:
The example below uses selenium, bs4 and pandas. It works smoothly: it uses JavaScript execution to scroll the "Show More Results" button into view, then clicks it repeatedly until no more results can be loaded.
Example:
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

url = 'https://www.cardmarket.com/en/Magic/Products/Singles/Exodus/Survival-of-the-Fittest'
driver.get(url)
time.sleep(5)  # give the first batch of offers time to render

lst = []
while True:
    # Parse whatever is currently rendered in the browser.
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # NOTE: the attribute selector inside [] is missing in this post as written.
    for card in soup.select('[] > a'):
        lst.append({'name': card.text})
    try:
        # Scroll the "Show More Results" button into view, then click it.
        button = driver.find_element(By.XPATH, '//*[@id="loadMore"]/button/span[2]')
        driver.execute_script("arguments[0].scrollIntoView();", button)
        button.click()
        time.sleep(2)  # wait for the AJAX response to be rendered
    except WebDriverException:
        # The button is gone or no longer clickable, so stop loading.
        break

df = pd.DataFrame(lst)
print(df)
Output:
name
0 Lugones
1 odaJoana
2 Arcana-Trieste
3 Arcana-Trieste
4 Impavido
.. ...
145 yoeril
146 JacobMartinNielsen
147 Artia
148 Nanau
149 magictuga
[150 rows x 1 columns]
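Regarding the XML response in the question: the <rows> value starts with PGRpdiBpZ, which is how the Base64 encoding of "<div i" begins, so the payload is most likely the HTML fragment itself, just Base64-encoded. Below is a minimal sketch of decoding it with the standard library only. The sample fragment, the seller names in it, and the helper names decode_rows and LinkTextCollector are made up for illustration; how to request AjaxAction directly (method, parameters, cookies) is not shown in the question and is not covered here.

```python
import base64
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def decode_rows(xml_text: str) -> str:
    """Return the HTML fragment carried (Base64-encoded) in <rows>."""
    # Parse from bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = root.findtext("rows", default="")
    return base64.b64decode(rows).decode("utf-8")


class LinkTextCollector(HTMLParser):
    """Collect the text content of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._buf = []

    def handle_data(self, data):
        if self._in_link:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append("".join(self._buf).strip())
            self._in_link = False


# Made-up fragment standing in for the real payload (an assumption).
sample_html = (
    '<div id="row1"><a href="/u/alice">alice</a></div>'
    '<div id="row2"><a href="/u/bob">bob</a></div>'
)
sample_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<ajaxResponse><rows>"
    + base64.b64encode(sample_html.encode("utf-8")).decode("ascii")
    + "</rows><newPage>1</newPage></ajaxResponse>"
)

collector = LinkTextCollector()
collector.feed(decode_rows(sample_xml))
print(collector.links)
```

If the real payload decodes the same way, the resulting fragment can be handed to whichever parser you prefer, which avoids driving a browser for every batch of offers.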