I am trying to scrape a site with scrapy and selenium.
At first, the scraped output was the raw template text [ {{ certificant.FirstName }} {{ certificant.LastName }} ] instead of actual names.
I thought the page might still be loading, so I added a WebDriverWait for a button to appear before extracting the data, but I still get the same result.
I believe these placeholders come from a template engine that fills in the page dynamically. If so, what should I do to make the scrape actually work?
This is what I have at the moment:
import scrapy
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


class PjFx110Spider(scrapy.Spider):
    name = "pj_fx110"
    ROOT_URL = 'https://fpcanada.ca/findaplanner'
    start_urls = [
        ROOT_URL
    ]

    def __init__(self):
        options = Options()
        # options.add_argument("--headless")
        self.driver = webdriver.Chrome('./chromedriver', options=options)

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.driver.get(response.url)
        WebDriverWait(self.driver, 3600).until(EC.presence_of_element_located((By.ID, 'btnShowResults')))
        lists = response.css('.list-group')
        name = lists.xpath('//*[@id="FPlist"]/div/ul[1]/li/span[1]/text()').extract()
        print(name, '---------lists----------')
Thank you so much for any suggestions and advice.
CodePudding user response:
I will assume you want to obtain the full list of planners (you did not confirm this). You asked for an alternative, so here is one (quite far from what you initially planned, I imagine):
from io import StringIO

import requests
import pandas as pd

# Request headers sent along with the POST to the JSON endpoint.
headers = {
    'authority': 'fpcanada.ca',
    'path': '/WebServices/AptifyToolsServices.asmx/GetAllCertificants',
    'scheme': 'https',
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'content-length': '0',
    'content-type': 'application/json;charset=utf-8',
    'cookie': 'ASP.NET_SessionId=e4tuu2t1lk3dpfbata5zuzbb',
    'origin': 'https://fpcanada.ca',
    'referer': 'https://fpcanada.ca/findaplanner',
    'sec-ch-ua': '"Chromium";v="103", ".Not/A)Brand";v="99"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest'
}

r = requests.post('https://fpcanada.ca/WebServices/AptifyToolsServices.asmx/GetAllCertificants', headers=headers)
# The 'd' field holds a JSON string; wrapping it in StringIO avoids the
# deprecation warning newer pandas versions raise for literal JSON input.
df = pd.read_json(StringIO(r.json()['d']))
df.to_csv('canada_financial_planners.csv')
print(df.head())
This will write a CSV file and print the head of the dataframe, which shows the format of the CSV file, in a minute or so. If you are using a Jupyter notebook, you may need to start it with
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
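A note on the r.json()['d'] step, since it may look odd: the script reads the 'd' key and passes its value to pd.read_json, which suggests the service wraps its result in a 'd' field whose value is itself a JSON string, so it has to be decoded twice. Below is a minimal sketch of that step using only the standard library. The sample payload is made up for illustration, FirstName and LastName are taken from the template placeholders in the question, and the helper name decode_certificants is hypothetical.

```python
import json


def decode_certificants(body_text):
    """Decode a response body shaped like {"d": "<JSON string>"} into a list of dicts."""
    outer = json.loads(body_text)  # first pass: the envelope object
    inner = outer["d"]             # assumed to be a JSON-encoded string, as in the script above
    return json.loads(inner)       # second pass: the actual records


# Made-up sample in the assumed shape; the real response may carry more fields.
sample_records = [
    {"FirstName": "Jane", "LastName": "Doe"},
    {"FirstName": "John", "LastName": "Smith"},
]
sample_body = json.dumps({"d": json.dumps(sample_records)})

for certificant in decode_certificants(sample_body):
    print(certificant["FirstName"], certificant["LastName"])
```

If you only need a few fields and not a dataframe, iterating over the decoded list like this may be enough, and you can skip pandas entirely.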