I'm using the following code to scrape a web page:
import scrapy
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException, StaleElementReferenceException
class JornaleconomicoSpider(scrapy.Spider):
    name = 'jornaleconomico'
    allowed_domains = ['jornaleconomico.pt']
    start_urls = ['https://jornaleconomico.pt/categoria/economia']

    def parse(self, response):
        options = Options()
        driver_path = '###'  # Your Chrome Webdriver Path
        browser_path = '###'  # Your Google Chrome Path
        options.binary_location = browser_path
        options.add_experimental_option("detach", True)
        self.driver = webdriver.Chrome(options=options, executable_path=driver_path)
        self.driver.get(response.url)
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
        wait = WebDriverWait(self.driver, 120, ignored_exceptions=ignored_exceptions)
        self.new_src = None
        self.new_response = None
        i = 0
        while i < 10:
            # click next link
            try:
                element = wait.until(EC.element_to_be_clickable((By.XPATH, '*//div[@]')))
                self.driver.execute_script("arguments[0].click();", element)
                self.new_src = self.driver.page_source
                self.new_response = response.replace(body=self.new_src)
                i += 1
            except TimeoutException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break
        # grab the data
        headlines = self.new_response.xpath('*//h1[@]/a/text()').extract()
        for headline in headlines:
            yield {
                'text': headline
            }
The code above is supposed to click Ver mais artigos (See More Articles) 10 times and collect the text of all the headlines, but it only returns the original nine. I checked the page source in the Selenium-driven Chrome window (using options.add_experimental_option("detach", True) to keep the window open) and found that it is identical to the source of the original page, before the clicks. I did not expect this, since in that same Selenium window I can inspect all the articles, not just the first nine, and even WebDriverWait does not prevent it. How can I solve this?
CodePudding user response:
You don't actually need Selenium for this site, because its data is very easy to fetch directly. Here is what I would do if I needed data from it.
Testing with Postman:
POST https://domain.pt/wp-admin/admin-ajax.php
content-type: application/x-www-form-urlencoded; charset=UTF-8
pragma: no-cache
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36
x-requested-with: XMLHttpRequest
action=je_pagination&nonce=f2e925cd72&je_offset=9&je_term=economia
The first 9 records, with their links, are already in the page HTML. Pagination can be done with the Postman request above: change 'je_offset' to 9, 18, 27, etc. and update the 'nonce'.
Every time you load the page, you need to get a new 'nonce' from the HTML. The website includes the following script on every page; try using re.search to get the 'ajax_nonce' value.
<script type='text/javascript' id='je-main-js-extra'>
/* <![CDATA[ */
var ajax_object = {"ajax_url":"https:\/\/domain.pt\/wp-admin\/admin-ajax.php","ajax_nonce":"f2e925cd72"};
/* ]]> */
</script>
Try loading the page with requests.get and paginating with requests.post. This should make the job much easier and much faster than Selenium.
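The two steps described above (read the nonce, then step the offset) can be sketched roughly as follows. This is only an illustration: the helper names extract_nonce and pagination_payloads are hypothetical, the regular expression assumes the inline script keeps the format shown above, and the page size of 9 is taken from the offsets in the Postman sample.

```python
import re


def extract_nonce(page_html):
    """Return the ajax_nonce value found in the page HTML, or None if absent.

    Assumes the inline script has the form shown above:
    var ajax_object = {..., "ajax_nonce": "<value>"};
    """
    match = re.search(r'"ajax_nonce"\s*:\s*"([^"]+)"', page_html)
    return match.group(1) if match else None


def pagination_payloads(nonce, term="economia", page_size=9, pages=10):
    """Yield one form payload per "Ver mais artigos" request.

    je_offset takes the values 9, 18, 27, ... as described above.
    """
    for page in range(1, pages + 1):
        yield {
            "action": "je_pagination",
            "nonce": nonce,
            "je_offset": page * page_size,
            "je_term": term,
        }


# The script line from the sample above, used here as stand-in page HTML.
sample = r'var ajax_object = {"ajax_url":"https:\/\/domain.pt\/wp-admin\/admin-ajax.php","ajax_nonce":"f2e925cd72"};'
nonce = extract_nonce(sample)
offsets = [payload["je_offset"] for payload in pagination_payloads(nonce, pages=3)]
```

Each payload would then be sent as the form body of the POST request, with a fresh nonce read from the page every time it is loaded.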
CodePudding user response:
Here is the (almost) complete solution:
from json import loads, dumps
from requests import get, post
from lxml.html import fromstring
from re import search, sub, findall
headerz = {
    "accept": "application/json, text/javascript, */*; q=0.01",
    "accept-language": "en-US,en;q=0.9",
    "sec-ch-ua": "'Chromium';v='106', 'Google Chrome';v='106', 'Not;A=Brand';v='99'",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "'Linux'",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-site",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
url = "https://jornaleconomico.pt/categoria/economia"
pag_href = "https://jornaleconomico.pt/wp-admin/admin-ajax.php"
page_count = 0
r = get(url)
html = fromstring(r.content.decode())
# inline script that carries the ajax_nonce value
rawnonce = html.xpath("//script[@id='je-main-js-extra']/text()")
# print first 9 records
for p in html.xpath("//div[contains(@class,'je-posts-container')]//h1[contains(@class,'je-post-title')]/a"):
    ptitle = p.xpath("./text()")
    if isinstance(ptitle, list):
        post_title = ptitle[0]
    post_href = p.xpath("./@href")[0]
    print(post_href)
# pagination
while True:
    page_count += 9  # advance the offset by one page of 9 posts
    pag_params = {
        "action": "je_pagination",
        "nonce": "",  # the ajax_nonce parsed from rawnonce goes here (left empty in this version)
        "je_offset": page_count,
        "je_term": "economia"
    }
    r = post(pag_href, headers=headerz, data=pag_params)
    jdata = r.json()
    if (jdata and 'data' in jdata):
        # the response carries the next posts as an HTML fragment
        jdata = jdata['data']['posts']
        html = fromstring(jdata)
        for p in html.xpath("//h1[contains(@class,'je-post-title')]/a"):
            ptitle = p.xpath("./text()")
            if isinstance(ptitle, list):
                post_title = ptitle[0]
            post_href = p.xpath("./@href")[0]
            print(post_href)
    else:
        # no more data: stop paginating
        break
The output looks like:
https://jornaleconomico.pt/noticias/ministro-das-financas-diz-que-o-governo-esta-a-acompanhar-de-forma-atenta-inflacao-dos-produtos-alimentares-989146
https://jornaleconomico.pt/noticias/bancos-amortizam-antecipadamente-pagamento-dos-ltro-ao-bce-no-valor-de-16-mil-milhoes-989098
https://jornaleconomico.pt/noticias/je-podcast-ouca-aqui-as-noticias-mais-importantes-desta-terca-feira-51-988633
https://jornaleconomico.pt/noticias/prestacao-da-casa-sobe-quase-200-euros-para-creditos-de-150-mil-euros-a-6-meses-989134
https://jornaleconomico.pt/noticias/portugal-2020-atinge-85-de-execucao-e-116-de-compromisso-ate-dezembro-989132
https://jornaleconomico.pt/noticias/crescimento-do-pib-de-67-da-mais-confianca-para-desempenho-de-2023-diz-fernando-medina-989124
https://jornaleconomico.pt/noticias/apesar-dos-reforcos-salario-minimo-portugues-continua-a-meio-da-tabela-na-europa-988979
https://jornaleconomico.pt/noticias/atividade-turistica-dormidas-aumentaram-863-face-a-2021-988958
https://jornaleconomico.pt/noticias/producao-industrial-cresceu-25-em-dezembro-988947
https://jornaleconomico.pt/noticias/pib-cresce-36-na-ue-e-35-na-zona-euro-988921
https://jornaleconomico.pt/noticias/fundo-soberano-da-noruega-regista-maiores-perdas-desde-2008-988905
https://jornaleconomico.pt/noticias/economia-do-reino-unido-e-a-unica-do-g7-com-perspectivas-de-crescimento-negativo-988900
https://jornaleconomico.pt/noticias/economia-portuguesa-cresceu-67-em-2022-988868
https://jornaleconomico.pt/noticias/revista-de-imprensa-nacional-as-noticias-que-estao-a-marcar-esta-terca-feira-48-988814
https://jornaleconomico.pt/noticias/economia-chinesa-com-fortes-perspetivas-de-crescimento-988855
https://jornaleconomico.pt/noticias/fmi-reve-em-alta-as-previsoes-globais-de-crescimento-global-para-2023-e-agradece-a-china-988823
https://jornaleconomico.pt/noticias/alemanha-vendas-a-retalho-registam-a-maior-queda-desde-abril-de-2021-988817
https://jornaleconomico.pt/noticias/je-bom-dia-ine-divulga-dados-sobre-a-inflacao-e-a-economia-988416
https://jornaleconomico.pt/noticias/economia-francesa-cresce-26-em-2022-988767
https://jornaleconomico.pt/noticias/topo-da-agenda-o-que-nao-pode-perder-nos-mercados-e-na-economia-esta-terca-feira-31-988687
https://jornaleconomico.pt/noticias/auditoria-da-igf-ao-sifide-deteta-319-milhoes-de-euros-em-credito-fiscal-indevido-988699
https://jornaleconomico.pt/noticias/ministerio-das-infraestruturas-esta-a-acompanhar-subida-de-precos-das-operadoras-988697
https://jornaleconomico.pt/noticias/economistas-preveem-crescimento-do-pib-entre-66-e-68-em-2022-988681
https://jornaleconomico.pt/noticias/queda-do-pib-em-cadeia-na-alemanha-faz-soar-alarmes-de-recessao-na-zona-euro-de-novo-988583
https://jornaleconomico.pt/noticias/je-podcast-ouca-aqui-as-noticias-mais-importantes-desta-segunda-feira-49-988067
https://jornaleconomico.pt/noticias/jmj-investimentos-da-igreja-do-governo-e-dos-municipios-somam-pelo-menos-155-milhoes-de-euros-988637
https://jornaleconomico.pt/noticias/da-energia-europeia-a-economia-chinesa-veja-as-escolhas-da-semana-no-mercados-em-acao-988544
https://jornaleconomico.pt/noticias/riscos-de-uma-nova-moeda-comum-para-brasil-e-argentina-ouca-o-podcast-atlantic-connection-988395
https://jornaleconomico.pt/noticias/sindicatos-reunem-se-hoje-com-governo-para-tentar-evitar-greve-na-cp-e-ip-988622
https://jornaleconomico.pt/noticias/fundo-europeu-para-os-media-e-informacao-abre-novos-concursos-988564
https://jornaleconomico.pt/noticias/pt2020-portugal-entre-paises-que-mais-executam-fundos-europeus-988590
https://jornaleconomico.pt/noticias/maiores-bancos-espanhois-preparam-se-para-contestar-taxa-sobre-lucros-caidos-do-ceu-988545
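Since the question asks for the headline text rather than the links, here is a minimal sketch of pulling both out of an HTML fragment such as the one returned in the 'posts' field. To keep it dependency-free it uses the standard library's html.parser instead of lxml, so it is a swapped-in technique rather than part of the solution above. The class name HeadlineParser and the function parse_headlines are hypothetical, and the 'je-post-title' class is assumed from the XPath used earlier.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect (title, href) pairs from <h1 class="... je-post-title ..."><a href="...">title</a></h1>."""

    def __init__(self):
        super().__init__()
        self.in_title_h1 = False   # True while inside a matching <h1>
        self.current_href = None   # href of the <a> being read, if any
        self.current_text = []     # text chunks collected inside that <a>
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and "je-post-title" in (attrs.get("class") or ""):
            self.in_title_h1 = True
        elif tag == "a" and self.in_title_h1 and self.current_href is None:
            self.current_href = attrs.get("href") or ""
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.headlines.append((title, self.current_href))
            self.current_href = None
        elif tag == "h1":
            self.in_title_h1 = False


def parse_headlines(fragment):
    """Return a list of (title, href) tuples found in an HTML fragment."""
    parser = HeadlineParser()
    parser.feed(fragment)
    parser.close()
    return parser.headlines


# Made-up fragment for illustration only; not real site output.
sample_fragment = (
    '<div><h1 class="je-post-title"><a href="https://example.com/a"> First </a></h1>'
    '<h1 class="other"><a href="/x">skip</a></h1></div>'
)
headlines = parse_headlines(sample_fragment)
```

In a Scrapy spider, each tuple could be yielded as {'text': title, 'link': href} instead of being printed.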