Hi, I'm trying to scrape the PDF files from http://www.italgiure.giustizia.it/sncass/. Here is my code:
import requests
from bs4 import BeautifulSoup
from lxml import html

r = requests.get('http://www.italgiure.giustizia.it/sncass/')
soup = BeautifulSoup(r.text, 'html.parser')
pdf_list = soup.find_all('a')
print(pdf_list)

search_html = html.fromstring(r.text)
page_link = search_html.xpath('//*[@id="contentData"]/div[2]/div[1]/div/h3/a/span[1]/span')
print(page_link)
results:
[<a href="accessibilita.html" style="text-decoration:none;font-size:80%;color:white" tabindex="0">Accessibilità</a>, <a accesskey="r" name="results" onclick="$(this).next().focus();" tabindex="-2" title="contenuto"></a>, <a accesskey="1" name="card" onclick="$(this).next().focus();" tabindex="-2" title="documento"></a>, <a href="javascript:void(0)" onclick="toTargetDoc($('.toDocument.pdf',$(this)).attr('data-arg'), this)" style="text-decoration:none;color:#440;" tabindex="0"> <span data-arg="filename" data-role="content" title="pdf"></span> <span ><span >Sez.</span> <span data-arg="szdec" data-role="content"></span> <span data-arg="kind" data-role="content"></span><span > - <span data-arg="ssz" data-role="content"></span></span><span >,</span> </span> <span data-arg="tipoprov" data-role="content"></span> <span ><span ><span >n.</span><span data-arg="numcard" data-role="content"></span></span><span data-arg="numdec" data-role="content" style="display:none"></span><span data-arg="numdep" data-role="content" style="display:none"></span> <span ><span > del </span><span data-arg="datdep" data-role="content"></span><span data-arg="ecli" data-role="content" style="font-weight:normal"></span><span data-arg="anno" data-role="content" style="display:none"></span><span >,</span></span> </span> <span ><span >udienza del</span> <span data-arg="datdec" data-role="content"></span><span >,</span></span> <span ><span >Presidente </span><span data-arg="presidente" data-role="content"></span> </span> <span ><span >Relatore </span><span data-arg="relatore" data-role="content"></span> </span> </a>, <a href="javascript:void(0)" onclick="toTargetText($('.toDocument.txt',$(this)).attr('data-arg'))" style="text-decoration:none;color:#440;" tabindex="0"> <span data-arg="testoocr" data-role="content" title="testo ocr"></span> <span data-arg="ocr" data-role="datasubset"> <span data-arg="ocr" data-role="multivaluedcontent">snippet</span> </span> </a>, <a 
href="http://www.italgiure.giustizia.it" style="color:white;" tabindex="0">ItalgiureWeb</a>] []
In the results above I cannot retrieve the web links, which sit in the <span data-arg="/xw... attributes. I also tried selecting the span by class:
pdf_list = soup.find('span', {'class': 'toDocument pdf'})
The HTML (as it appears in the browser) is:
<a href="javascript:void(0)" tabindex="0" onclick="toTargetDoc($('.toDocument.pdf',$(this)).attr('data-arg'), this)" style="text-decoration:none;color:#440;" > <span data-role="content" data-arg="filename" title="pdf"><span data-arg="/xway/application/nif/clean/hc.dll?verbo=attach&db=snciv&id=./20221107/snciv@s50@a2022@[email protected]"><img alt="formato pdf" src="pix/pdf.png"></span></span> <span ><span >Sez.</span> <span data-role="content" data-arg="szdec">QUINTA</span> <span data-role="content" data-arg="kind">CIVILE</span><span >,</span> </span> <span data-role="content" data-arg="tipoprov">Ordinanza</span> <span ><span ><span >n.</span><span data-role="content" data-arg="numcard">32765</span></span><span style="display:none" data-role="content" data-arg="numdec">32765</span><span style="display:none" data-role="content" data-arg="numdep"></span> <span ><span > del </span><span data-role="content" data-arg="datdep">07/11/2022</span><span style="font-weight:normal" data-role="content" data-arg="ecli"> (ECLI:IT:CASS:2022:32765CIV)</span><span style="display:none" data-role="content" data-arg="anno">2022</span><span >,</span></span> </span> <span ><span >udienza del</span> <span data-role="content" data-arg="datdec"><span style="font-weight:normal">19/10/2022</span></span><span >,</span></span> <span ><span >Presidente </span><span data-role="content" data-arg="presidente">PAOLITTO LIBERATO</span> </span> <span ><span >Relatore </span><span data-role="content" data-arg="relatore">DELL'ORFANO ANTONELLA</span> </span> </a>
Please let me know how to approach this. Thanks in advance.
CodePudding user response:
The files come from a POST request to a Solr endpoint, and you need to mimic that request to get the file names. For example:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

query_url = "http://www.italgiure.giustizia.it/sncass/isapi/hc.dll/sn.solr/sn-collection/select?app.query="

payload = {
    "start": "0",
    "rows": "10",
    "q": "((kind:\"snciv\" OR kind:\"snpen\")) AND szdec:\"F\" AND anno:\"2022\"",
    "wt": "json",
    "indent": "off",
    "sort": "pd desc,numdec desc",
    "fl": "id,filename,szdec,kind,ssz,tipoprov,numcard,numdec,numdep,datdep,ecli,anno,datdec,presidente,relatore,testoocr,ocr",
    "hl": "true",
    "hl.snippets": "4",
    "hl.fragsize": "100",
    "hl.fl": "ocr",
    "hl.q": "nomatch AND szdec:\"F\" AND anno:\"2022\"",
    "hl.maxAnalyzedChars": "1000000",
    "hl.simple.pre": "<em class=\"hit\">",
    "hl.simple.post": "</em>",
}

docs = requests.post(query_url, headers=headers, data=payload).json()["response"]["docs"]

base_url = "http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id="
for doc in docs:
    print(f'{base_url}{doc["filename"][0].replace(".pdf", ".clean.pdf")}')
This will get you links to the first 10 .pdf files for FERIALE (szdec "F") in 2022:
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221103/snpen@sF0@a2022@n41566@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221021/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221012/snpen@sF0@a2022@n38545@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220928/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220928/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220926/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220926/snpen@sF0@a2022@[email protected]
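Those attachment URLs can then be fetched directly. A minimal sketch (the helper names `local_name` and `download` are my own, and I'm using only the standard library here): derive a local file name from the `id=` part of the URL and save the PDF to disk.

```python
import os
import urllib.request

def local_name(url: str) -> str:
    # The document id follows "id=" and looks like
    # "./20221103/snpen@...clean.pdf"; keep only its base name.
    doc_id = url.split("id=")[-1]
    return os.path.basename(doc_id)

def download(url: str, out_dir: str = ".") -> str:
    path = os.path.join(out_dir, local_name(url))
    with urllib.request.urlopen(url, timeout=30) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```

You would call `download(link)` for each printed link; be gentle with the server (e.g. sleep between requests).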
To "select" different menu items, edit the query fields. For example, this gets you the first 10 files for UNITE in 2017 (adjust the matching szdec/anno values in "q" as well):
"hl.q": "nomatch AND szdec:\"U\" AND anno:\"2017\""
If you wish to paginate the response, change the value of "start" to, say, 10 to get the next 10 docs:
"start": "10"