Hi, I'm trying to scrape the PDF files from http://www.italgiure.giustizia.it/sncass/. Here is my code:
import requests
from bs4 import BeautifulSoup
from lxml import html

r = requests.get('http://www.italgiure.giustizia.it/sncass/')
soup = BeautifulSoup(r.text, 'html.parser')
pdf_list = soup.find_all('a')
print(pdf_list)

search_html = html.fromstring(r.text)
page_link = search_html.xpath('//*[@id="contentData"]/div[2]/div[1]/div/h3/a/span[1]/span')
print(page_link)
results:
[<a href="accessibilita.html" style="text-decoration:none;font-size:80%;color:white" tabindex="0">Accessibilità</a>, <a accesskey="r" name="results" onclick="$(this).next().focus();" tabindex="-2" title="contenuto"></a>, <a accesskey="1" name="card" onclick="$(this).next().focus();" tabindex="-2" title="documento"></a>, <a href="javascript:void(0)" onclick="toTargetDoc($('.toDocument.pdf',$(this)).attr('data-arg'), this)" style="text-decoration:none;color:#440;" tabindex="0"> <span data-arg="filename" data-role="content" title="pdf"></span> <span ><span >Sez.</span> <span data-arg="szdec" data-role="content"></span> <span data-arg="kind" data-role="content"></span><span > - <span data-arg="ssz" data-role="content"></span></span><span >,</span> </span> <span data-arg="tipoprov" data-role="content"></span> <span ><span ><span >n.</span><span data-arg="numcard" data-role="content"></span></span><span data-arg="numdec" data-role="content" style="display:none"></span><span data-arg="numdep" data-role="content" style="display:none"></span> <span ><span > del </span><span data-arg="datdep" data-role="content"></span><span data-arg="ecli" data-role="content" style="font-weight:normal"></span><span data-arg="anno" data-role="content" style="display:none"></span><span >,</span></span> </span> <span ><span >udienza del</span> <span data-arg="datdec" data-role="content"></span><span >,</span></span> <span ><span >Presidente </span><span data-arg="presidente" data-role="content"></span> </span> <span ><span >Relatore </span><span data-arg="relatore" data-role="content"></span> </span> </a>, <a href="javascript:void(0)" onclick="toTargetText($('.toDocument.txt',$(this)).attr('data-arg'))" style="text-decoration:none;color:#440;" tabindex="0"> <span data-arg="testoocr" data-role="content" title="testo ocr"></span> <span data-arg="ocr" data-role="datasubset"> <span data-arg="ocr" data-role="multivaluedcontent">snippet</span> </span> </a>, <a 
href="http://www.italgiure.giustizia.it" style="color:white;" tabindex="0">ItalgiureWeb</a>] []
In the results above I cannot retrieve the web links, which sit in the <span data-arg="/xw... attributes. I also tried selecting the span by class:
pdf_list = soup.find('span', {'class': 'toDocument pdf'})
The HTML (as it appears in the browser) is:
<a href="javascript:void(0)" tabindex="0" onclick="toTargetDoc($('.toDocument.pdf',$(this)).attr('data-arg'), this)" style="text-decoration:none;color:#440;" > <span data-role="content" data-arg="filename" title="pdf"><span data-arg="/xway/application/nif/clean/hc.dll?verbo=attach&db=snciv&id=./20221107/snciv@s50@a2022@[email protected]"><img alt="formato pdf" src="pix/pdf.png"></span></span> <span ><span >Sez.</span> <span data-role="content" data-arg="szdec">QUINTA</span> <span data-role="content" data-arg="kind">CIVILE</span><span >,</span> </span> <span data-role="content" data-arg="tipoprov">Ordinanza</span> <span ><span ><span >n.</span><span data-role="content" data-arg="numcard">32765</span></span><span style="display:none" data-role="content" data-arg="numdec">32765</span><span style="display:none" data-role="content" data-arg="numdep"></span> <span ><span > del </span><span data-role="content" data-arg="datdep">07/11/2022</span><span style="font-weight:normal" data-role="content" data-arg="ecli"> (ECLI:IT:CASS:2022:32765CIV)</span><span style="display:none" data-role="content" data-arg="anno">2022</span><span >,</span></span> </span> <span ><span >udienza del</span> <span data-role="content" data-arg="datdec"><span style="font-weight:normal">19/10/2022</span></span><span >,</span></span> <span ><span >Presidente </span><span data-role="content" data-arg="presidente">PAOLITTO LIBERATO</span> </span> <span ><span >Relatore </span><span data-role="content" data-arg="relatore">DELL'ORFANO ANTONELLA</span> </span> </a>
Please let me know how to approach this. Thanks in advance.
CodePudding user response:
The files come from a POST request to a Solr endpoint, and you need to mimic that request to get the file names. For example:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

query_url = "http://www.italgiure.giustizia.it/sncass/isapi/hc.dll/sn.solr/sn-collection/select?app.query="

payload = {
    "start": "0",
    "rows": "10",
    "q": "((kind:\"snciv\" OR kind:\"snpen\")) AND szdec:\"F\" AND anno:\"2022\"",
    "wt": "json",
    "indent": "off",
    "sort": "pd desc,numdec desc",
    "fl": "id,filename,szdec,kind,ssz,tipoprov,numcard,numdec,numdep,datdep,ecli,anno,datdec,presidente,relatore,testoocr,ocr",
    "hl": "true",
    "hl.snippets": "4",
    "hl.fragsize": "100",
    "hl.fl": "ocr",
    "hl.q": "nomatch AND szdec:\"F\" AND anno:\"2022\"",
    "hl.maxAnalyzedChars": "1000000",
    "hl.simple.pre": "<em class=\"hit\">",
    "hl.simple.post": "</em>",
}

docs = requests.post(query_url, headers=headers, data=payload).json()["response"]["docs"]

base_url = "http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id="
for doc in docs:
    print(f'{base_url}{doc["filename"][0].replace(".pdf", ".clean.pdf")}')
This will get you links to the first 10 .pdf files for FERIALE (szdec "F") in 2022:
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221103/snpen@sF0@a2022@n41566@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221021/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221012/snpen@sF0@a2022@n38545@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220928/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220928/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220926/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220926/snpen@sF0@a2022@[email protected]
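Those attachment URLs can then be fetched directly. A minimal sketch (the helper names `local_name` and `download` are my own, and I'm using only the standard library here): derive a local file name from the `id=` part of the URL and save the PDF to disk.

```python
import os
import urllib.request

def local_name(url: str) -> str:
    # The document id follows "id=" and looks like
    # "./20221103/snpen@...clean.pdf"; keep only its base name.
    doc_id = url.split("id=")[-1]
    return os.path.basename(doc_id)

def download(url: str, out_dir: str = ".") -> str:
    path = os.path.join(out_dir, local_name(url))
    with urllib.request.urlopen(url, timeout=30) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```

You would call `download(link)` for each printed link; be gentle with the server (e.g. sleep between requests).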
To "select" different menu items, edit the query fields. For example, this gets you the first 10 files for UNITE in 2017 (adjust the matching szdec/anno values in "q" as well):
"hl.q": "nomatch AND szdec:\"U\" AND anno:\"2017\""
If you wish to paginate the response, change the value of "start" to, say, 10 to get the next 10 docs:
"start": "10"