I am trying to scrape a website and am successful for the first page. However, I am not managing to scrape the next page.
So far, I am using requests and BeautifulSoup and am getting the content of the first page with the following code:
import requests
from bs4 import BeautifulSoup as soup  # BeautifulSoup aliased as soup, since it is called as soup(...) below

r = requests.get(url)
data = soup(r.content, 'html.parser')
This returns some lovely HTML. The pagination and referrer information I get from it is the following:
<div>
<form action="/arcinsys/recherchePagingSelect.action" id="headerPagingForm" method="get" name="headerPagingForm">
<input name="_csrf" type="hidden" value="45cc8dd5-2869-4327-957e-83ffcbe08fba"/>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=1" id="pId1">
<img id="pagingtestid1" src="/arcinsys/images/aktion_first_w.png"/>
</span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=1" id="pId2">
<img id="pagingtestid2" src="/arcinsys/images/aktion_prev_w.png"/>
</span>
<span><input id="pageposition" maxlength="6" name="pageposition" size="6" style="width: 50px" type="text" value="1"/>
<button id="formSubmitButton2" title="Seite 1 von 2"> / 2 </button></span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId3">
<img id="pagingtestid3" src="/arcinsys/images/aktion_next_w.png"/>
</span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId4">
<img id="pagingtestid4" src="/arcinsys/images/aktion_last_w.png"/>
</span>
</form>
</div>
I can tell that I am only on page 1 of 2, but how do I get to page 2? I have managed to get the session and cookies:
session = requests.session()
cookies = r.cookies
print(session)
for cookie in r.cookies:
    print(cookie)
These are the two results:
<requests.sessions.Session object at 0x0000017E20A7DAF0>
<Cookie JSESSIONID=A4FF49C5577C2A8EFCB0FCD6F2C2D181 for arcinsys.hessen.de/arcinsys>
I also have the referrer URL in the HTML code above:
data-href="/arcinsys/recherchePaging.action?pagingvalues=2"
I have now tried various ways of passing the session ID, cookies, or referrer along, but nothing has worked so far. I might be doing it wrong, and I am also not sure which way is best. Any help with this is very much appreciated!
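For illustration, one of those attempts looked roughly like this; the Referer header and the hard-coded next-page URL are just my guesses at what to send, not a working solution:
# Illustrative attempt only: reuse the session and cookies and send a Referer header
# while requesting the URL taken from the data-href attribute.
next_url = 'https://arcinsys.hessen.de/arcinsys/recherchePaging.action?pagingvalues=2'
r2 = session.get(next_url, cookies=r.cookies, headers={'Referer': url})
data2 = soup(r2.content, 'html.parser')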
CodePudding user response:
Content is loaded dynamically via an additional request, as you can inspect in your browser's DevTools on the XHR request tab.
You could use a while loop: select the element with id="pId3", which is the next-page button, and compare its data-href to that of the element with id="pId4", which is the last-page button; if they are equal, break your loop:
while True:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')
    ### extract what you need
    if soup.select_one('#pId3').get('data-href') != soup.select_one('#pId4').get('data-href'):
        url = baseUrl + soup.select_one('#pId3').get('data-href')
    else:
        break
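As a side note, instead of concatenating baseUrl and the relative data-href by hand, you could build the absolute URL with urllib.parse.urljoin; this is a minor variation, not part of the snippet above:
from urllib.parse import urljoin

# Same idea as baseUrl + data-href, but urljoin takes care of slashes and relative paths.
url = urljoin(baseUrl, soup.select_one('#pId3').get('data-href'))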
Example
import requests
from bs4 import BeautifulSoup
s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0'}
baseUrl = 'https://arcinsys.hessen.de'
url='https://arcinsys.hessen.de/arcinsys/einfachsuchen.action?pageName=einfachesuche&methodName=einfach&rechercheBean.defaultfield=&rechercheBean.defaultfield_widget=wort&rechercheBean.von=&rechercheBean.bis=&rechercheBean.einfacheSucheRadioName=alle&__checkbox_rechercheBean.hasdigi=true'
while True:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')
    # print the signatures found on the current page
    print([sig.text for sig in soup.select('td.cell-signature')])
    if soup.select_one('#pId3').get('data-href') != soup.select_one('#pId4').get('data-href'):
        url = baseUrl + soup.select_one('#pId3').get('data-href')
    else:
        break
Output
['HStAM, 17 d', 'HStAM, 340 Stölzel', 'AdJb, A 216, ...', 'ISG FFM, W2-7, 3200', 'UBA Ffm, Na 49, 116', 'UBA Ffm, Na 62, 335', 'ISG FFM, W2-7, 4150', 'ISG FFM, S3, 30135', 'HStAD, G 37, 4776', 'HStAM, 340 Grimm, Ms 272', 'HStAM, 340 von Schwertzell, 859 d', 'HStAD, G 15 Schotten, B 76', 'HStAM, 311/1, B 59', 'StadtA KS, P 1, 914', 'UBA Ffm, Na 67 , 190', 'LWV-Archiv, B 100-10, 531', 'ISG FFM, W2-7, 3201', 'ISG FFM, W2-7, 3202', 'ISG FFM, W2-7, 2340', 'ISG FFM, W2-7, 1121']
...