I am trying to scrape a website and am successful for the first page. However, I am not managing to scrape the next page.
So far, I am using requests and BeautifulSoup and am getting the content of the first page with the following code:
import requests
from bs4 import BeautifulSoup as soup  # BeautifulSoup aliased as soup, since it is called as soup(...) below

r = requests.get(url)
data = soup(r.content, 'html.parser')
This returns some lovely HTML. The pagination and referrer information I get from it is the following:
<div>
<form action="/arcinsys/recherchePagingSelect.action" id="headerPagingForm" method="get" name="headerPagingForm">
<input name="_csrf" type="hidden" value="45cc8dd5-2869-4327-957e-83ffcbe08fba"/>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=1" id="pId1">
<img id="pagingtestid1" src="/arcinsys/images/aktion_first_w.png"/>
</span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=1" id="pId2">
<img id="pagingtestid2" src="/arcinsys/images/aktion_prev_w.png"/>
</span>
<span><input id="pageposition" maxlength="6" name="pageposition" size="6" style="width: 50px" type="text" value="1"/>
<button id="formSubmitButton2" title="Seite 1 von 2"> / 2 </button></span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId3">
<img id="pagingtestid3" src="/arcinsys/images/aktion_next_w.png"/>
</span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId4">
<img id="pagingtestid4" src="/arcinsys/images/aktion_last_w.png"/>
</span>
</form>
</div>
I can tell that I am only on page 1 of 2, but how do I get to page 2? I have managed to get the session and cookies:
session = requests.session()
cookies = r.cookies
print(session)
for cookie in r.cookies:
    print(cookie)
These are the two results:
<requests.sessions.Session object at 0x0000017E20A7DAF0>
<Cookie JSESSIONID=A4FF49C5577C2A8EFCB0FCD6F2C2D181 for arcinsys.hessen.de/arcinsys>
I also have the referrer URL in the HTML code above:
data-href="/arcinsys/recherchePaging.action?pagingvalues=2"
I have now tried various ways of passing the session ID, cookies, or referrer along, but nothing has worked so far. I might be doing it wrong, and I am also not sure which way is best. Any help with this is very much appreciated!
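For illustration, one of those attempts looked roughly like this; the Referer header and the hard-coded next-page URL are just my guesses at what to send, not a working solution:
# Illustrative attempt only: reuse the session and cookies and send a Referer header
# while requesting the URL taken from the data-href attribute.
next_url = 'https://arcinsys.hessen.de/arcinsys/recherchePaging.action?pagingvalues=2'
r2 = session.get(next_url, cookies=r.cookies, headers={'Referer': url})
data2 = soup(r2.content, 'html.parser')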
CodePudding user response:
Content is loaded dynamically via an additional request, as you can inspect in your browser's DevTools on the XHR request tab.
You could use a while loop: select the element with id="pId3", which is the next-page button, and compare its data-href to that of the element with id="pId4", which is the last-page button; if they are equal, break your loop:
while True:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')
    ### extract what you need
    if soup.select_one('#pId3').get('data-href') != soup.select_one('#pId4').get('data-href'):
        url = baseUrl + soup.select_one('#pId3').get('data-href')
    else:
        break
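As a side note, instead of concatenating baseUrl and the relative data-href by hand, you could build the absolute URL with urllib.parse.urljoin; this is a minor variation, not part of the snippet above:
from urllib.parse import urljoin

# Same idea as baseUrl + data-href, but urljoin takes care of slashes and relative paths.
url = urljoin(baseUrl, soup.select_one('#pId3').get('data-href'))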
Example
import requests
from bs4 import BeautifulSoup
s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0'}
baseUrl = 'https://arcinsys.hessen.de'
url='https://arcinsys.hessen.de/arcinsys/einfachsuchen.action?pageName=einfachesuche&methodName=einfach&rechercheBean.defaultfield=&rechercheBean.defaultfield_widget=wort&rechercheBean.von=&rechercheBean.bis=&rechercheBean.einfacheSucheRadioName=alle&__checkbox_rechercheBean.hasdigi=true'
while True:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')
    # print the signatures found on the current page
    print([sig.text for sig in soup.select('td.cell-signature')])
    if soup.select_one('#pId3').get('data-href') != soup.select_one('#pId4').get('data-href'):
        url = baseUrl + soup.select_one('#pId3').get('data-href')
    else:
        break
Output
['HStAM, 17 d', 'HStAM, 340 Stölzel', 'AdJb, A 216, ...', 'ISG FFM, W2-7, 3200', 'UBA Ffm, Na 49, 116', 'UBA Ffm, Na 62, 335', 'ISG FFM, W2-7, 4150', 'ISG FFM, S3, 30135', 'HStAD, G 37, 4776', 'HStAM, 340 Grimm, Ms 272', 'HStAM, 340 von Schwertzell, 859 d', 'HStAD, G 15 Schotten, B 76', 'HStAM, 311/1, B 59', 'StadtA KS, P 1, 914', 'UBA Ffm, Na 67 , 190', 'LWV-Archiv, B 100-10, 531', 'ISG FFM, W2-7, 3201', 'ISG FFM, W2-7, 3202', 'ISG FFM, W2-7, 2340', 'ISG FFM, W2-7, 1121']
...