How to navigate to the next page with a session id and/or referrer while the URL remains static?

Time:09-04

I am trying to scrape a website and have succeeded for the first page. However, I am not managing to scrape the next page.

So far, I am using requests and BeautifulSoup and am getting the content from the first page by using the following code:

import requests
from bs4 import BeautifulSoup as soup  # assumed alias: the snippet calls soup(...)
r = requests.get(url)
data = soup(r.content, 'html.parser')

This returns some lovely HTML. The information about pages and the referrer that I get is the following:

<div>
<form action="/arcinsys/recherchePagingSelect.action" id="headerPagingForm" method="get" name="headerPagingForm">
<input name="_csrf" type="hidden" value="45cc8dd5-2869-4327-957e-83ffcbe08fba"/>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=1" id="pId1">
<img id="pagingtestid1" src="/arcinsys/images/aktion_first_w.png"/>
</span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=1" id="pId2">
<img id="pagingtestid2" src="/arcinsys/images/aktion_prev_w.png"/>
</span>
<span><input id="pageposition" maxlength="6" name="pageposition" size="6" style="width: 50px" type="text" value="1"/>
<button id="formSubmitButton2" title="Seite 1 von 2">  / 2 </button></span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId3">
<img id="pagingtestid3" src="/arcinsys/images/aktion_next_w.png"/>
</span>
<span data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId4">
<img id="pagingtestid4" src="/arcinsys/images/aktion_last_w.png"/>
</span>
</form>
</div>

I can tell that I am only on page 1 of 2, but how do I get to page 2? I have managed to get the session and cookies:

session = requests.session()
cookies = r.cookies

print(session)
for cookie in r.cookies:
    print(cookie)

This prints the following two results:

<requests.sessions.Session object at 0x0000017E20A7DAF0>
<Cookie JSESSIONID=A4FF49C5577C2A8EFCB0FCD6F2C2D181 for arcinsys.hessen.de/arcinsys>

I also have the referrer URL from the HTML code above:

data-href="/arcinsys/recherchePaging.action?pagingvalues=2"

I have now tried various ways of passing the session id, cookies or referrer along, but nothing has worked so far. I might be doing it wrong, and I am also not sure which way is best. Any help with this is very much appreciated!

CodePudding user response:

The content is loaded dynamically via an additional request, which you can inspect in the XHR tab of your browser's DevTools.

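Send every request through the same requests session so that the JSESSIONID cookie set by the first response is reused. You could then use a while loop: select the element with id="pId3", which is the next-page button, and compare its data-href to that of the element with id="pId4", which is the last-page button. If the two are equal, break out of the loop: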

while True:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')

    # extract what you need here

    if soup.select_one('#pId3').get('data-href') != soup.select_one('#pId4').get('data-href'):
        url = baseUrl + soup.select_one('#pId3').get('data-href')
    else:
        break
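
One caveat with comparing the two buttons: in the two-page markup from the question, pId3 and pId4 both point to pagingvalues=2 while page 1 is displayed, so the comparison would end the loop before page 2 is fetched. A possible variant, sketched below with the standard library only, resolves the next link with urljoin and stops once that link is missing or leads to a page that was already visited. The helper name plan_next_url and the hrefs_by_url table are made up for illustration, and the sketch assumes that the last page either has no next link or links back to itself, which has not been checked against the live site.

```python
from urllib.parse import urljoin


def plan_next_url(current_url, next_href, visited):
    """Return the absolute URL of the next page, or None when paging is done.

    next_href is the data-href of the "next" element (id="pId3" in the
    question's markup); visited is the set of URLs fetched so far.
    """
    if not next_href:
        return None  # no next link at all
    candidate = urljoin(current_url, next_href)
    if candidate in visited:
        return None  # the next link points at a page we already have
    return candidate


# Simulated walk over two pages instead of real HTTP requests. The href for
# page 1 is taken from the question; the href for page 2 is an assumption.
start = "https://arcinsys.hessen.de/arcinsys/einfachsuchen.action"
page_two = "https://arcinsys.hessen.de/arcinsys/recherchePaging.action?pagingvalues=2"
hrefs_by_url = {
    start: "/arcinsys/recherchePaging.action?pagingvalues=2",
    page_two: "/arcinsys/recherchePaging.action?pagingvalues=2",
}

visited, url = set(), start
while url is not None:
    visited.add(url)  # in a real scraper: fetch and parse the page here
    url = plan_next_url(url, hrefs_by_url[url], visited)

print(sorted(visited))
```

In a real scraper, the lookup in hrefs_by_url would be replaced by soup.select_one('#pId3').get('data-href') on the page fetched through the shared session.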

Example

import requests
from bs4 import BeautifulSoup

s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0'}
baseUrl = 'https://arcinsys.hessen.de'
url='https://arcinsys.hessen.de/arcinsys/einfachsuchen.action?pageName=einfachesuche&methodName=einfach&rechercheBean.defaultfield=&rechercheBean.defaultfield_widget=wort&rechercheBean.von=&rechercheBean.bis=&rechercheBean.einfacheSucheRadioName=alle&__checkbox_rechercheBean.hasdigi=true'

while True:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')

    # print the signature column of the current result page
    print([cell.text for cell in soup.select('td.cell-signature')])

    if soup.select_one('#pId3').get('data-href') != soup.select_one('#pId4').get('data-href'):
        url = baseUrl + soup.select_one('#pId3').get('data-href')
    else:
        break

Output

['HStAM, 17 d', 'HStAM, 340 Stölzel', 'AdJb, A 216, ...', 'ISG FFM, W2-7, 3200', 'UBA Ffm, Na 49, 116', 'UBA Ffm, Na 62, 335', 'ISG FFM, W2-7, 4150', 'ISG FFM, S3, 30135', 'HStAD, G 37, 4776', 'HStAM, 340 Grimm, Ms 272', 'HStAM, 340 von Schwertzell, 859 d', 'HStAD, G 15 Schotten, B 76', 'HStAM, 311/1, B 59', 'StadtA KS, P 1, 914', 'UBA Ffm, Na 67 , 190', 'LWV-Archiv, B 100-10, 531', 'ISG FFM, W2-7, 3201', 'ISG FFM, W2-7, 3202', 'ISG FFM, W2-7, 2340', 'ISG FFM, W2-7, 1121']
...
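
As an alternative to following the next button, the page count could be read from the pager button's title, which is "Seite 1 von 2" in the question's markup, and the paging URLs built directly. This is only a sketch: the function name paging_urls is made up, and it assumes that recherchePaging.action accepts any page number in pagingvalues and that the URLs are requested through the same session as the initial search.

```python
import re

# Matches the pager title from the question, e.g. "Seite 1 von 2".
PAGER_TITLE = re.compile(r"Seite\s+(\d+)\s+von\s+(\d+)")


def paging_urls(button_title, base="https://arcinsys.hessen.de"):
    """Build the paging URLs for every page after the current one."""
    match = PAGER_TITLE.search(button_title)
    if match is None:
        return []  # title not in the expected format
    current, total = int(match.group(1)), int(match.group(2))
    return [
        f"{base}/arcinsys/recherchePaging.action?pagingvalues={page}"
        for page in range(current + 1, total + 1)
    ]


print(paging_urls("Seite 1 von 2"))
```

The title would come from soup.select_one('#formSubmitButton2').get('title') on the first result page.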