Extracting pagination number using BeautifulSoup

Time:12-25

I'm trying to extract the pagination number from a webpage and have tried several methods, all to no avail.

What is the right method? Please also explain why the following methods do not extract the information as requested.

First method:

for i in range(0, 48, 24):
    url = f'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=STATION^1712&maxPrice=500000&radius=0.5&sortType=10&propertyTypes=&mustHave=&dontShow=&index={i}&furnishTypes=&keywords='
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    page = soup.select('span[]')
    print(page)

returns:

[]
[]

I've also tried:

1. page = soup.find('span', {'data-bind':'text: total'})
2. page = soup.select("[class~=pagination-pageInfo]")

Both return nothing, whereas

page = soup.select('span', {'data-bind':'text: total'})

returns a bunch of unrelated elements rather than the pagination number.

How do I get the pagination number at the bottom of the page? Expected output:

1
2

CodePudding user response:

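The pagination element is not in the HTML that requests receives, because the page fills it in with JavaScript after loading. BeautifulSoup only sees the raw HTML, which is why find() and select() come back empty. As for the last attempt, select() takes a CSS selector as its first argument and does not treat a second positional argument as an attribute filter, so soup.select('span', {...}) simply matches every span on the page. You have two options: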

  1. Use Selenium to render the page in a real browser, then search for the element as you already do (with the span[] selector).
  2. Keep using requests: all of the page data, including the pagination, is embedded as JSON near the bottom of the page HTML, and you can pull it out with a regular expression. Full code:
import json
import re

import requests

for i in range(0, 48, 24):
    url = f'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=STATION^1712&maxPrice=500000&radius=0.5&sortType=10&propertyTypes=&mustHave=&dontShow=&index={i}&furnishTypes=&keywords='
    r = requests.get(url)
    html = r.text
    # The page data is assigned to window.jsonModel inside a <script> tag.
    match = re.search(r'window\.jsonModel = (.*)</script>', html)
    if match is None:
        # No embedded JSON found (blocked request or changed markup).
        print(f'No jsonModel found for index={i}')
        continue
    full_data_json = json.loads(match.group(1))
    print(full_data_json["pagination"]["page"])
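
To check the extraction step without hitting the site, the same regex-plus-JSON idea can be run against a small HTML string. This is only a sketch: SAMPLE_HTML is a made-up, trimmed-down stand-in for the real page, and extract_page_number is a hypothetical helper name, not something from requests or BeautifulSoup. The shape of the real JSON (for example, whether "page" is a string or a number) is an assumption here.

```python
import json
import re

# Hypothetical stand-in for the page HTML; the real page is far larger
# and its JSON may have a different shape.
SAMPLE_HTML = (
    '<html><body><div id="results"></div>'
    '<script>window.jsonModel = {"pagination": {"page": "2", "total": 5}}</script>'
    '</body></html>'
)

# Same pattern as in the answer above.
JSON_MODEL_RE = re.compile(r'window\.jsonModel = (.*)</script>')


def extract_page_number(html):
    """Return pagination.page from the embedded JSON, or None if it is absent."""
    match = JSON_MODEL_RE.search(html)
    if match is None:
        return None
    data = json.loads(match.group(1))
    return data.get("pagination", {}).get("page")


if __name__ == "__main__":
    print(extract_page_number(SAMPLE_HTML))
    print(extract_page_number("<html></html>"))
```

Returning None when the pattern is missing makes it easier to tell "the site changed or blocked the request" apart from a parsing bug, instead of failing with an AttributeError on .group(1).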