How to find the total number of pages on a website with BeautifulSoup?


Context: I'm working on pagination for this website: https://skoodos.com/schools-in-uttarakhand. When I inspected it, the site doesn't show the total number of pages anywhere; there is only a "Next" button, which appends ?page=2 to the URL. Also, searching the HTML for page-link gave me the number 20 at the end, so I assumed the total number of pages was 20, but on checking manually I learnt that only 11 pages exist.

After many trials and errors, I finally decided to just index from 0 up to 12 (12 itself being excluded by Python's range).
What I want to know is: how would you go about figuring out the number of pages on a website that doesn't show the actual page count, only Previous and Next buttons, and how can I optimize my code for this?

Here's my solution to pagination. Is there any way to optimize this other than manually finding the number of pages?

from myWork.commons import url_parser, write


def data_fetch(url):
    school_info = []

    for page_number in range(0, 4):

        next_web_page = url + f'?page={page_number}'
        soup = url_parser(next_web_page)
        search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')

        # rest of the code
    for page_number in range(4, 12):

        next_web_page = url + f'?page={page_number}'
        soup = url_parser(next_web_page)
        search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')

        # rest of the code


def main():
    url = "https://skoodos.com/schools-in-uttarakhand"
    data_fetch(url)


if __name__ == "__main__":
    main()

CodePudding user response:

There's a bit at the top that says "Showing the 217 results as per selected criteria". You can write code to extract the number from that, then count the number of results per page and divide by that to get the expected number of pages (don't forget to round up).
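
For instance, with the numbers currently on that page (217 total results and 20 results per page, both of which you'd extract rather than hard-code), the arithmetic gives exactly the 11 pages you counted manually:

import math
math.ceil(217 / 20)  # -> 11 expected pages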

If you want to double check, add more code to go to the calculated last page and

  • if there's no such page, keep decrementing the total and checking until you hit a page that exists
  • if there is such a page, but it has an active/enabled "Next" button, keep going to Next page until reaching the last (basically as you are now)

(Remember that the two listed above are contingencies and wouldn't be executed in an ideal scenario.)

So, just to find the number of pages, you could do something like:

import requests
from bs4 import BeautifulSoup
import math

def soupFromUrl(scrapeUrl):
  req = requests.get(scrapeUrl)
  if req.status_code == 200:
    return BeautifulSoup(req.text, 'html.parser')
  else:
    raise Exception(f'{req.reason} - failed to scrape {scrapeUrl}')


def getPageTotal(url):
  soup = soupFromUrl(url)

  #totalResults = int(soup.find('label').get_text().split('(')[-1].split(')')[0])  
  totalResults = int(soup.p.strong.get_text()) # both searches should work
  perPageResults = len(soup.select('.m-show')) #probably always 20

  print(f'{perPageResults} of {totalResults} results per page') 
  if not (perPageResults > 0 and totalResults > 0):
    return 0 
  
  lastPageNum = math.ceil(totalResults/perPageResults)

  # Contingencies - will hopefully never be needed
  lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
  if lpSoup.select('.m-show'): #page exists
    while lpSoup.select_one('a[rel="next"]'):
      nextLink = lpSoup.select_one('a[rel="next"]')['href']
      lastPageNum = int(nextLink.split('page=')[-1])
      lpSoup = soupFromUrl(nextLink)
  else: #page does not exist
    while not (lpSoup.select('.m-show') or lastPageNum < 1): 
      lastPageNum = lastPageNum - 1
      lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
  # end Contingencies section
  
  return lastPageNum
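
If you do want to keep your original structure, a rough sketch of plugging getPageTotal into your data_fetch - assuming your url_parser helper behaves as in your question, and that the site serves the same content for ?page=1 as for the bare URL:

def data_fetch(url):
    school_info = []
    last_page = getPageTotal(url)  # calculated as shown above
    for page_number in range(1, last_page + 1):
        next_web_page = url + f'?page={page_number}'
        soup = url_parser(next_web_page)
        search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
        # rest of the code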

However, it looks like you only want the total pages in order to start the for-loop, but it's not even necessary to use a for-loop at all - a while-loop might be better:

def data_fetch(url):
  school_info = []  
  nextUrl = url
  while nextUrl:
    soup = soupFromUrl(nextUrl)

    #GET YOUR DATA FROM PAGE

    nextHL = soup.select_one('a[rel="next"]')
    nextUrl = nextHL.get('href') if nextHL else None 
  
  # code after fetching all pages' data

Although, you could still use a for-loop if you had a maximum page number in mind:

def data_fetch(url, maxPages):
  school_info = []  

  for p in range(1, maxPages + 1):
    soup = soupFromUrl(f'{url}?page={p}')
    if not soup.select('.m-show'):
      break

    #GET YOUR DATA FROM PAGE 
  
  # code after fetching all pages' data [upto max]
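
And if you'd rather not hard-code that maximum, one way (assuming both functions above are in scope) is to feed it from getPageTotal:

url = 'https://skoodos.com/schools-in-uttarakhand'
data_fetch(url, maxPages=getPageTotal(url))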

CodePudding user response:

Each of your pages (except the last one) will have an element like this:

<a class="page-link"
href="https://skoodos.com/schools-in-uttarakhand?page=2"
rel="next">Next »</a>

E.g. you can extract the link as follows (here for the first page):

link = soup.find('a', class_='page-link', href=True, rel='next')
print(link['href'])
https://skoodos.com/schools-in-uttarakhand?page=2

So, you could make your function recursive. E.g. use something like this:

import requests
from bs4 import BeautifulSoup

def data_fetch(url, results=None):
    # avoid the mutable-default-argument pitfall: fresh list per top-level call
    if results is None:
        results = []
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, 'lxml')
    
    search_results = soup.find('section', {'id': 'search-results'})\
        .find(class_='container').find(class_='row')
    results.append(search_results)
    
    link = soup.find('a', class_='page-link', href=True, rel='next')
    
    # link will be `None` for last page (i.e. `page=11`)
    if link:
        # just adding some prints to show progress of iteration
        if 'page' not in url:
            print('getting page: 1', end=', ')
        url = link['href']
        # subsequent page nums being retrieved
        print(f'{url.rsplit("=", maxsplit=1)[1]}', end=', ')
        
        # recursive call
        return data_fetch(url, results)
    else:
        # `page=11` with no link, we're done
        print('done')
        
    return results

url = 'https://skoodos.com/schools-in-uttarakhand'
data = data_fetch(url)

So, a call to this function will print progress as:

getting page: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, done

And you'll end up with data containing 11 bs4.element.Tag objects, one for each page.

print(len(data))
11
print(set([type(d) for d in data]))
{<class 'bs4.element.Tag'>}
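
Side note on the recursive approach: each page costs a call-stack frame, and CPython's default recursion limit is about 1000, so 11 pages is nowhere near a problem; for a site with thousands of pages, though, the while-loop shown in the other answer would be the safer pattern. You can check the limit with:

import sys
print(sys.getrecursionlimit())  # 1000 by default in CPython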

Good luck with extracting the required info; the site is very slow, and the HTML is particularly sloppy and inconsistent. (E.g., you're right to note there is a page-link element that suggests 20 pages, but its visibility is set to hidden, so apparently it's just a piece of deprecated/unused markup.)
