Paginating pages using things other than numbers in python


I am trying to paginate a scraper on my university's website. Here is the URL for one of the pages:

https://www.bu.edu/com/profile/david-abel/

where david-abel is a first name followed by a last name. (It would be first-middle-last if a middle name were given, which poses a problem, since my code currently only finds first and last names.) I have a plan to deal with middle names, but my question is:

How do I go about appending names from my firstnames and lastnames lists to my base URL to get a corresponding URL in the layout above?

import requests
from bs4 import BeautifulSoup

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []


html = BeautifulSoup(data.text, 'html.parser')

professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

for name in my_data:
    x = name.split()
    split_names.append(x)

for name in split_names:
    firstnames.append(name[0])
    lastnames.append(name[-1])

# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl   


print(firstnames)
print(lastnames)
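As a side note on the middle-name plan mentioned above: str.split() returns a list you can index from both ends, so first/last extraction can tolerate an optional middle name. A small sketch, assuming names are whitespace-separated as they appear on the profile cards:

```python
# Splitting tolerates an optional middle name: take the ends, keep the rest.
parts = "John Q Public".split()
first, last = parts[0], parts[-1]
middle = parts[1:-1]  # empty list when there is no middle name

print(first, last, middle)
```

For a two-part name like "David Abel", parts[1:-1] is simply an empty list.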

CodePudding user response:

This simple modification should give you what you want; let me know if you have any more questions or if anything needs to be changed!

# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl + "-".join(name)
    print(newurl)

Even better:

for name in split_names:
    profile_url = f"https://www.bu.edu/com/profile/{'-'.join(name)}"
    print(profile_url)
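One caveat: the example URL in the question (.../profile/david-abel/) is all lowercase with a trailing slash, while the scraped names are capitalized. If the server does not redirect, lowercasing the joined slug should match that layout (an assumption worth verifying against a live profile):

```python
# Build a profile URL matching the lowercase layout shown in the question.
# join handles any number of parts, so middle names work too.
name = ["David", "Abel"]
slug = "-".join(name).lower()  # "david-abel"
profile_url = f"https://www.bu.edu/com/profile/{slug}/"
```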

As for the pagination part, this should work and is not hard coded. Let's say that new faculty join and there are now 9 pages. This code should still work in that case.

url = 'https://www.bu.edu/com/profiles/faculty/page'
with requests.get(f"{url}/1") as response:
    soup = BeautifulSoup(response.text, 'html.parser')
    # select the pagination numbers shown, e.g. [2, 3, 7, Next] (omit the Next link)
    page_numbers = [int(n.text) for n in soup.select("a.page-numbers")[:-1]]
    # take the min and max for the pagination range
    start_page, stop_page = min(page_numbers), max(page_numbers) + 1

# loop through pages
# loop through the pages
for page in range(start_page, stop_page):
    with requests.get(f"{url}/{page}") as response:
        soup = BeautifulSoup(response.text, 'html.parser')
        professors = soup.select('h4.profile-card__name')
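To tie the two parts together, the professor names scraped on each page can be mapped straight to profile URLs with a small helper (a hypothetical function name, and it assumes the lowercase, trailing-slash slug layout from the question holds for every faculty member):

```python
def profile_urls(professor_names, base="https://www.bu.edu/com/profile/"):
    # Hypothetical helper: turn "First [Middle] Last" strings into the
    # lowercase, dash-separated profile URL layout from the question.
    return [base + "-".join(name.split()).lower() + "/" for name in professor_names]

print(profile_urls(["David Abel", "John Q Public"]))
```

Inside the page loop above, you could call it as profile_urls(p.text for p in professors) and extend one running list of URLs.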

I believe this is the best and most concise way to solve your problem. As a tip, you should use with when making requests: it takes care of closing the response for you, and you don't have to pollute the namespace with names like resp1, resp2, etc. As mentioned above, f-strings are amazing and super easy to use.
