I am trying to paginate a scraper on my university's website. Here is the URL for one of the pages:
https://www.bu.edu/com/profile/david-abel/
where david-abel is a first name followed by a last name. (It would be first-middle-last if a middle name were given, which poses a problem since my code currently only finds first and last names.) I have a plan to deal with middle names, but my question is:
How do I go about appending the names from my firstnames and lastnames lists to my base URL to get a corresponding URL in the layout above?
import requests
from bs4 import BeautifulSoup

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []

html = BeautifulSoup(data.text, 'html.parser')
professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

for name in my_data:
    x = name.split()
    split_names.append(x)

for name in split_names:
    f, l = zip(*split_names)
    firstnames.append(f)
    lastnames.append(l)

# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl

print(firstnames)
print(lastnames)
CodePudding user response:
This simple modification should give you what you want. Let me know if you have any more questions or if anything needs to be changed!
# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl + "-".join(name)
    print(newurl)
Even better:
for name in split_names:
    profile_url = f"https://www.bu.edu/com/profile/{'-'.join(name)}"
    print(profile_url)
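One detail worth checking: the scraped names are capitalized ("David Abel"), while the example profile URL uses a lowercase slug (david-abel), so lowercasing the joined parts is probably needed. A minimal sketch, assuming the slug is simply the lowercased, hyphen-joined parts of the displayed name (the helper name and the second sample name are hypothetical):

def profile_url_for(name_parts):
    # Hypothetical helper: assumes BU slugs are the lowercased,
    # hyphen-joined parts of the displayed name.
    slug = "-".join(part.lower() for part in name_parts)
    return f"https://www.bu.edu/com/profile/{slug}/"

print(profile_url_for(["David", "Abel"]))       # https://www.bu.edu/com/profile/david-abel/
print(profile_url_for(["Jane", "Q", "Public"])) # middle names are joined in automatically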
As for the pagination part, this should work and is not hard-coded. Let's say new faculty join and there are now 9 pages; this code should still work in that case.
url = 'https://www.bu.edu/com/profiles/faculty/page'

with requests.get(f"{url}/1") as response:
    soup = BeautifulSoup(response.text, 'html.parser')

# select the pagination numbers shown, e.g. [2, 3, 7, Next] (omit the Next link)
page_numbers = [int(n.text) for n in soup.select("a.page-numbers")[:-1]]

# take the min and max for the pagination range
start_page, stop_page = min(page_numbers), max(page_numbers) + 1

# loop through the pages
for page in range(start_page, stop_page):
    with requests.get(f"{url}/{page}") as response:
        soup = BeautifulSoup(response.text, 'html.parser')
        professors = soup.select('h4.profile-card__name')
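Putting the two pieces together, here is a minimal end-to-end sketch, assuming the h4.profile-card__name and a.page-numbers selectors above are accurate and the slug pattern is the lowercased, hyphen-joined name. (One caveat: if the current page isn't rendered as an a.page-numbers link while you're on it, min(page_numbers) may come back as 2; starting the range at 1 is a safe tweak.)

import requests
from bs4 import BeautifulSoup

base = "https://www.bu.edu/com/profiles/faculty/page"

# discover the page range from the pagination links on page 1
with requests.get(f"{base}/1/") as response:
    soup = BeautifulSoup(response.text, "html.parser")
page_numbers = [int(a.text) for a in soup.select("a.page-numbers")[:-1]]
start_page, stop_page = min(page_numbers), max(page_numbers) + 1

profile_urls = []
for page in range(start_page, stop_page):
    with requests.get(f"{base}/{page}/") as response:
        soup = BeautifulSoup(response.text, "html.parser")
    for professor in soup.select("h4.profile-card__name"):
        # build the profile slug from the displayed name
        slug = "-".join(professor.text.split()).lower()
        profile_urls.append(f"https://www.bu.edu/com/profile/{slug}/")

print(profile_urls)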
I believe this is the best and most concise way to solve your problem. Just as a tip, you should use with when making requests, as it takes care of a lot of issues for you and you don't have to pollute the namespace with things like resp1, resp2, etc. As mentioned above, f-strings are amazing and super easy to use.
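As a small aside, the same with pattern works one level up too: a requests.Session used as a context manager reuses one connection pool across all the page fetches and closes it when the block exits. An optional sketch (the page numbers here are placeholders):

import requests

with requests.Session() as session:
    # one Session keeps the connection alive between page fetches
    for page in (1, 2, 3):  # placeholder page numbers
        response = session.get(f"https://www.bu.edu/com/profiles/faculty/page/{page}/")
        print(page, response.status_code)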