I am trying to paginate a scraper on my university's website. Here is the URL for one of the pages:
https://www.bu.edu/com/profile/david-abel/
where david-abel is a first name followed by a last name. (It would be first-middle-last if a middle name were given, which poses a problem since my code currently only finds first and last names.) I have a plan to deal with middle names, but my question is:
How do I go about appending the names from my firstnames and lastnames lists to my base URL to get a corresponding URL in the layout above?
import requests
from bs4 import BeautifulSoup

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []

html = BeautifulSoup(data.text, 'html.parser')
professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

for name in my_data:
    x = name.split()
    split_names.append(x)

for name in split_names:
    f, l = zip(*split_names)
    firstnames.append(f)
    lastnames.append(l)

# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl

print(firstnames)
print(lastnames)
CodePudding user response:
This simple modification should give you what you want. Let me know if you have any more questions or if anything needs to be changed!
# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl + "-".join(name)
    print(newurl)
Even better:
for name in split_names:
    profile_url = f"https://www.bu.edu/com/profile/{'-'.join(name)}"
    print(profile_url)
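One detail worth checking: the scraped names are capitalized ("David Abel"), while the example profile URL uses a lowercase slug (david-abel), so lowercasing the joined parts is probably needed. A minimal sketch, assuming the slug is simply the lowercased, hyphen-joined parts of the displayed name (the helper name and the second sample name are hypothetical):

def profile_url_for(name_parts):
    # Hypothetical helper: assumes BU slugs are the lowercased,
    # hyphen-joined parts of the displayed name.
    slug = "-".join(part.lower() for part in name_parts)
    return f"https://www.bu.edu/com/profile/{slug}/"

print(profile_url_for(["David", "Abel"]))       # https://www.bu.edu/com/profile/david-abel/
print(profile_url_for(["Jane", "Q", "Public"])) # middle names are joined in automatically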
As for the pagination part, this should work and is not hard-coded. Let's say new faculty join and there are now 9 pages; this code should still work in that case.
url = 'https://www.bu.edu/com/profiles/faculty/page'

with requests.get(f"{url}/1") as response:
    soup = BeautifulSoup(response.text, 'html.parser')

# select the pagination numbers shown, e.g. [2, 3, 7, Next] (omit the Next link)
page_numbers = [int(n.text) for n in soup.select("a.page-numbers")[:-1]]

# take the min and max for the pagination range
start_page, stop_page = min(page_numbers), max(page_numbers) + 1

# loop through the pages
for page in range(start_page, stop_page):
    with requests.get(f"{url}/{page}") as response:
        soup = BeautifulSoup(response.text, 'html.parser')
        professors = soup.select('h4.profile-card__name')
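Putting the two pieces together, here is a minimal end-to-end sketch, assuming the h4.profile-card__name and a.page-numbers selectors above are accurate and the slug pattern is the lowercased, hyphen-joined name. (One caveat: if the current page isn't rendered as an a.page-numbers link while you're on it, min(page_numbers) may come back as 2; starting the range at 1 is a safe tweak.)

import requests
from bs4 import BeautifulSoup

base = "https://www.bu.edu/com/profiles/faculty/page"

# discover the page range from the pagination links on page 1
with requests.get(f"{base}/1/") as response:
    soup = BeautifulSoup(response.text, "html.parser")
page_numbers = [int(a.text) for a in soup.select("a.page-numbers")[:-1]]
start_page, stop_page = min(page_numbers), max(page_numbers) + 1

profile_urls = []
for page in range(start_page, stop_page):
    with requests.get(f"{base}/{page}/") as response:
        soup = BeautifulSoup(response.text, "html.parser")
    for professor in soup.select("h4.profile-card__name"):
        # build the profile slug from the displayed name
        slug = "-".join(professor.text.split()).lower()
        profile_urls.append(f"https://www.bu.edu/com/profile/{slug}/")

print(profile_urls)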
I believe this is the best and most concise way to solve your problem. Just as a tip, you should use with when making requests, as it takes care of a lot of issues for you and you don't have to pollute the namespace with things like resp1, resp2, etc. As mentioned above, f-strings are amazing and super easy to use.
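As a small aside, the same with pattern works one level up too: a requests.Session used as a context manager reuses one connection pool across all the page fetches and closes it when the block exits. An optional sketch (the page numbers here are placeholders):

import requests

with requests.Session() as session:
    # one Session keeps the connection alive between page fetches
    for page in (1, 2, 3):  # placeholder page numbers
        response = session.get(f"https://www.bu.edu/com/profiles/faculty/page/{page}/")
        print(page, response.status_code)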