I'm trying to scrape Yellow Pages, but my code only takes the first business on each page and skips every other business on that page, e.g. the 1st company of page 1, the 1st company of page 2, etc. I have no clue why it isn't first iterating through the 'web_page' variable, then checking for additional pages, and finally reaching the closing condition and executing `break`. If anyone can provide clues or help, it would be highly appreciated!
    import requests
    from bs4 import BeautifulSoup as bs

    # NOTE: `headers` is assumed to be a dict of request headers defined elsewhere
    web_page_results = []

    def yellow_pages_scraper(search_term, location):
        page = 1
        while True:
            url = f'https://www.yellowpages.com/search?search_terms={search_term}&geo_location_terms={location}&page={page}'
            r = requests.get(url, headers=headers)
            soup = bs(r.content, 'html.parser')
            web_page = soup.find_all('div', {'class': 'search-results organic'})
            for business in web_page:
                business_dict = {}
                try:
                    business_dict['name'] = business.find('a', {'class': 'business-name'}).text
                    print(f'{business_dict["name"]}')
                except AttributeError:
                    business_dict['name'] = ''
                try:
                    business_dict['street_address'] = business.find('div', {'class': 'street-address'}).text
                except AttributeError:
                    business_dict['street_address'] = ''
                try:
                    business_dict['locality'] = business.find('div', {'class': 'locality'}).text
                except AttributeError:
                    business_dict['locality'] = ''
                try:
                    business_dict['phone'] = business.find('div', {'class': 'phones phone primary'}).text
                except AttributeError:
                    business_dict['phone'] = ''
                try:
                    business_dict['website'] = business.find('a', {'class': 'track-visit-website'})['href']
                except (AttributeError, TypeError):  # subscripting None raises TypeError
                    business_dict['website'] = ''
                try:
                    web_page_results.append(business_dict)
                    print(web_page_results)
                except Exception:
                    print('saving not working')
            # If the last iterated page doesn't have a "next page" button, break the loop and return the list
            if not soup.find('a', {'class': 'next ajax-page'}):
                break
            page += 1
        return web_page_results
CodePudding user response:
It's worth looking at this line:
    web_page = soup.find_all('div', {'class':'search-results organic'})
When I go to the request URL, I can find only one instance of search-results organic on the page. You then iterate over the list (web_page), but it contains only one value. So when you run the for loop:
    for business in web_page:
it executes only once, because the list has a single item, and each find() call inside the loop returns just the first match within that container. That is why you only get the first result on each page.
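The effect can be reproduced offline. The snippet below is a minimal sketch on a made-up HTML fragment; the markup and business names are assumptions that only mimic the structure described above, not a copy of the real page. It shows that the container selector yields a one-element list, and that find() on that container returns only the first business name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one results container holding three listings.
SAMPLE_HTML = """
<div class="search-results organic">
  <div class="srp-listing"><a class="business-name">Alpha Plumbing</a></div>
  <div class="srp-listing"><a class="business-name">Bravo Plumbing</a></div>
  <div class="srp-listing"><a class="business-name">Charlie Plumbing</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same selector as in the question: it matches the single container div.
containers = soup.find_all("div", {"class": "search-results organic"})
container_count = len(containers)

# find() stops at the first match, so each container yields one name only.
first_names = [c.find("a", {"class": "business-name"}).text for c in containers]

print(container_count)
print(first_names)
```

With only one container per page, the loop body runs once per page, which matches the "first company of each page" behaviour you describe.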
You need to loop through the list of businesses on the page, not the container holding the business listings. I recommend creating the list from class='srp-listing':
    web_page = soup.find_all('div', {'class':'srp-listing'})
This should give you a list of all the businesses on the page, so iterating over it goes through every listing rather than just the first one.
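As a rough sketch of how the per-listing loop could look, here is one way to parse a page once the selector is changed. The helper names `text_or_blank` and `parse_listings` are hypothetical, the sample HTML is invented for illustration, and the class names are taken from your code rather than verified against the live site. The helper replaces the repeated try/except blocks with a single None check.

```python
from bs4 import BeautifulSoup


def text_or_blank(parent, tag, css_class):
    """Return the stripped text of the first matching child, or '' if absent."""
    node = parent.find(tag, {"class": css_class})
    return node.get_text(strip=True) if node else ""


def parse_listings(html):
    """Build one dict per 'srp-listing' div found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for business in soup.find_all("div", {"class": "srp-listing"}):
        link = business.find("a", {"class": "track-visit-website"})
        results.append({
            "name": text_or_blank(business, "a", "business-name"),
            "street_address": text_or_blank(business, "div", "street-address"),
            "locality": text_or_blank(business, "div", "locality"),
            # A single class name matches any element that carries it.
            "phone": text_or_blank(business, "div", "phones"),
            "website": link.get("href", "") if link else "",
        })
    return results


# Invented sample: the second listing has no phone, locality or website.
SAMPLE_HTML = """
<div class="search-results organic">
  <div class="srp-listing">
    <a class="business-name">Alpha Plumbing</a>
    <div class="street-address">1 Main St</div>
    <div class="locality">Springfield, IL 62701</div>
    <div class="phones phone primary">(555) 010-0001</div>
    <a class="track-visit-website" href="https://alpha.example">Website</a>
  </div>
  <div class="srp-listing">
    <a class="business-name">Bravo Plumbing</a>
    <div class="street-address">2 Oak Ave</div>
  </div>
</div>
"""

rows = parse_listings(SAMPLE_HTML)
for row in rows:
    print(row)
```

In your scraper, `parse_listings(r.content)` would take the place of the inner for loop, with the "next page" check and the page increment left as they are in the while loop.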