Yellow Pages Python web scraping stuck on first iteration

Time:01-07

I'm trying to scrape Yellow Pages, but my code only picks up the first business on each page and skips every other business on that page (e.g. the first company on page 1, then the first company on page 2, and so on). I can't see why it doesn't first iterate through everything in the 'web_page' variable, then check for additional pages, and finally reach the closing condition and execute 'break'. Any clues or help would be highly appreciated!

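import requests
from bs4 import BeautifulSoup as bs

# 'headers' (the request headers dict) is assumed to be defined elsewhere; it is not shown in the question
web_page_results = []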
def yellow_pages_scraper(search_term, location):
    page = 1
    while True:
        url = f'https://www.yellowpages.com/search?search_terms={search_term}&geo_location_terms={location}&page={page}'
        r = requests.get(url, headers = headers)
        soup = bs(r.content, 'html.parser')
        web_page = soup.find_all('div', {'class':'search-results organic'})
        for business in web_page:
            business_dict = {}
            try:
                business_dict['name'] = business.find('a', {'class':'business-name'}).text
                print(f'{business_dict["name"]}')
            except AttributeError:
                business_dict['name'] = ''
            try:
                business_dict['street_address'] = business.find('div', {'class':'street-address'}).text
            except AttributeError:
                business_dict['street_address'] = ''
            try:
                business_dict['locality'] = business.find('div', {'class':'locality'}).text
            except AttributeError:
                business_dict['locality'] = ''
            try:
                business_dict['phone'] = business.find('div', {'class':'phones phone primary'}).text
            except AttributeError:
                business_dict['phone'] = ''
            try:
                business_dict['website'] = business.find('a', {'class':'track-visit-website'})['href']
            except (AttributeError, TypeError):  # subscripting None (no match) raises TypeError
                business_dict['website'] = ''
            try:
                web_page_results.append(business_dict)
                print(web_page_results)
            except:
                print('saving not working')
        
        # If the last iterated page doesn't have a "next page" button, break the loop and return the list
        if not soup.find('a', {'class': 'next ajax-page'}):
            break
        page += 1

    return web_page_results

Answer:

It's worth looking at this line:

web_page = soup.find_all('div', {'class':'search-results organic'})

When I open the request URL, I can find only one element with the class 'search-results organic' on the page. You then iterate over the list (web_page), but that list contains only one item. So when you run the for loop:

for business in web_page:

the body runs only once, because the list holds a single item. Inside that single container, each find() call returns only the first match, so you get just the first result on the page.
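
A minimal sketch of that behaviour, using a made-up HTML fragment rather than the real Yellow Pages markup (the fragment and the business names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one results container holding several listings.
SAMPLE_HTML = """
<div class="search-results organic">
  <div class="srp-listing"><a class="business-name">Alpha Plumbing</a></div>
  <div class="srp-listing"><a class="business-name">Beta Plumbing</a></div>
  <div class="srp-listing"><a class="business-name">Gamma Plumbing</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same selector as in the question: it matches the single container.
containers = soup.find_all("div", {"class": "search-results organic"})
print(len(containers))  # 1

# find() stops at the first match, so one container yields one name.
first_names = [c.find("a", {"class": "business-name"}).text for c in containers]
print(first_names)  # ['Alpha Plumbing']
```

The loop body runs once per element in the list, so with one container you get one business regardless of how many listings it holds.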

You need to loop over the individual business listings on the page, not over the container that holds them. I recommend building the list from the elements with class='srp-listing':

web_page = soup.find_all('div', {'class':'srp-listing'})

This should give you a list of all the businesses on the page, so iterating over it will go through every listing rather than just the first one.
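
As a rough sketch of how the per-page parsing might look with that change: the helper names text_or_empty and parse_listings are hypothetical, the HTML is an invented sample, and the real page structure may differ, so treat this as an illustration rather than a drop-in replacement.

```python
from bs4 import BeautifulSoup


def text_or_empty(parent, tag, css_class):
    """Return the stripped text of the first matching child, or '' if absent."""
    node = parent.find(tag, {"class": css_class})
    return node.get_text(strip=True) if node else ""


def parse_listings(html):
    """Build one dict per 'srp-listing' element found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Iterate over the listings themselves, not the container around them.
    for business in soup.find_all("div", {"class": "srp-listing"}):
        results.append({
            "name": text_or_empty(business, "a", "business-name"),
            "street_address": text_or_empty(business, "div", "street-address"),
        })
    return results


# Hypothetical sample: the second listing has no street address.
SAMPLE_HTML = """
<div class="search-results organic">
  <div class="srp-listing">
    <a class="business-name">Alpha Plumbing</a>
    <div class="street-address">1 Main St</div>
  </div>
  <div class="srp-listing">
    <a class="business-name">Beta Plumbing</a>
  </div>
</div>
"""

listings = parse_listings(SAMPLE_HTML)
print([b["name"] for b in listings])
```

Checking for a missing node in one small helper also replaces the repeated try/except blocks, which keeps the loop body short.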
