Python3 script skips pages when scraping a website with beautifulsoup


I am trying to scrape Glassdoor's reviews of Microsoft using Python3 and BeautifulSoup. The code mostly works as intended, but it randomly skips some pages and I cannot figure out why. My code looks like this:

from bs4 import BeautifulSoup
import requests
import time
import csv

# Set a counter
i=1

# specify the URL of the website you want to scrape
url = "https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P" + str(i) + ".htm?filter.iso3Language=eng"
while True:
    i = i + 1
    page = requests.get(url)
    # if page.status_code != 200:
    #   break
    url = "https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P" + str(i) + ".htm?filter.iso3Language=eng"
    # make a GET request to the website and retrieve the HTML content
    response = requests.get(url)
    time.sleep(0.5)
    html = response.content
    
    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    
    for category, match in zip(soup.find_all("p", class_="mb-0 strong"), 
                            soup.find_all("p", class_="mt-0 mb-0 pb v2__EIReviewDetailsV2__bodyColor v2__EIReviewDetailsV2__lineHeightLarge v2__EIReviewDetailsV2__isExpanded")):
        reviews = match.span.text
        proscons = category.text

        print(proscons)
        print(reviews)
        print(i)
        print()

    if i>10:
        break
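(As an aside, building the page URL with an f-string avoids the easy-to-miss concatenation operators. The `review_url` helper below is purely illustrative, not part of the original code:)

```python
# Hypothetical helper: builds the Glassdoor review URL for a given page number
def review_url(page: int) -> str:
    return (
        f"https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P{page}"
        ".htm?filter.iso3Language=eng"
    )

print(review_url(3))
# -> https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P3.htm?filter.iso3Language=eng
```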

And the output looks like this:

Pros
Respect for employee needs, holidays generally calm and time off respected
3

[Skipped page 2, but all is as expected until page 4]

Cons
The Tech stack is narrow. Limited career opportunities.
4

Pros
The culture is VERY good
7

[Pages 5 and 6 were also skipped]

The behaviour seems to be completely random, as when I re-run the same code, different pages are parsed, while others are skipped.
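One way to see what is happening on the skipped pages is to check whether each response actually contains review markup: sites like Glassdoor often serve a bot-check page with a 200 status to clients that do not send a browser-like User-Agent, which would make pages appear to vanish at random. A minimal sketch of such a check (the block-page HTML below is a made-up example, not Glassdoor's real markup):

```python
from bs4 import BeautifulSoup

# Made-up examples of a blocked response and a normal review page
blocked_html = "<html><body><h1>Help Us Protect Glassdoor</h1></body></html>"
review_html = '<html><body><div class="gdReview"><p>Pros</p></div></body></html>'

def looks_blocked(html):
    # Treat a page with no review containers as blocked/empty
    soup = BeautifulSoup(html, "html.parser")
    return not soup.select("div.gdReview")

print(looks_blocked(blocked_html))  # True
print(looks_blocked(review_html))   # False
```

Logging this flag alongside the page number for each request would show whether the "skipped" pages are blocked responses rather than parsing failures.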

Many thanks in advance for your help!

CodePudding user response:

It is a bit unclear to me which data values you want to scrape, but you can try the next example and see whether it meets your expectations.

Code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0'}

data = []
for page in range(1,51):
    res = requests.get(f"https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P{page}.htm?filter.iso3Language=eng", headers = headers)
    #print(res)
    soup = BeautifulSoup(res.content, "html.parser")

    for review in soup.select('div.gdReview'):

        data.append({
            # NOTE: the two selectors without data-test attributes are
            # best-effort guesses; adjust them to the live markup if needed
            "review_title": review.select_one('h2 > a').get_text(strip=True),
            "pros": review.select_one('span[data-test="pros"]').text,
            'cons': review.select_one('span[data-test="cons"]').text,
            "review_decision": review.select_one('div[class*="helpful" i]').text
        })

df = pd.DataFrame(data)
df.to_csv('out.csv', index=False)
#print(df)
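If some pages still come back blocked intermittently even with the User-Agent header, a simple retry with a pause between attempts can help. This is only a sketch; `fetch_with_retry` and its injected `get` callable are illustrative additions, not part of the answer above:

```python
import time

def fetch_with_retry(url, get, retries=3, delay=1.0):
    """Call get(url) up to `retries` times, pausing between attempts.

    The `get` callable is passed in (e.g. requests.Session().get with headers
    set) so the sketch can be exercised without network access.
    """
    resp = None
    for attempt in range(retries):
        resp = get(url)
        if resp.status_code == 200:
            return resp
        time.sleep(delay * (attempt + 1))
    return resp

# Stubbed demonstration: the fake server fails twice, then succeeds
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

codes = iter([503, 503, 200])
resp = fetch_with_retry("https://example.com",
                        lambda url: FakeResponse(next(codes)),
                        delay=0.01)
print(resp.status_code)  # 200
```

In the real loop you would pass `lambda u: requests.get(u, headers=headers)` as `get`, and treat a still-failing page as blocked rather than silently skipping it.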

Output:

            review_title  ...                           review_decision
0    Great company to work with to grow your archit...  ...  Be the first to find this review helpful
1                          Thoughts after 10 years....  ...     2172 people found this review helpful
2                                        Great company  ...  Be the first to find this review helpful
3                                        Great company  ...  Be the first to find this review helpful
4                          Fair employment environment  ...  Be the first to find this review helpful
..                                                 ...  ...                                       ...
495                                  Microsoft reviews  ...  Be the first to find this review helpful
496  Good place to coast; annoying place for engine...  ...  Be the first to find this review helpful
497                                            Not bad  ...  Be the first to find this review helpful
498                        Great Company/Great Culture  ...  Be the first to find this review helpful
499                                   Liked everything  ...  Be the first to find this review helpful

[500 rows x 4 columns]