Scrape all Glassdoor reviews of a firm


With the answer provided by @Driftr95 (really big thanks!), I managed to scrape. However, my final goal is to scrape all the reviews of this firm. I tried doing this by adapting his code:

import requests
from bs4 import BeautifulSoup

def extract(pg):
    headers = {'user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    # f'https://www.glassdoor.com/Reviews/Google-Engineering-Reviews-EI_IE9079.0,6_DEPT1007_IP{pg}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'

    r = requests.get(url, headers=headers, timeout=(3.05, 27))  # pass headers as a keyword argument
    soup = BeautifulSoup(r.content, 'html.parser')  # returns the whole page as a soup object
    return soup

for j in range(1, 21, 10):
    for i in range(j+1, j+11, 1):  # 3M: 4251 reviews
        soup = extract(i)  # extract() builds the page url from the page number
        print(f' page {i}')
        for r in soup.select('li[id^="empReview_"]'):
            rDet = {'reviewId': r.get('id')}
            for sr in r.select(subRatSel):
                k = sr.select_one('div:first-of-type').get_text(' ').strip()
                sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
                rDet[f'[rating] {k}'] = sval

            for k, sel in refDict.items():
                sval = r.select_one(sel)
                if sval: sval = sval.get_text(' ').strip()
                rDet[k] = sval

            empRevs.append(rDet)

New Issues I faced

  1. The reviews my for-loop collects are not in the same order as they appear on the website, and the results tend to contain duplicates: I scraped 200 reviews in this case, and only 170 of them turned out to be unique, while the website itself has no duplicates (i.e., exactly 200 unique reviews).
  2. I need the total number of reviews (i.e., 4251) or the total number of pages for 3M (I am not sure how to get this, but each page holds about 10 reviews, so there should be at most 426 pages). Is there a simple way to do it by combining my code with @Driftr95's? HTTPS tends to time out after 100 rounds...
  3. Also, in cases where not all of the subratings are available - see this example where there are only 4 of the 6 types of subratings - @Driftr95's code does not work: all four subratings will turn out to be N.A.

I suspect I might need Selenium to resolve the issue...

Thank you so much for all the help offered in the comment section!

CodePudding user response:

All four subratings will turn out to be N.A.

There were some things that I didn't account for because I hadn't encountered them before, but the updated version of getDECstars shouldn't have that issue. (If you use the longer version with the argument isv=True, it's easier to debug and figure out what's missing from the code...)


I scraped 200 reviews in this case, and only 170 of them turned out to be unique

Duplicates are fairly easy to avoid by maintaining a list of reviewIds that have already been added and checking against it before adding a new review to empRevs:

scrapedIds = []
# for...
    # for ###
        # soup = extract...

        # for r in ...
            if r.get('id') in scrapedIds: continue # skip duplicate
            ## rDet = ..... ## AND REST OF INNER FOR-LOOP ##

            empRevs.append(rDet) 
            scrapedIds.append(rDet['reviewId']) # add to list of ids to check against
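Put together with the loop from your question, it might look something like this (just a sketch - extract, subRatSel, getDECstars, refDict, and empRevs are assumed to be defined as before):

scrapedIds = []
for i in range(1, 21):  # pages 1..20 - adjust the range as needed
    soup = extract(i)
    print(f' page {i}')
    for r in soup.select('li[id^="empReview_"]'):
        if r.get('id') in scrapedIds: continue  # skip duplicate

        rDet = {'reviewId': r.get('id')}
        for sr in r.select(subRatSel):
            k = sr.select_one('div:first-of-type').get_text(' ').strip()
            rDet[f'[rating] {k}'] = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)

        for k, sel in refDict.items():
            sval = r.select_one(sel)
            rDet[k] = sval.get_text(' ').strip() if sval else sval

        empRevs.append(rDet)
        scrapedIds.append(rDet['reviewId'])  # remember this id so it can't be re-added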

HTTPS tends to time out after 100 rounds...

You could try adding breaks and switching out user-agents every 50 [or 5 or 10 or...] requests (a rough sketch of that approach is included after the printed output below), but I'm quick to resort to selenium at times like this; this is my suggested solution - just call it like this and pass a url to start with:

## PASTE [OR DOWNLOAD&IMPORT] from https://pastebin.com/RsFHWNnt ##

startUrl = 'https://www.glassdoor.com/Reviews/3M-Reviews-E446.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
scrape_gdRevs(startUrl, 'empRevs_3M.csv', maxScrapes=1000, constBreak=False)

[last 3 lines of] printed output:

 total reviews:  4252
total reviews scraped this run: 4252
total reviews scraped over all time: 4252
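For reference, the "breaks and switching user-agents" approach with plain requests might look roughly like the following - the user-agent strings, pause lengths, and page count below are just placeholder assumptions you'd adjust:

import time, random
import requests
from bs4 import BeautifulSoup

userAgents = [  # any few UA strings to rotate through
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
]
totalPages = 426  # assumed: ~4251 reviews at ~10 per page

for i in range(1, totalPages + 1):
    headers = {'user-agent': userAgents[(i // 50) % len(userAgents)]}  # switch UA every 50 pages
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{i}.htm?filter.iso3Language=eng'
    r = requests.get(url, headers=headers, timeout=(3.05, 27))
    soup = BeautifulSoup(r.content, 'html.parser')
    # ... parse reviews from soup and de-duplicate as above ...
    time.sleep(random.uniform(30, 60) if i % 50 == 0 else random.uniform(1, 3))  # longer break every 50 pages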

It clicks through the pages until it reaches the last page (or maxes out maxScrapes). You do have to log in at the beginning though, so either fill out login_to_gd with your username and password, or log in manually by replacing the login_to_gd(driverG) line with the input(...) line that waits for you to log in [then press ENTER in the terminal] before continuing.

I think cookies can also be used instead (with requests), but I'm not good at handling that. If you figure it out, you can use some version of linkToSoup or your extract(pg); you'll then have to comment out or remove the lines ending in ## for selenium and uncomment [or follow the instructions in] the lines that end with ## without selenium. [But please note that I've only fully tested the selenium version.]
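If you do want to try the cookie route, a minimal sketch might look like the following - it assumes you copy the session cookies from your browser's developer tools after logging in, and the cookie names shown are only placeholders (paste in whatever Glassdoor actually sets):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'})
session.cookies.update({
    'COOKIE_NAME_1': 'PASTE_VALUE_FROM_BROWSER',  # placeholder - use the actual cookie names/values
    'COOKIE_NAME_2': 'PASTE_VALUE_FROM_BROWSER',  # from your logged-in browser session
})

def extract(pg):
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    r = session.get(url, timeout=(3.05, 27))
    return BeautifulSoup(r.content, 'html.parser')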

The CSVs [like "empRevs_3M.csv" and "scrapeLogs_empRevs_3M.csv" in this example] are updated after every page-scrape, so even if the program crashes [or you decide to interrupt it], it will have saved everything up to the previous scrape. Since it also tries to load from the CSVs at the beginning, you can just continue it later (just set startUrl to the url of the page you want to continue from - but even if it's at page 1, remember that duplicates will be ignored, so it's okay - it'll just waste some time).
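If you continue a run with the requests version instead, one way to rebuild the list of already-scraped ids from the saved CSV might be something like this (a sketch; it assumes the CSV was written from empRevs and therefore has a reviewId column):

import os
import pandas as pd

csvPath = 'empRevs_3M.csv'
scrapedIds = []
if os.path.isfile(csvPath):
    # reload previously scraped ids so the duplicate check skips them on the new run
    scrapedIds = pd.read_csv(csvPath)['reviewId'].dropna().astype(str).tolist()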
