I cannot scrape one element of Glassdoor.com using R

Time:05-04

I am trying to scrape some data from Glassdoor.com for a project. This is the code I have so far to scrape it:

## Load libraries
library(httr)  
library(xml2)  
library(rvest) 
library(purrr) 
library(tidyverse)
library(lubridate)

  # URLS for scraping
  start_url <- "https://www.glassdoor.co.uk/Reviews/Company-Reviews-"
  settings_url <- ".htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
  
  
  ### Scrape Reviews
  map_df(1:1, function(i){
    Sys.sleep(3)
    tryCatch({
      pg_reviews <- read_html(GET(paste(start_url, "E8450", "_P", i, settings_url, sep = "")))
      table = pg_reviews %>% 
        html_elements(".mb-0")
      
      data.frame(date = pg_reviews %>% 
                   html_elements(".middle.common__EiReviewDetailsStyle__newGrey") %>% 
                   html_text2(),
                 
                 summary = pg_reviews %>% 
                   html_elements(".reviewLink") %>% 
                   html_text(),
                 
                 rating = pg_reviews %>%
                   html_elements("#ReviewsFeed .mr-xsm") %>%
                   html_text(),
                 
                 employee_type = pg_reviews %>%
                   html_elements(".eg4psks0") %>%
                   html_text(),
                 
                 pros = pg_reviews %>%
                   html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
                   html_text(),
                 
                 cons = pg_reviews %>%
                   html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(2) span") %>%
                   html_text()
                 
      )}, error = function(e){
        NULL
      })
    
  }) -> reviews_df

Up to this point everything works fine. However, I would also like to scrape the individual sub-ratings (Work/Life Balance, Culture & Values, and so on) that appear on some of the reviews.

But I am really struggling to find the specific element that holds those ratings. I would love to show an attempt, but I am completely lost on this one. I have tried SelectorGadget and inspecting the page, but I cannot work it out.

Any suggestions?

CodePudding user response:

Locating the data

Inspecting the stars in those ratings shows they are in the following HTML structure:

...
<div >
  <ul >
    <li>
      <div>Work/Life Balance</div>
      <div font-size="sm" >
        <span  color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
        <span  color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
        <span  color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
        <span  color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
        <span  color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
      </div>
    </li>
    <li>
      <div>Culture &amp; Values</div>
      <div font-size="sm" >
        ...
        

CSS defines how many of the stars are colored, through the class of the div immediately under 'Work/Life Balance', e.g.:

<div font-size="sm" >

We find the corresponding CSS elsewhere in the document:

<style data-emotion-css="18v8tui">
  .css-18v8tui {
    display: inline-block;
    line-height: 1;
    background: linear-gradient(90deg, #0caa41 40%, #dee0e3 40%);
    -webkit-letter-spacing: 3px;
    -moz-letter-spacing: 3px;
    -ms-letter-spacing: 3px;
    letter-spacing: 3px;
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
  }
</style>

The 40% in the background value colors the first 40% of the div green (#0caa41) and the rest gray (#dee0e3), so 40% of the stars (2 out of 5) light up in this example.
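As a minimal sketch of that conversion, the base R snippet below pulls the first percentage out of a CSS declaration and divides it by 20 (100% spread over 5 stars). The helper name gradient_to_stars is hypothetical and not part of any package; the CSS string is the one from the style block above.

```r
# Hypothetical helper: convert the first "N%" stop in a CSS string to a star count.
gradient_to_stars <- function(css) {
  # First run of digits (optionally with decimals) that is directly followed by "%".
  m <- regmatches(css, regexpr("[0-9]+(\\.[0-9]+)?(?=%)", css, perl = TRUE))
  if (length(m) == 0) return(NA_real_)  # no percentage in this declaration
  as.numeric(m) / 20                    # 100% / 5 stars = 20% per star
}

css <- "background: linear-gradient(90deg, #0caa41 40%, #dee0e3 40%);"
gradient_to_stars(css)
```

For the 40% gradient this gives 2 stars; a declaration without any percentage gives NA.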

Extracting the data

First we load the page

url = "https://www.glassdoor.co.uk/Reviews/PwC-Reviews-E8450.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
pg_reviews = read_html(url)

Then we extract all <style> elements, each one containing a single class in this case. We take the first ...% value in the CSS class, and divide it by 20 to convert from a percentage to a number of stars. We save this number of stars in a named vector, where the name of each field is the name of the corresponding CSS class. This will allow us to map a rating's CSS class to a number of stars.

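class.ratings = c()
styles = pg_reviews %>% html_elements('style')
for(s in styles) {
  # the class name is the data-emotion-css attribute prefixed with 'css-'
  class = s %>% html_attr('data-emotion-css')
  class = paste0('css-', class)
  # first percentage in the CSS text; NA if the style contains none
  rating = str_match(s %>% html_text2(), '(\\d+)%')[2]
  class.ratings[class] = as.numeric(rating)/20
}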

> class.ratings
css-animation-1fypb1g           css-197m635            css-67i7qe 
                   NA                   5.0                   5.0 
           css-3x0lbp            css-hdvrkk            css-8hewl0 
                   NA                   5.0                   5.0 
          css-1x8evti           css-1ohf0ui           css-1htgz7a 
                   NA                    NA                   5.0 
 ...

Not every percentage that we found really correlates to a star-rating, but that's okay.
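If the stray matches are a concern, one possible variation (not what the code above does) is to accept a percentage only when it appears inside a linear-gradient(...) value and to drop the NA entries. The sketch below runs on a small inline HTML stand-in rather than the live page; the class css-abc123 and its rule are made up for illustration, and class_ratings is just a local name for this example.

```r
library(rvest)

# Inline stand-in for the page; the second <style> block is invented.
html <- '<html><head>
<style data-emotion-css="18v8tui">.css-18v8tui{background:linear-gradient(90deg,#0caa41 40%,#dee0e3 40%);}</style>
<style data-emotion-css="abc123">.css-abc123{width:50%;}</style>
</head><body></body></html>'
page <- read_html(html)

styles <- html_elements(page, "style[data-emotion-css]")
css <- html_text(styles)

# Capture the first percentage that sits inside a linear-gradient(...) value.
m <- regmatches(css, regexec("linear-gradient\\([^)]*?([0-9]+)%", css, perl = TRUE))
pct <- vapply(m, function(x) if (length(x) >= 2) as.numeric(x[2]) else NA_real_,
              numeric(1))

# Named lookup: CSS class -> number of stars, without the non-rating styles.
class_ratings <- setNames(pct / 20,
                          paste0("css-", html_attr(styles, "data-emotion-css")))
class_ratings <- class_ratings[!is.na(class_ratings)]
class_ratings
```

Here the width:50% rule is ignored, so only css-18v8tui remains in the lookup. Whether every star-rating style on the real page uses linear-gradient is an assumption based on the single style block shown above.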

Finally we grab all reviews, each in an element with class gdReview. For each review we grab all star-ratings, each in an element with class content, in a li element. For each star-rating we extract the text label and the CSS class for the number of stars. I don't do anything to export the results, just output them to the console:

reviews = pg_reviews %>% html_elements('.gdReview')
for(re in reviews) {
  
  ratings = re %>% html_elements('.content') %>% html_elements('li')
  for(ra in ratings) {
    
    label = ra %>% html_element('div') %>% html_text()
    classes = ra %>% html_elements('div[font-size="sm"]') %>% html_attr('class')
    class = str_split(classes, ' ')[[1]][1] # take the first class attribute
    
    cat(label, class.ratings[class], '\n')
    
  }

  cat('\n')
  
}

output:

Work/Life Balance 5 
Culture & Values 5 
Diversity & Inclusion 5 
Career Opportunities 5 
...

Since not every review contains these star-ratings per subcategory, there will be some empty fields.
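To keep the results instead of printing them, the same loop can collect one row per sub-rating into a data frame. This is only a sketch on an inline HTML stand-in that mimics the structure described above; the class names css-aaa, css-bbb and x1, the lookup values, and the column names are all made up for illustration.

```r
library(rvest)

# Inline stand-in: one review with two sub-ratings, one review without any.
html <- '<html><body>
<div class="gdReview"><div class="content"><ul>
  <li><div>Work/Life Balance</div><div font-size="sm" class="css-aaa x1">stars</div></li>
  <li><div>Culture &amp; Values</div><div font-size="sm" class="css-bbb x1">stars</div></li>
</ul></div></div>
<div class="gdReview"><p>No sub-ratings here</p></div>
</body></html>'
page <- read_html(html)
class_ratings <- c("css-aaa" = 5, "css-bbb" = 2)  # made-up class -> stars lookup

reviews <- html_elements(page, ".gdReview")
rows <- lapply(seq_along(reviews), function(i) {
  items <- html_elements(reviews[[i]], ".content li")
  if (length(items) == 0) return(NULL)           # review without sub-ratings
  label <- html_text(html_element(items, "div")) # first div in each li is the label
  cls <- html_attr(html_element(items, 'div[font-size="sm"]'), "class")
  first_cls <- sub(" .*$", "", cls)              # keep only the first class name
  data.frame(review = i, category = label,
             stars = unname(class_ratings[first_cls]),
             stringsAsFactors = FALSE)
})
ratings_df <- do.call(rbind, rows)  # NULL entries are dropped by rbind
ratings_df
```

Reviews without sub-ratings simply contribute no rows, and the review column (the position of the review on the page) can be used to join the result back onto the main review table.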
