I am trying to scrape some data from Glassdoor.com for a project. This is the code I have so far to scrape it:
## Load libraries
library(httr)
library(xml2)
library(rvest)
library(purrr)
library(tidyverse)
library(lubridate)
# URLS for scraping
start_url <- "https://www.glassdoor.co.uk/Reviews/Company-Reviews-"
settings_url <- ".htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
### Scrape Reviews
map_df(1:1, function(i){
  Sys.sleep(3)
  tryCatch({
    pg_reviews <- read_html(GET(paste(start_url, "E8450", "_P", i, settings_url, sep = "")))
    table = pg_reviews %>%
      html_elements(".mb-0")
    data.frame(date = pg_reviews %>%
                 html_elements(".middle.common__EiReviewDetailsStyle__newGrey") %>%
                 html_text2(),
               summary = pg_reviews %>%
                 html_elements(".reviewLink") %>%
                 html_text(),
               rating = pg_reviews %>%
                 html_elements("#ReviewsFeed .mr-xsm") %>%
                 html_text(),
               employee_type = pg_reviews %>%
                 html_elements(".eg4psks0") %>%
                 html_text(),
               pros = pg_reviews %>%
                 html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
                 html_text(),
               cons = pg_reviews %>%
                 html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(2) span") %>%
                 html_text()
    )}, error = function(e){
      NULL
    })
}) -> reviews_df
Up to this point everything works fine. However, I would also like to scrape the individual ratings on some of the reviews: picture
But I am really struggling to find the specific element that holds those ratings. I would love to suggest my own take, but I am completely lost on this one. I have tried SelectorGadget and inspecting the page, but I cannot find it.
Any suggestions?
CodePudding user response:
Locating the data
Inspecting the stars in those ratings shows that they sit in the following HTML structure:
...
<div >
<ul >
<li>
<div>Work/Life Balance</div>
<div font-size="sm" >
<span color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
<span color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
<span color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
<span color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
<span color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
</div>
</li>
<li>
<div>Culture & Values</div>
<div font-size="sm" >
...
CSS defines how many of the stars are colored, through the class of the div immediately under 'Work/Life Balance', e.g.:
<div font-size="sm" class="css-18v8tui">
We find the corresponding CSS elsewhere in the document:
<style data-emotion-css="18v8tui">
.css-18v8tui {
display: inline-block;
line-height: 1;
background: linear-gradient(90deg, #0caa41 40%, #dee0e3 40%);
-webkit-letter-spacing: 3px;
-moz-letter-spacing: 3px;
-ms-letter-spacing: 3px;
letter-spacing: 3px;
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
}
</style>
The 40% in the background value colors the first 40% of the div green (#0caa41) and the rest grey (#dee0e3), so 40% of the stars (2 of 5) light up in this example.
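As a quick check of that arithmetic, here is a minimal sketch in base R. It assumes every rating has five stars, so each star covers 20% of the gradient. The helper name css_to_stars is made up for illustration and is not part of any library.

```r
# Hypothetical helper (not from rvest or any package): convert the text of
# one CSS rule into a star count, assuming a 5-star scale.
css_to_stars <- function(css, max_stars = 5) {
  # Capture the first "<number>%" that appears in the rule text
  m <- regmatches(css, regexec("([0-9.]+)%", css))[[1]]
  # No percentage in the rule: this is not a star-rating class
  if (length(m) < 2) return(NA_real_)
  as.numeric(m[2]) / (100 / max_stars)
}

css <- "background: linear-gradient(90deg, #0caa41 40%, #dee0e3 40%);"
stars <- css_to_stars(css)                       # 40% on a 5-star scale
no_stars <- css_to_stars("display: inline-block;")  # rule without a percentage
print(stars)
print(no_stars)
```

Rules without a percentage give NA, which makes it easy to tell rating classes apart from unrelated ones.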
Extracting the data
First we load the page
url = "https://www.glassdoor.co.uk/Reviews/PwC-Reviews-E8450.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
pg_reviews = read_html(url)
Then we extract all <style> elements, each one containing a single class in this case. We take the first ...% value in each CSS rule and divide it by 20 to convert from a percentage to a number of stars (100% is 5 stars). We save this number of stars in a named vector, where the name of each entry is the name of the corresponding CSS class. This allows us to map a rating's CSS class to a number of stars.
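# Named vector: CSS class name -> number of stars
class.ratings = c()
styles = pg_reviews %>% html_elements('style')
for(s in styles) {
  # Each <style> element names its class in the data-emotion-css attribute
  class = s %>% html_attr('data-emotion-css')
  class = paste0('css-', class)
  # First "...%" value in the rule; NA if the rule has none
  rating = str_match(s %>% html_text2(), '(\\d+)%')[2]
  # 100% is 5 stars, so each star is worth 20%
  class.ratings[class] = as.numeric(rating)/20
}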
> class.ratings
css-animation-1fypb1g css-197m635 css-67i7qe
NA 5.0 5.0
css-3x0lbp css-hdvrkk css-8hewl0
NA 5.0 5.0
css-1x8evti css-1ohf0ui css-1htgz7a
NA NA 5.0
...
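Not every percentage that we find corresponds to a star rating, since other CSS rules can contain % values too. That's okay, because we only ever look up the classes that the rating elements actually use.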
Finally we grab all reviews, each in an element with class gdReview. For each review we grab all star-ratings, each in an element with class content, in a li element. For each star-rating we extract the text label and the CSS class for the number of stars. I don't do anything to export the results, just output them to the console:
reviews = pg_reviews %>% html_elements('.gdReview')
for(re in reviews) {
  # One <li> per rated subcategory; empty if the review has none
  ratings = re %>% html_elements('.content') %>% html_elements('li')
  for(ra in ratings) {
    label = ra %>% html_element('div') %>% html_text()
    classes = ra %>% html_elements('div[font-size="sm"]') %>% html_attr('class')
    class = str_split(classes, ' ')[[1]][1] # take the first class name
    cat(label, class.ratings[class], '\n')
  }
  cat('\n')
}
output:
Work/Life Balance 5
Culture & Values 5
Diversity & Inclusion 5
Career Opportunities 5
...
Since not every review contains these star-ratings per subcategory, there will be some empty fields.
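If you would rather have the sub-ratings in a data frame than on the console, something along these lines might work. This is only a sketch under assumptions: it runs on a small made-up HTML fragment that mimics the structure described above (the class names css-aaa and css-bbb and the column names are invented), it assumes rvest 1.0 or later and stringr are installed, and it has not been run against the live Glassdoor page.

```r
library(rvest)
library(stringr)

# Toy page mimicking the structure above; class names are invented
page <- minimal_html('
  <style data-emotion-css="aaa">.css-aaa{background:linear-gradient(90deg,#0caa41 40%,#dee0e3 40%);}</style>
  <style data-emotion-css="bbb">.css-bbb{background:linear-gradient(90deg,#0caa41 100%,#dee0e3 100%);}</style>
  <div class="gdReview"><div class="content"><ul>
    <li><div>Work/Life Balance</div><div font-size="sm" class="css-aaa x"></div></li>
    <li><div>Culture &amp; Values</div><div font-size="sm" class="css-bbb x"></div></li>
  </ul></div></div>
  <div class="gdReview"><p>No sub-ratings here</p></div>
')

# CSS class name -> number of stars (20% per star)
style_nodes <- html_elements(page, "style")
class_ratings <- setNames(
  as.numeric(str_match(html_text(style_nodes), "(\\d+)%")[, 2]) / 20,
  paste0("css-", html_attr(style_nodes, "data-emotion-css"))
)

# One data frame per review, skipping reviews without sub-ratings
reviews <- html_elements(page, ".gdReview")
rows <- lapply(seq_along(reviews), function(i) {
  items <- reviews[[i]] %>% html_elements(".content li")
  if (length(items) == 0) return(NULL)
  star_class <- items %>% html_element('div[font-size="sm"]') %>% html_attr("class")
  first_class <- sub(" .*", "", star_class)  # keep only the first class name
  data.frame(
    review = i,                               # position of the review on the page
    category = items %>% html_element("div") %>% html_text(),
    stars = unname(class_ratings[first_class])
  )
})
ratings_df <- do.call(rbind, rows)
print(ratings_df)
```

Keeping the review index as a column means the sub-ratings can later be joined to the main reviews data frame by position on the page, provided both were scraped from the same page.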