I'm trying to scrape reviews from Glassdoor using rvest and place the results in a data frame with one row per review. My preliminary code is below, but the section that tries to pull the sub-ratings (work-life balance, culture and values, etc.) doesn't work. There are five sub-ratings within a drop-down, and one or more of them may be blank for any given review. How can I pull the sub-ratings and put each one in a separate column of my data frame?
## Load libraries
library(httr)
library(xml2)
library(rvest)
library(purrr)
library(tidyverse)
library(lubridate)
## URL for scraping
url = "https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm"
pg_reviews = read_html(url)
##Create data frame and define features to scrape
Google_reviews = data.frame()
class.ratings = c()
styles = pg_reviews %>% html_elements('style')
for(s in styles) {
class = s %>% html_attr('data-emotion-css')
class = paste0('css-', class)
rating = str_match(s %>% html_text2(), '(\\d+)%')[2]
class.ratings[class] = as.numeric(rating)/20
}
reviews = pg_reviews %>% html_elements('.gdReview')
summary = pg_reviews %>%
html_elements(".reviewLink") %>%
html_text()
rating = pg_reviews %>%
html_elements("#ReviewsFeed .mr-xsm") %>%
html_text()
pros = pg_reviews %>%
html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
html_text()
cons = pg_reviews %>%
html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(2) span") %>%
html_text()
#Subratings--DOES NOT WORK
for(re in reviews) {
subratings = re %>% html_elements('.content') %>% html_elements('li')
for(i = 1 to 5) {
label = i %>% html_element('div') %>% html_text()
classes = i %>% html_elements('div[font-size="sm"]') %>% html_attr('class')
class = str_split(classes, ' ')[[1]][1] # take the first class attribute
cat(class.ratings[class], ',')
}
work_life_balance <- subratings(1)
culture_values <- subratings(2)
career_opportunities <- subratings(3)
comp_benefits <- subratings(4)
management <- subratings(5)
}
Google_reviews = rbind(Google_reviews,data.frame(summary,rating,pros,cons,work_life_balance,culture_values
career_opportunities,comp_benefits,management))
CodePudding user response:
Obtaining the sub-ratings and parsing them into a data frame was not trivial; see the comments in the code for details.
Updated code:
library(rvest)
library(stringr)
url = "https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm"
pg_reviews = read_html(url)
#the ratings are stored in a data structure in a script
#find all the scripts and then search
scripts<-pg_reviews %>% html_elements(xpath='//script')
#search the scripts for the ratings
ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
#narrow the script down to just the data. This looks like JSON, but I haven't worked out where it begins or ends
data <- scripts[ratingsScript] %>% html_text2() %>% str_extract("\"urlParams\":.+\\}\\}\\}\\}")
#extract the ratings
WorkLifeBalance <- str_extract_all(data, '(?<="ratingWorkLifeBalance":)\\d') %>% unlist() %>% as.integer()
CultureAndValues <- str_extract_all(data, '(?<="ratingCultureAndValues":)\\d') %>% unlist() %>% as.integer()
DiversityAndInclusion <- str_extract_all(data, '(?<="ratingDiversityAndInclusion":)\\d') %>% unlist() %>% as.integer()
SeniorLeadership <- str_extract_all(data, '(?<="ratingSeniorLeadership":)\\d') %>% unlist() %>% as.integer()
CareerOpportunities <- str_extract_all(data, '(?<="ratingCareerOpportunities":)\\d') %>% unlist() %>% as.integer()
CompensationAndBenefits<- str_extract_all(data, '(?<="ratingCompensationAndBenefits":)\\d') %>% unlist() %>% as.integer()
ratings <- cbind(WorkLifeBalance, CultureAndValues, DiversityAndInclusion, SeniorLeadership, CareerOpportunities, CompensationAndBenefits)
      WorkLifeBalance CultureAndValues DiversityAndInclusion SeniorLeadership CareerOpportunities CompensationAndBenefits
 [1,]               2                4                     2                4                   5                       4
 [2,]               2                3                     0                3                   3                       5
 [3,]               5                4                     0                4                   5                       5
 [4,]               5                5                     5                5                   5                       5
 [5,]               0                0                     0                0                   1                       0
 [6,]               5                5                     5                5                   5                       5
 [7,]               0                0                     0                0                   0                       0
 [8,]               0                0                     0                0                   0                       0
 [9,]               0                0                     0                0                   0                       0
[10,]               0                0                     0                0                   0                       0
All of the information associated with the reviews should be stored in the "data" variable. It appears to be JSON, but I couldn't determine the correct start and end points, hence the manual extraction of the ratings.
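The six near-identical extraction lines could be folded into a single helper. The following is only a sketch, using base R's regmatches() and gregexpr() in place of stringr; extract_rating, sample_data and ratings_sketch are hypothetical names, and the sample string is made up for illustration rather than taken from Glassdoor's actual payload.

```r
# Hypothetical sample standing in for the "data" string extracted from the page
sample_data <- paste0(
  '{"ratingWorkLifeBalance":2,"ratingCultureAndValues":4},',
  '{"ratingWorkLifeBalance":5,"ratingCultureAndValues":0}'
)

# Return every single-digit value that follows "rating<key>": in the text
extract_rating <- function(text, key) {
  pattern <- paste0('(?<="rating', key, '":)\\d')
  as.integer(regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]])
}

# One column per key, one row per review (assumes every key appears once per review)
keys <- c("WorkLifeBalance", "CultureAndValues")
ratings_sketch <- sapply(keys, extract_rating, text = sample_data)
```

With the real page, the keys vector would list all six category names used above. If any key were missing from some reviews, the columns would have different lengths and sapply() would return a list instead of a matrix, so the one-value-per-review assumption is worth checking.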
The cbind() call produces "ratings", a matrix with one row per review and one column for each sub-rating category. You may want to convert the 0s to NA. You can then cbind() it to your "Google_reviews" data frame.
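A minimal sketch of that last step, assuming a 0 means the reviewer left the sub-rating blank. The small "ratings" and "Google_reviews" objects here are hypothetical stand-ins with illustrative values, and ratings_df is a name introduced only for this example.

```r
# Hypothetical stand-ins for the objects built earlier (values are illustrative only)
ratings <- cbind(WorkLifeBalance  = c(2L, 0L, 5L),
                 CultureAndValues = c(4L, 0L, 5L))
Google_reviews <- data.frame(summary = c("Review A", "Review B", "Review C"))

# Convert the matrix to a data frame and treat 0 as "not rated" (an assumption)
ratings_df <- as.data.frame(ratings)
ratings_df[ratings_df == 0] <- NA

# Append the sub-rating columns; both objects must have the same number of rows
stopifnot(nrow(Google_reviews) == nrow(ratings_df))
Google_reviews <- cbind(Google_reviews, ratings_df)
```

The row-count check matters because cbind() pairs rows by position: if the scraped summaries and the extracted ratings ever differ in length, the columns would not line up with the right reviews.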