Scraping data using R and placing results in a data frame

Time:06-08

I'm trying to scrape reviews from Glassdoor using rvest and place the results in a data frame with one row per review. My code is below, but the section where I try to pull the sub-ratings (work-life balance, culture and values, etc.) doesn't work. There are five different sub-ratings within a drop-down, and one or more of them may be blank for each review. Here's my preliminary code. Do you have any suggestions for how I can pull the sub-ratings and put each sub-rating in a separate column in my data frame?

## Load libraries
library(httr)  
library(xml2)  
library(rvest) 
library(purrr) 
library(tidyverse)
library(lubridate)

## URL for scraping
url = "https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm"
pg_reviews = read_html(url)

##Create data frame and define features to scrape
Google_reviews = data.frame()

class.ratings = c()
styles = pg_reviews %>% html_elements('style')
for(s in styles) {
     class = s %>% html_attr('data-emotion-css')
     class = paste0('css-', class)
     rating = str_match(s %>% html_text2(), '(\\d+)%')[2]
     class.ratings[class] = as.numeric(rating)/20
}

reviews = pg_reviews %>% html_elements('.gdReview')

summary = pg_reviews %>% 
     html_elements(".reviewLink") %>% 
     html_text()

rating = pg_reviews %>%
     html_elements("#ReviewsFeed .mr-xsm") %>%
     html_text()

pros = pg_reviews %>%
     html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
     html_text()

cons = pg_reviews %>%
     html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(2) span") %>%
     html_text()

#Subratings--DOES NOT WORK
for(re in reviews) {
     subratings = re %>% html_elements('.content') %>% html_elements('li')
     for(i = 1 to 5) {
          
          label = i %>% html_element('div') %>% html_text()
          classes = i %>% html_elements('div[font-size="sm"]') %>% html_attr('class')
          class = str_split(classes, ' ')[[1]][1] # take the first class attribute
          cat(class.ratings[class], ',')
          
     }
work_life_balance <- subratings(1)
culture_values <- subratings(2)
career_opportunities <- subratings(3)
comp_benefits <- subratings(4)
management <- subratings(5)



}


Google_reviews = rbind(Google_reviews,data.frame(summary,rating,pros,cons,work_life_balance,culture_values,
                                                 career_opportunities,comp_benefits,management))

Answer:

Obtaining the sub-ratings and parsing them into a data frame was not trivial.
See the code comments for details.

Updated

library(rvest)

url = "https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm"
pg_reviews = read_html(url)

library(stringr)
#the ratings are stored in a data structure in a script
#find all the scripts and then search
scripts<-pg_reviews %>% html_elements(xpath='//script')

#search the scripts for the ratings
ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
#filter the script down to just the data. This is JSON-like, but I haven't figured out where it begins or ends
data <- scripts[ratingsScript] %>% html_text2() %>% str_extract("\"urlParams\":.+\\}\\}\\}\\}")


#extract the ratings
WorkLifeBalance <- str_extract_all(data, '(?<="ratingWorkLifeBalance":)\\d') %>% unlist() %>% as.integer()
CultureAndValues <- str_extract_all(data, '(?<="ratingCultureAndValues":)\\d') %>% unlist() %>% as.integer()
DiversityAndInclusion <- str_extract_all(data, '(?<="ratingDiversityAndInclusion":)\\d') %>% unlist() %>% as.integer()
SeniorLeadership <- str_extract_all(data, '(?<="ratingSeniorLeadership":)\\d') %>% unlist() %>% as.integer()
CareerOpportunities <- str_extract_all(data, '(?<="ratingCareerOpportunities":)\\d') %>% unlist() %>% as.integer()
CompensationAndBenefits <- str_extract_all(data, '(?<="ratingCompensationAndBenefits":)\\d') %>% unlist() %>% as.integer()

ratings <- cbind(WorkLifeBalance, CultureAndValues, DiversityAndInclusion, SeniorLeadership, CareerOpportunities, CompensationAndBenefits)

      WorkLifeBalance CultureAndValues DiversityAndInclusion SeniorLeadership CareerOpportunities CompensationAndBenefits
 [1,]               2                4                     2                4                   5                       4
 [2,]               2                3                     0                3                   3                       5
 [3,]               5                4                     0                4                   5                       5
 [4,]               5                5                     5                5                   5                       5
 [5,]               0                0                     0                0                   1                       0
 [6,]               5                5                     5                5                   5                       5
 [7,]               0                0                     0                0                   0                       0
 [8,]               0                0                     0                0                   0                       0
 [9,]               0                0                     0                0                   0                       0
[10,]               0                0                     0                0                   0                       0

All of the information associated with the reviews should be stored in the "data" variable. This appears to be JSON, but I couldn't determine where the JSON starts and ends, hence the need to extract the ratings manually with regular expressions.
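To show the lookbehind extraction in isolation, here is a minimal sketch run against a small made-up string. The `sample` text and the `extract_rating()` helper are hypothetical and only reuse the key names seen above; the real script content on the page is much larger and may be structured differently.

```r
library(stringr)

# Hypothetical JSON-like fragment; only the key names come from the code above
sample <- '{"reviewId":1,"ratingWorkLifeBalance":2,"ratingCareerOpportunities":5},{"reviewId":2,"ratingWorkLifeBalance":0,"ratingCareerOpportunities":3}'

# Illustrative helper: return the single digit that follows "<key>": for every match
extract_rating <- function(text, key) {
  pattern <- paste0('(?<="', key, '":)\\d')
  as.integer(unlist(str_extract_all(text, pattern)))
}

work_life <- extract_rating(sample, "ratingWorkLifeBalance")
career <- extract_rating(sample, "ratingCareerOpportunities")
```

Because each key is matched independently, the resulting vectors only line up row by row if every review carries every key, which is why it is worth checking that all vectors have the same length before binding them.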
The last line produces a matrix with one row per review and one column for each sub-rating category; wrap it in as.data.frame() if you need a data frame. You may want to convert the 0 values to NA. You can then cbind() the result to your "Google_reviews" data frame.
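A minimal sketch of that last step, using base R only. The matrix below copies the first three rows of two columns from the output above, and `reviews_df` is a hypothetical stand-in for "Google_reviews" that is assumed to have one row per review in the same order.

```r
# First three rows of two columns from the output above
ratings <- cbind(
  WorkLifeBalance = c(2L, 2L, 5L),
  DiversityAndInclusion = c(2L, 0L, 0L)
)

# cbind() on vectors returns a matrix, so convert it to a data frame first
ratings_df <- as.data.frame(ratings)

# Treat 0 as "not rated"
ratings_df[ratings_df == 0] <- NA

# Hypothetical stand-in for Google_reviews; it must have one row per review
reviews_df <- data.frame(summary = c("a", "b", "c"))
combined <- cbind(reviews_df, ratings_df)
```

This assumes that a 0 always means the reviewer left the category blank, which the page data suggests but does not state.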
