Filling empty values from web scraping with a string (rvest)

Time:08-02

I am trying to scrape user reviews from a website. Some of the reviews have no body text, so the scraped vectors end up with different lengths. When I try to combine the scraped datetime, rating, and review results into a data frame, I get the error "arguments imply differing number of rows: 20, 19" (20 is the correct count).

I have looked at the solution here which uses !nzchar to perform a replacement if the length of an html node is zero. This would seem to be a good solution for me but I can't get the code to insert a value into the vector to make the length correct. My code to scrape the node that contains an empty value is:

library(rvest)
library(tidyverse)
library(stringr)

url <- "http://www.trustpilot.com/review/www.amazon.com?page=2"
working_page <- read_html(url)

working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), NA) %>%
  str_trim() %>%
  unlist()

length(working_reviews)

[1] 19

This returns a vector of 19 values; my expected output is a vector of 20 values, with 'NA' filling those values for which there isn't a review body. On this particular page, the 17th review contains no body text.

Desired result:

working_reviews[1]

[1] "I placed an order w/Amazon and selected the 18 payment plan. Amazon charged the entire amount to my card. Called them and got no where. I was told it was the banks fault and I had to take it up with them.Buyer be ware!!!"

working_reviews[17]

[1] NA

I have also tried using the following line to "force" insert a string into the empty review:

working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), "No review") %>%
  str_trim() %>%
  unlist()

This produces the same result with a length of 19 and does not include an element containing "No review".

As a test, I also tried inverting the nzchar condition by removing the '!'. That returned a 19-element vector with NA for every element.
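A likely explanation for all three results, sketched in base R: html_nodes() only returns nodes that exist, so a review without a body contributes no element at all rather than an empty string. The `scraped` vector below is a hypothetical stand-in for the 19 strings that came back, not real data from the page.

```r
# Hypothetical stand-in for the scraped text: 19 non-empty strings
scraped <- paste("review", 1:19)

# Every element is non-empty, so !nzchar() selects nothing to replace
n_empty <- sum(!nzchar(scraped))

# replace() only overwrites the selected positions; it never adds elements,
# so the result is unchanged and still has 19 elements
filled <- replace(scraped, !nzchar(scraped), "No review")

# Without the `!`, every position is selected, so every element becomes NA
inverted <- replace(scraped, nzchar(scraped), NA)
```

Under that assumption, no replacement applied after html_text() can restore the missing 20th element, because the position of the empty review has already been lost by the time the vector exists.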

Answer:

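The code below collects the results neatly into a tibble and returns NA where the review body is missing. The original selector picks up only the body nodes that exist, so a review without a body leaves no element to replace. Selecting each review card first and then calling html_element() (singular) inside each card yields exactly one value per card, with NA when nothing matches.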

library(tidyverse)
library(rvest)

page <-
  "https://www.trustpilot.com/review/www.amazon.com?page=2" %>%
  read_html()

tibble(
  name = page %>%  
    html_elements(".styles_consumerName__dP8Um") %>% 
    html_text2(),
  rating = page %>% 
    html_elements(".styles_reviewHeader__iU9Px img") %>% 
    html_attr("alt") %>% 
    parse_number(),
  title = page %>% 
    html_elements(".link_notUnderlined__szqki.typography_color-inherit__TlgPO") %>% 
    html_text2(),
  review = page %>%
    html_elements(".styles_reviewCard__hcAvl") %>%
    map(. %>%
          html_element(".typography_body__9UBeQ") %>%
          html_text2) %>%
    unlist()
)

# A tibble: 20 x 4
   name               rating title                               review
   <chr>               <dbl> <chr>                               <chr> 
 1 Octo Cavazos            1 I placed an order w/Amazon and sel~ "I pl~
 2 Jeffrey Hayes           1 Don't waste your time,energy or mo~ "Don'~
 3 Andy Here               1 Over the pandemic                   "Over~
 4 Lorna Mills             1 Customer service                    "I or~
 5 Daniel Sthamer          1 Prime delivery isn't worth it anym~ "Amaz~
 6 Carolyn                 2 Amzon delivery is not worth the pr~ "Amaz~
 7 BruceW                  5 “We apologize but Amazon has notic~ "“We ~
 8 Matthew Smego           1 Aweful                              "Almo~
 9 goku                    1 Prime membership traps…             "They~
10 Antoinette Barnett      2 Customer loyalty and/or history ar~ "Been~
11 AC                      1 Amazon has gone to sh**             "Amaz~
12 customer                1 so I ask for a refund back to my a~ "so I~
13 Will Chen               1 Rude and stupid customer service    "If p~
14 Matthew Blevins         1 Amazon Claims They Did Not Receive~ "I us~
15 Gem                     1 Ordered puppy food Monday received… "Orde~
16 SuzyJ                   1 On August 9 2022 it will have be t~ "On A~
17 Isabelle                1 Item arrived poorly packed and dam~  NA   
18 Hannah veibel           1 no Money returned                   "I or~
19 DiConti Jenine          1 Amazon is a fraudulent company.     "Amaz~
20 Urvashi                 1 Only Buyer oriented marketplace     "Does~
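The same idea can be tried offline. The following is a minimal sketch, assuming made-up HTML built with rvest's minimal_html(); the class names `card` and `body` and the review texts are hypothetical and are not Trustpilot's markup. It also uses the fact that html_element() accepts a whole node set, so map() is not strictly needed.

```r
library(rvest)

# Hypothetical markup: three review cards, the second one has no body
page <- minimal_html('
  <div class="card"><h2>First</h2><p class="body">Great service</p></div>
  <div class="card"><h2>Second</h2></div>
  <div class="card"><h2>Third</h2><p class="body">Arrived late</p></div>
')

# Selecting the bodies directly silently skips the card without one
direct <- page %>%
  html_elements(".card .body") %>%
  html_text2()

# Selecting the cards first, then one body per card, keeps every position
per_card <- page %>%
  html_elements(".card") %>%
  html_element(".body") %>%
  html_text2()

# Optionally swap the NA for a placeholder string
filled <- ifelse(is.na(per_card), "No review", per_card)
```

Here `direct` is shorter than the number of cards, while `per_card` has one entry per card and can be combined with the other columns. If a placeholder string is preferred over NA, the last line shows one way to fill it in.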