I am trying to scrape user reviews from a website. Some of the reviews have no body text, so I end up with vectors of different lengths and get the error "arguments imply differing number of rows: 20, 19" (20 is correct) when I try to combine the scraped datetime, rating, and review results into a data frame.
I have looked at the solution here, which uses !nzchar to perform a replacement if the length of an HTML node is zero. This seems like a good solution for me, but I can't get the code to insert a value into the vector to make the length correct. My code to scrape the node that contains an empty value is:
library(rvest)
library(tidyverse)
library(stringr)
url <- "http://www.trustpilot.com/review/www.amazon.com?page=2"
working_page <- read_html(url)
working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), NA) %>%
  str_trim() %>%
  unlist()
length(working_reviews)
[1] 19
This returns a vector of 19 values; my expected output is a vector of 20 values, with 'NA' filling those values for which there isn't a review body. On this particular page, the 17th review contains no body text.
Desired result:
working_reviews[1]
[1] "I placed an order w/Amazon and selected the 18 payment plan. Amazon charged the entire amount to my card. Called them and got no where. I was told it was the banks fault and I had to take it up with them.Buyer be ware!!!"
working_reviews[17]
[17] "NA"
I have also tried using the following line to "force" insert a string into the empty review:
working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), "No review") %>%
  str_trim() %>%
  unlist()
This produces the same result with a length of 19 and does not include an element containing "No review".
I also tried inverting the nzchar code as a test, removing the '!' and got back a 19-element vector with "NA" for every element.
CodePudding user response:
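This collects the fields neatly into a tibble and returns NA if the review is missing. The trick is to select each review card first with html_elements(), then call html_element() (singular) inside each card: it returns exactly one result per card, so a card without body text yields NA instead of being dropped.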
library(tidyverse)
library(rvest)
page <-
  "https://www.trustpilot.com/review/www.amazon.com?page=2" %>%
  read_html()
tibble(
  name = page %>%
    html_elements(".styles_consumerName__dP8Um") %>%
    html_text2(),
  rating = page %>%
    html_elements(".styles_reviewHeader__iU9Px img") %>%
    html_attr("alt") %>%
    parse_number(),
  title = page %>%
    html_elements(".link_notUnderlined__szqki.typography_color-inherit__TlgPO") %>%
    html_text2(),
  # One card per review, then one body per card: a missing body becomes NA
  review = page %>%
    html_elements(".styles_reviewCard__hcAvl") %>%
    map(. %>%
      html_element(".typography_body__9UBeQ") %>%
      html_text2) %>%
    unlist()
)
# A tibble: 20 x 4
   name               rating title                               review
   <chr>               <dbl> <chr>                               <chr>
 1 Octo Cavazos            1 I placed an order w/Amazon and sel~ "I pl~
 2 Jeffrey Hayes           1 Don't waste your time,energy or mo~ "Don'~
 3 Andy Here               1 Over the pandemic                   "Over~
 4 Lorna Mills             1 Customer service                    "I or~
 5 Daniel Sthamer          1 Prime delivery isn't worth it anym~ "Amaz~
 6 Carolyn                 2 Amzon delivery is not worth the pr~ "Amaz~
 7 BruceW                  5 “We apologize but Amazon has notic~ "“We ~
 8 Matthew Smego           1 Aweful                              "Almo~
 9 goku                    1 Prime membership traps…             "They~
10 Antoinette Barnett      2 Customer loyalty and/or history ar~ "Been~
11 AC                      1 Amazon has gone to sh**             "Amaz~
12 customer                1 so I ask for a refund back to my a~ "so I~
13 Will Chen               1 Rude and stupid customer service    "If p~
14 Matthew Blevins         1 Amazon Claims They Did Not Receive~ "I us~
15 Gem                     1 Ordered puppy food Monday received… "Orde~
16 SuzyJ                   1 On August 9 2022 it will have be t~ "On A~
17 Isabelle                1 Item arrived poorly packed and dam~ NA
18 Hannah veibel           1 no Money returned                   "I or~
19 DiConti Jenine          1 Amazon is a fraudulent company.     "Amaz~
20 Urvashi                 1 Only Buyer oriented marketplace     "Does~
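As for why replace(!nzchar(.), NA) never fires in the question: a review without body text has no body node at all, so html_nodes() returns only the 19 nodes that exist and there is no empty string to replace. Below is a minimal offline sketch of the difference. The inline HTML and the class names "card" and "body" are made up for illustration and are not Trustpilot's markup.

```r
library(rvest)

# Three hypothetical review cards; the second one has no body paragraph.
page <- minimal_html('
  <div class="card"><p class="body">Great</p></div>
  <div class="card"></div>
  <div class="card"><p class="body">Awful</p></div>
')

# Selecting the bodies directly: the card without a body contributes nothing,
# so the result is shorter than the number of cards.
direct <- page %>%
  html_elements(".body") %>%
  html_text2()
length(direct)

# Selecting the cards first, then one body per card: html_element() returns
# one result per card, and html_text2() turns a missing node into NA.
per_card <- page %>%
  html_elements(".card") %>%
  html_element(".body") %>%
  html_text2()
per_card
```

Because html_element() is applied to each card in the node set, the map() and unlist() steps in the answer above should be optional: piping the cards straight into html_element() and then html_text2() ought to give the same character vector.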