I created a data frame of reviews from a website. The three columns are Date, Rating, and Text. I want to see only the 1-star and 5-star reviews. I have tried everything below and get roughly the same error:
df %>% filter(Rating = '1 star', Rating = '5 star')
df$Rating
[1] Date Rating Text
<0 rows> (or 0-length row.names)
None have worked. Here's the full code. The bit with the df is at the very bottom:
library(rvest)
library(tidyverse)
# Create url object ---------------------------------
url = "https://www.yelp.com/biz/24th-st-pizzeria-san-antonio?osq=Worst Restaurant"
# Convert url to html object ------------------------
page <- read_html(url)
# Number of pages -----------------------------------
pageNums = page %>%
  html_elements(xpath = "//div[@class=' border-color--default__09f24__NPAKY text-align--center__09f24__fYBGO']") %>%
  html_text() %>%
  str_extract('of.*') %>%
  str_remove('of ') %>%
  as.numeric()
# Create page sequence ------------------------------
pageSequence <- seq(from=0, to=(pageNums * 10)-10, by = 10)
# Create empty vectors to store data ----------------
review_date_all = c()
review_rating_all = c()
review_text_all = c()
# Create for loop -----------------------------------
for (i in pageSequence){
  if (i==0){
    page <- read_html(url)
  } else {
    page <- read_html(paste0(url, '&start=', i))
  }
  # Review date ----
  review_dates <- page %>%
    html_elements(xpath = "//*[@class=' css-chan6m']") %>%
    html_text() %>%
    .[str_detect(., "^\\d+[/]\\d+[/]\\d{4}$")]
  # Review Rating ----
  review_ratings <- page %>%
    html_elements(xpath = "//div[starts-with(@class, ' review')]") %>%
    html_elements(xpath = ".//div[contains(@aria-label, 'rating')]") %>%
    html_attr('aria-label') %>%
    str_remove('rating')
  # Review text ----
  review_text = page %>%
    html_elements(xpath = "//p[starts-with(@class, 'comment')]") %>%
    html_text()
  # For each page, append these to appropriate vectors----
  review_date_all = append(review_date_all, review_dates)
  review_rating_all = append(review_rating_all, review_ratings)
  review_text_all = append(review_text_all, review_text)
}
# Create data frame ---------------------------------
df <- data.frame('Date' = review_date_all,
                 'Rating' = review_rating_all,
                 'Text' = review_text_all)
View(df)
What am I overlooking?
Answer:
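There's an issue with the Rating values in your df: every rating has an extra space at the end. str_remove('rating') strips the word but not the space before it, so a label such as '1 star rating' becomes '1 star ' rather than '1 star'.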
So you need to do something like this:
df1 <- df %>%
  filter(Rating == '1 star ' | Rating == '5 star ')
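To see where the extra space comes from, it can help to rerun the cleaning step on a few made-up labels. This is only a sketch: the `labels` vector is hypothetical sample data, and it assumes the scraped aria-label values look like '5 star rating' (inferred from the filter strings above, not checked against the live page).

```r
library(stringr)

# Hypothetical aria-label values, assumed to resemble the scraped ones
labels <- c("1 star rating", "5 star rating", "3 star rating")

# Same cleaning step as in the question: only the word "rating" is removed,
# so the space that preceded it is left behind
ratings <- str_remove(labels, "rating")

# dput() prints each string in quotes, which makes the trailing space visible
dput(ratings)

# nchar() shows the same thing: each value is one character longer than expected
nchar(ratings)

# An exact comparison against the untrimmed values matches nothing
ratings == "1 star"
```

Running `dput(unique(df$Rating))` on the real data frame is a quick way to check for the same problem there.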
You can also remove the trailing whitespace using the stringr library, as follows:
library(stringr)
df1 <- df %>%
  mutate(Rating = str_squish(Rating)) %>%
  filter(Rating == '1 star' | Rating == '5 star')
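A slightly more compact variant uses %in%, which avoids repeating the column name. The sketch below runs on made-up rows: the `reviews` data frame and the `extremes` name are hypothetical stand-ins for the scraped df, and str_trim() is swapped in for str_squish() on the assumption that only leading and trailing whitespace needs removing.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the scraped data frame, including the trailing space
reviews <- data.frame(
  Date   = c("1/2/2022", "3/4/2022", "5/6/2022"),
  Rating = c("1 star ", "3 star ", "5 star "),
  Text   = c("Terrible", "Fine", "Great")
)

extremes <- reviews %>%
  mutate(Rating = str_trim(Rating)) %>%        # drop leading/trailing whitespace
  filter(Rating %in% c("1 star", "5 star"))    # keep rows matching either value

extremes
```

Two details about the original attempt are worth keeping in mind: filter() needs the comparison operator == (or %in%) rather than =, and conditions separated by commas are combined with AND, so filter(Rating == '1 star', Rating == '5 star') can never match a row.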