Unable to filter data frame by column value

I created a data frame of reviews scraped from a website. The three columns are Date, Rating, and Text. I want to see only the 1-star and 5-star reviews. I have tried everything below and get roughly the same output each time:

df %>% filter(Rating = '1 star', Rating = '5 star')

df$Rating

[1] Date   Rating Text  
<0 rows> (or 0-length row.names)

None have worked. Here's the full code. The bit with the df is at the very bottom:

library(rvest)
library(tidyverse)

# Create url object ---------------------------------
url = "https://www.yelp.com/biz/24th-st-pizzeria-san-antonio?osq=Worst Restaurant"

# Convert url to html object ------------------------
page <- read_html(url)

# Number of pages -----------------------------------
pageNums = page %>%
  html_elements(xpath = "//div[@class=' border-color--default__09f24__NPAKY text-align--center__09f24__fYBGO']") %>%
  html_text() %>%
  str_extract('of.*') %>% 
  str_remove('of ') %>% 
  as.numeric() 

# Create page sequence ------------------------------
pageSequence <- seq(from=0, to=(pageNums * 10)-10, by = 10)

# Create empty vectors to store data ----------------
review_date_all = c()
review_rating_all = c()
review_text_all = c()

# Create for loop -----------------------------------
for (i in pageSequence){
  if (i==0){
    page <- read_html(url) 
  } else {
    page <- read_html(paste0(url, '&start=', i))
  }
  
  # Review date ----
  review_dates <- page %>%
    html_elements(xpath = "//*[@class=' css-chan6m']") %>%
    html_text() %>%
    .[str_detect(., "^\\d+[/]\\d+[/]\\d{4}$")]
  
  # Review Rating ----
  review_ratings <- page %>%
    html_elements(xpath = "//div[starts-with(@class, ' review')]") %>%
    html_elements(xpath = ".//div[contains(@aria-label, 'rating')]") %>%
    html_attr('aria-label') %>%
    str_remove('rating')
  
  # Review text ----
  review_text = page %>%
    html_elements(xpath = "//p[starts-with(@class, 'comment')]") %>%
    html_text()
  
  # For each page, append these to appropriate vectors----
  review_date_all = append(review_date_all, review_dates)
  review_rating_all = append(review_rating_all, review_ratings)
  review_text_all = append(review_text_all, review_text)
}

# Create data frame ---------------------------------
df <- data.frame('Date' = review_date_all,
                 'Rating' = review_rating_all,
                 'Text'= review_text_all)
View(df)

What am I overlooking?

CodePudding user response:

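The problem is in the Rating values in your df: every rating has a trailing space. The aria-label presumably reads something like "1 star rating", and str_remove('rating') strips the word but leaves the space in front of it, so the stored value is '1 star ' rather than '1 star'. Two more things to watch in the filter() call: comparison needs == (a single = is not a comparison), and conditions separated by a comma are combined with AND, so a row would have to be 1 star and 5 star at the same time. Use | for OR.

So, to match the values as they are currently stored, you need something like this: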

df1 <- df %>%
  filter(Rating == '1 star ' | Rating == '5 star ')
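
If you want to confirm the trailing space yourself, a quick check along these lines may help. This is only a sketch: the ratings vector is a hypothetical stand-in for df$Rating, not the actual scraped values.

```r
# Hypothetical stand-in for the scraped df$Rating column
ratings <- c("1 star ", "3 star ", "5 star ")

# An exact comparison without the trailing space matches nothing
ratings == "1 star"

# Printing with quotes, or counting characters, makes the space visible
print(ratings, quote = TRUE)
nchar(ratings)

# Base R's trimws() strips leading and trailing whitespace
trimws(ratings) == "1 star"
```

Running unique(df$Rating) on the real data frame is another easy way to see exactly which strings you have to match.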

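Alternatively, remove the trailing whitespace first with the stringr package. str_squish() trims both ends and collapses repeated internal spaces; str_trim() would also work here: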

library(stringr)
df1 <- df %>%
  mutate(Rating = str_squish(Rating)) %>%
  filter(Rating == '1 star' | Rating == '5 star')
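
If the list of ratings you want to keep grows, chaining | comparisons gets awkward. One possible variation, sketched here on a small made-up data frame (df_demo, wanted, and the row values are all hypothetical, not taken from the scraped site), is to trim the column and then filter with %in%:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the scraped data frame
df_demo <- data.frame(
  Date   = c("1/2/2022", "3/4/2022", "5/6/2022"),
  Rating = c("1 star ", "3 star ", "5 star "),
  Text   = c("Bad", "Okay", "Great")
)

# Ratings to keep (assumed labels, without the trailing space)
wanted <- c("1 star", "5 star")

df_filtered <- df_demo %>%
  mutate(Rating = str_trim(Rating)) %>%  # drop leading/trailing whitespace
  filter(Rating %in% wanted)             # keep rows whose rating is in the set

df_filtered
```

The same two steps should work on your df once it has been built; only the toy data would need to be dropped.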