Unable to filter data frame by column value

Time:03-18

I created a data frame of reviews scraped from a website. The three columns are Date, Rating, and Text. I want to see only the 1-star and 5-star reviews. I have tried everything below and get roughly the same result each time:

df %>% filter(Rating = '1 star', Rating = '5 star')

df$Rating

[1] Date   Rating Text  
<0 rows> (or 0-length row.names)

None have worked. Here's the full code. The bit with the df is at the very bottom:

library(rvest)
library(tidyverse)

# Create url object ---------------------------------
url = "https://www.yelp.com/biz/24th-st-pizzeria-san-antonio?osq=Worst+Restaurant"

# Convert url to html object ------------------------
page <- read_html(url)

# Number of pages -----------------------------------
pageNums = page %>%
  html_elements(xpath = "//div[@class=' border-color--default__09f24__NPAKY text-align--center__09f24__fYBGO']") %>%
  html_text() %>%
  str_extract('of.*') %>% 
  str_remove('of ') %>% 
  as.numeric() 

# Create page sequence ------------------------------
pageSequence <- seq(from=0, to=(pageNums * 10)-10, by = 10)

# Create empty vectors to store data ----------------
review_date_all = c()
review_rating_all = c()
review_text_all = c()

# Create for loop -----------------------------------
for (i in pageSequence){
  if (i==0){
    page <- read_html(url) 
  } else {
    page <- read_html(paste0(url, '&start=', i))
  }
  
  # Review date ----
  review_dates <- page %>%
    html_elements(xpath = "//*[@class=' css-chan6m']") %>%
    html_text() %>%
    .[str_detect(., "^\\d+[/]\\d+[/]\\d{4}$")]
  
  # Review Rating ----
  review_ratings <- page %>%
    html_elements(xpath = "//div[starts-with(@class, ' review')]") %>%
    html_elements(xpath = ".//div[contains(@aria-label, 'rating')]") %>%
    html_attr('aria-label') %>%
    str_remove('rating')
  
  # Review text ----
  review_text = page %>%
    html_elements(xpath = "//p[starts-with(@class, 'comment')]") %>%
    html_text()
  
  # For each page, append these to appropriate vectors----
  review_date_all = append(review_date_all, review_dates)
  review_rating_all = append(review_rating_all, review_ratings)
  review_text_all = append(review_text_all, review_text)
}

# Create data frame ---------------------------------
df <- data.frame('Date' = review_date_all,
                 'Rating' = review_rating_all,
                 'Text'= review_text_all)
View(df)

What am I overlooking?

CodePudding user response:

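The problem is in the Rating values in your df: every rating has an extra space at the end. str_remove('rating') deletes only the word "rating" from the aria-label, so a label that presumably reads "1 star rating" becomes '1 star ' with a trailing space, and that never equals '1 star'. Two more things to watch in the filter() call: comparisons need == rather than =, and conditions separated by commas are combined with AND, so no row can match both. Use | (OR) instead.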

So you need to do something like this:

df1 <- df %>%
  filter(Rating == '1 star ' | Rating == '5 star ')
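
To see where the trailing space comes from, here is a small sketch on made-up strings. The labels vector is hypothetical; it only assumes that the scraped aria-label values look like "1 star rating".

```r
library(stringr)

# Hypothetical aria-label values, assumed to resemble what the scraper returns
labels <- c("1 star rating", "5 star rating", "3 star rating")

# str_remove() deletes only the matched text, so the space before it stays
ratings <- str_remove(labels, "rating")

# Quoted printing and nchar() make the trailing space visible
print(ratings)
print(nchar(ratings))

# The comparison against the clean value fails because of that space
print(ratings == "1 star")
```

Running unique(df$Rating) on the real data frame is a quick way to check for the same thing, since character values print inside quotes.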

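You can also remove the trailing whitespace first and then compare against the clean values. The stringr library (already attached by tidyverse) has str_squish(), which trims both ends and collapses repeated spaces inside the string: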

library(stringr)
df1 <- df %>%
  mutate(Rating = str_squish(Rating)) %>%
  filter(Rating == '1 star' | Rating == '5 star')
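
As an alternative sketch, the two comparisons can be folded into one %in% test, which scales better if more ratings are wanted later. The df_demo data frame and the wanted vector are made up for illustration and stand in for the scraped df; str_trim() is used here instead of str_squish() because only the ends need cleaning.

```r
library(dplyr)
library(stringr)

# Made-up stand-in for the scraped data frame; values are illustrative only
df_demo <- data.frame(
  Date   = c("1/2/2022", "3/4/2022", "5/6/2022"),
  Rating = c("1 star ", "3 star ", "5 star "),
  Text   = c("Bad", "Okay", "Great")
)

# Ratings to keep (hypothetical helper vector)
wanted <- c("1 star", "5 star")

# Trim whitespace at both ends, then keep rows whose Rating is in the wanted set
df_filtered <- df_demo %>%
  mutate(Rating = str_trim(Rating)) %>%
  filter(Rating %in% wanted)

print(df_filtered)
```

The same pipeline should work on the real df by replacing df_demo, assuming the Rating column holds the strings described above.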