Rvest ignore certain nested elements

Time:04-12

I am trying to learn how to scrape and am practicing on Yelp. I ran into a problem on one of the pages, which has an extra nested div:

https://www.yelp.com/biz/funny-bbq-new-york-2?start=40

I need to ignore the "Previous review" block. Here is the code I wrote to scrape everything else I need. I also want to make sure it gets ignored on any page that has more than one, though I have not seen any. The problem with the previous review is that the first three objects (dates, review_text and stars) have length 11, but the last three end up with only 10.

review_top <- "https://www.yelp.com/biz/funny-bbq-new-york-2?start=40" %>%
        read_html() %>%
        html_elements("ul.undefined:nth-child(4)")

    reviews <- tibble(
        dates = review_top %>% html_elements(".margin-t1__09f24__w96jn") %>%
            html_text(),
        review_text = review_top %>% html_elements(".raw__09f24__T4Ezm") %>%
            html_text(),
        stars = review_top %>% html_elements(".i-stars__09f24__M1AR7") %>%
            html_attr("aria-label"),
        useful = review_top %>%
            html_elements(xpath = "(//span[@class=' css-12i50in'])[position() mod 3 = 1]") %>%
            html_text2(),
        funny = review_top %>%
            html_elements(xpath = "(//span[@class=' css-12i50in'])[position() mod 3 = 2]") %>%
            html_text2(),
        cool = review_top %>%
            html_elements(xpath = "(//span[@class=' css-12i50in'])[position() mod 3 = 0]") %>%
            html_text2()
    )

CodePudding user response:

An easier way to approach this is to extract the 10 reviews on the page and then extract the desired information from each review. Each review is a <div class="review__09f24__oHr9V"> node underneath an <li class="margin-b5__09f24__pTvws"> node.

Using html_element() (no "s") always returns exactly one result per node, even if that result is NA. Because we are extracting from 10 review nodes, we get the first match within each of the 10: any additional matches inside a review are ignored, and missing values are returned as NA.
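To see the difference between the two functions without hitting Yelp, here is a minimal sketch on made-up HTML. The class names "review", "date" and "stars" are invented for illustration and are not Yelp's; the second review deliberately has no star rating.

```r
library(rvest)

# Made-up HTML standing in for a list of reviews (hypothetical class names).
page <- minimal_html('
  <ul>
    <li class="review"><span class="date">1/2/2022</span><div class="stars" aria-label="5 star rating"></div></li>
    <li class="review"><span class="date">1/3/2022</span></li>
    <li class="review"><span class="date">1/4/2022</span><div class="stars" aria-label="3 star rating"></div></li>
  </ul>')

reviews <- page %>% html_elements("li.review")

# html_elements() collects every match, so the review without stars
# simply disappears and the result is shorter than `reviews`.
stars_all <- reviews %>% html_elements(".stars") %>% html_attr("aria-label")

# html_element() returns exactly one result per review, NA where nothing matches.
stars_each <- reviews %>% html_element(".stars") %>% html_attr("aria-label")

length(stars_all)   # 2
stars_each          # "5 star rating" NA "3 star rating"
```

Because stars_each has the same length as reviews, it lines up with the other columns when they are combined into a tibble.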

library(rvest)
library(tibble)

page <- "https://www.yelp.com/biz/funny-bbq-new-york-2?start=40" %>% read_html()

# Find all of the top-level reviews
reviews <- page %>% html_elements("li.margin-b5__09f24__pTvws div.review__09f24__oHr9V")

# Extract the desired information from each review.
# html_element() (no "s") returns exactly one result per review, NA if nothing matches.
dates <- reviews %>% html_element(".margin-t1__09f24__w96jn") %>% html_text()
review_text <- reviews %>% html_element(".raw__09f24__T4Ezm") %>% html_text()
stars <- reviews %>% html_element("div.i-stars__09f24__M1AR7") %>% html_attr("aria-label")

# The XPath starts with "." so that it searches inside each review;
# a leading "//" would search the whole document for every review.
# Assumes the three counter spans appear in the order useful, funny, cool.
useful <- reviews %>% html_element(xpath = "(.//span[@class=' css-12i50in'])[1]") %>%
   html_text2()
funny <- reviews %>% html_element(xpath = "(.//span[@class=' css-12i50in'])[2]") %>%
   html_text2()
cool <- reviews %>% html_element(xpath = "(.//span[@class=' css-12i50in'])[3]") %>%
   html_text2()

answer <- tibble(dates, review_text, stars, useful, funny, cool)
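
One detail worth checking when an XPath expression is applied to a set of nodes: an expression that starts with "//" is evaluated from the document root, not from each node, while ".//" is evaluated relative to the current node. A minimal sketch on made-up HTML (the class names "review" and "n" are invented for illustration):

```r
library(rvest)

# Made-up HTML: two reviews, each with three counter spans.
page <- minimal_html('
  <div class="review"><span class="n">1</span><span class="n">2</span><span class="n">3</span></div>
  <div class="review"><span class="n">4</span><span class="n">5</span><span class="n">6</span></div>')

reviews <- page %>% html_elements("div.review")

# "//" searches from the document root, so every review gets the same first span.
absolute <- reviews %>% html_element(xpath = "(//span[@class='n'])[1]") %>% html_text2()

# ".//" searches inside each review, so each review gets its own first span.
relative <- reviews %>% html_element(xpath = "(.//span[@class='n'])[1]") %>% html_text2()

absolute   # "1" "1"
relative   # "1" "4"
```

If the per-review counters all come back with the same value, a missing leading "." in the XPath is a likely cause.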