R: Webscraping Pizza Shops - Adding Phone Numbers?-CodePudding

I am working with the R programming language.

I trying to scrape the name, address and phone numbers of the pizza stores on this website :

But I am not sure how I can include this specification in the existing webscraping code.

Can someone please show me how to do this?

Thanks!

CodePudding user response：

This is pretty straight forward. The key is to perform the parsing in two steps. First find the parent node for each business then extract out the phone number.

library(rvest)

page<-read_html("https://www.yellowpages.ca/search/si/2/pizza/Canada")
#get the individual business card
nodes <- page %>% html_elements("div.listing_right_section")
#find the phone number node
phone <-  nodes %>% html_element("ul h4") %>% html_text()

In this case the phone numbers are within a "h4" tag underneath an "ul" tag.

Update
Incorporating the above code into your code:

scraper <- function(url) {
   page <- url %>% read_html(url)
   
   #find the business records nodes
   businesses <- page %>% html_elements("div.listing_right_section")
   
   #now extract the request information from each node
   tibble(
      name = businesses %>% html_element(".jsListingName") %>% html_text2(),
      address = businesses %>% html_element(".listing__address--full") %>% html_text2(),
      phone =  businesses %>% html_element("ul h4") %>% html_text()
   )
}

name_address = scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")

The key for successful scraping is to use html_elements() to grab all of the parent nodes, each of which contain a full record. Then use html_element() (without the s) to grab the wanted field (child node) out of each parent node.
This will always work. Using html_elements will provide vectors with a variable number of elements if there is some missing information.

CodePudding user response：

@ Dave2e: I took the code provided in your answer and tried to incorporate it into the existing code:

scraper <- function(url) {
  page <- url %>% 
    read_html()
  
  tibble(
    name = page %>%  
      html_elements(".jsListingName") %>% 
      html_text2(),
    address = page %>% 
      html_elements(".listing__address--full") %>% 
      html_text2()
  )
}

name_address = scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")


library(rvest)

page<-read_html("https://www.yellowpages.ca/search/si/2/pizza/Canada")
#get the individual business card
nodes <- page %>% html_elements("div.listing_right_section")
#find the phone number node
phone <-  nodes %>% html_element("ul h4") %>% html_text()

final = cbind(name_address, phone)

Have I done this correctly? Is there a better way (e.g. more efficient) to do this?

Thank you so much!