I am working with the R programming language.
I am trying to scrape the names and addresses of the pizza stores on this website https://www.yellowpages.ca/search/si/2/pizza/Canada (and on the following pages, e.g. https://www.yellowpages.ca/search/si/3/pizza/Canada, https://www.yellowpages.ca/search/si/4/pizza/Canada, etc.)
I am trying to follow the answer provided here: Scraping Yellowpages in R
library(rvest)
library(stringr)
library(tibble)

url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater Sydney, NSW&lat=&lon=&selectedViewMode=list"

testscrape <- function(url){
  webpage <- read_html(url)
  # listing names
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  # phone numbers
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  # email addresses, pulled out of the mailto: links
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40", "@")
  # index over the longest of the three vectors so shorter ones are padded with NA
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)
But this code is taking a very long time to run. I tried to investigate by running individual parts of the function, and I think I found the problem: the read_html() call itself never finishes. I tried replacing it with another statement:
library(httr)
webpage <- GET(url)
This works, but GET() returns a response object rather than the parsed HTML document that the rest of the code expects.
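As far as I can tell, the response first has to be converted back to text before read_html() will parse it; this is what I have been trying, but I am not sure it is the right approach:

library(httr)
library(rvest)

# fetch the page with httr, then parse the returned HTML text with rvest
resp <- GET("https://www.yellowpages.ca/search/si/2/pizza/Canada")
webpage <- read_html(content(resp, as = "text", encoding = "UTF-8"))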
Can someone please show me how to do this?
In the end, I would like the output to look something like this:
id name address
1 1 OJ's Steak & Pizza 9906B Franklin Ave, Fort McMurray, AB T9H 2K5
2 2 MJs Pizza & Grill 10012 Franklin Ave, Fort McMurray, AB T9H 2K6
3 3 Hu's Pizza & Donairs 10020 Franklin Ave, Fort McMurray, AB T9H 2K6
# sample results
sample_results = structure(list(id = c(1, 2, 3), name = c("OJ's Steak & Pizza",
"MJs Pizza & Grill", "Hu's Pizza & Donairs"), address = c("9906B Franklin Ave, Fort McMurray, AB T9H 2K5",
"10012 Franklin Ave, Fort McMurray, AB T9H 2K6", "10020 Franklin Ave, Fort McMurray, AB T9H 2K6"
)), class = "data.frame", row.names = c(NA, -3L))
Thanks!
CodePudding user response:
Fast, but not robust. (If either the name or the address is missing for a listing, the code will break, I think.)
library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>%
    read_html()

  # one row per listing: name and full address
  tibble(
    name = page %>%
      html_elements(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_elements(".listing__address--full") %>%
      html_text2()
  )
}
scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")
# A tibble: 35 x 2
name address
<chr> <chr>
1 OJ's Steak & Pizza 9906B Franklin Ave, Fort McMurray, AB T~
2 MJs Pizza & Grill 10012 Franklin Ave, Fort McMurray, AB T~
3 Hu's Pizza & Donairs 10020 Franklin Ave, Fort McMurray, AB T~
4 Eagle Ridge Convenience Store & Pizza 117-375 Loutit Rd, Fort McMurray, AB T9~
5 Cosmos Pizza 9713 Hardin St, Fort McMurray, AB T9H 1~
6 Boston Pizza 10202 MacDonald Ave, Fort McMurray, AB ~
7 Jomaa's Pizza & Chicken Beacon Hill Shpg Plaza, Fort McMurray, ~
8 Abasand PK's Pizza 101-307 Athabasca Ave, Fort McMurray, A~
9 Pizza 73 1-289 Powder Dr, Ft McMurray, AB T9K 0M5
10 Boston Pizza 110 Millennium Dr, Fort McMurray, AB T9~
# ... with 25 more rows
# ℹ Use `print(n = ...)` to see more rows
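To get the id/name/address format across several pages, one option is to map the scraper above over the page URLs and number the rows afterwards. A minimal sketch, assuming the page number simply slots into the URL as in the links you posted, and that every listing returns both a name and an address (tibble() will error on unequal lengths otherwise):

# pages 2 to 4; adjust the range as needed
urls <- paste0("https://www.yellowpages.ca/search/si/", 2:4, "/pizza/Canada")

all_pizza <- urls %>%
  map_dfr(function(u) {
    Sys.sleep(1)   # small pause between requests, to be polite to the site
    scraper(u)
  }) %>%
  mutate(id = row_number()) %>%
  select(id, name, address)

all_pizza

map_dfr(), mutate(), row_number() and select() all come from purrr/dplyr, which library(tidyverse) already attaches.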