I am working with the R programming language.
I am trying to scrape the names and addresses of the pizza stores on this website https://www.yellowpages.ca/search/si/2/pizza/Canada (and on the following pages, e.g. https://www.yellowpages.ca/search/si/3/pizza/Canada, https://www.yellowpages.ca/search/si/4/pizza/Canada, etc.)
I am trying to follow the answer provided here: Scraping Yellowpages in R
library(rvest)
library(stringr)
library(tibble)

url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater Sydney, NSW&lat=&lon=&selectedViewMode=list"

testscrape <- function(url){
  webpage <- read_html(url)
  # listing names
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  # phone numbers
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  # email addresses, pulled out of the mailto: links
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40", "@")
  # index over the longest of the three vectors so shorter ones are padded with NA
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)
But this code is taking a very long time to run. I tried to investigate by running individual parts of the function, and I think I found the problem: the read_html() call itself never finishes. I tried replacing it with another statement:
library(httr)
webpage <- GET(url)
This works, but GET() returns a response object rather than the parsed HTML document that the rest of the code expects.
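As far as I can tell, the response first has to be converted back to text before read_html() will parse it; this is what I have been trying, but I am not sure it is the right approach:

library(httr)
library(rvest)

# fetch the page with httr, then parse the returned HTML text with rvest
resp <- GET("https://www.yellowpages.ca/search/si/2/pizza/Canada")
webpage <- read_html(content(resp, as = "text", encoding = "UTF-8"))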
Can someone please show me how to do this?
In the end, I would like the output to look something like this:
id name address
1 1 OJ's Steak & Pizza 9906B Franklin Ave, Fort McMurray, AB T9H 2K5
2 2 MJs Pizza & Grill 10012 Franklin Ave, Fort McMurray, AB T9H 2K6
3 3 Hu's Pizza & Donairs 10020 Franklin Ave, Fort McMurray, AB T9H 2K6
# sample results
sample_results = structure(list(id = c(1, 2, 3), name = c("OJ's Steak & Pizza",
"MJs Pizza & Grill", "Hu's Pizza & Donairs"), address = c("9906B Franklin Ave, Fort McMurray, AB T9H 2K5",
"10012 Franklin Ave, Fort McMurray, AB T9H 2K6", "10020 Franklin Ave, Fort McMurray, AB T9H 2K6"
)), class = "data.frame", row.names = c(NA, -3L))
Thanks!
CodePudding user response:
Fast, but not robust. (If either the name or the address is missing for a listing, the code will break, I think.)
library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>%
    read_html()

  # one row per listing: name and full address
  tibble(
    name = page %>%
      html_elements(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_elements(".listing__address--full") %>%
      html_text2()
  )
}
scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")
# A tibble: 35 x 2
name address
<chr> <chr>
1 OJ's Steak & Pizza 9906B Franklin Ave, Fort McMurray, AB T~
2 MJs Pizza & Grill 10012 Franklin Ave, Fort McMurray, AB T~
3 Hu's Pizza & Donairs 10020 Franklin Ave, Fort McMurray, AB T~
4 Eagle Ridge Convenience Store & Pizza 117-375 Loutit Rd, Fort McMurray, AB T9~
5 Cosmos Pizza 9713 Hardin St, Fort McMurray, AB T9H 1~
6 Boston Pizza 10202 MacDonald Ave, Fort McMurray, AB ~
7 Jomaa's Pizza & Chicken Beacon Hill Shpg Plaza, Fort McMurray, ~
8 Abasand PK's Pizza 101-307 Athabasca Ave, Fort McMurray, A~
9 Pizza 73 1-289 Powder Dr, Ft McMurray, AB T9K 0M5
10 Boston Pizza 110 Millennium Dr, Fort McMurray, AB T9~
# ... with 25 more rows
# ℹ Use `print(n = ...)` to see more rows
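To get the id/name/address format across several pages, one option is to map the scraper above over the page URLs and number the rows afterwards. A minimal sketch, assuming the page number simply slots into the URL as in the links you posted, and that every listing returns both a name and an address (tibble() will error on unequal lengths otherwise):

# pages 2 to 4; adjust the range as needed
urls <- paste0("https://www.yellowpages.ca/search/si/", 2:4, "/pizza/Canada")

all_pizza <- urls %>%
  map_dfr(function(u) {
    Sys.sleep(1)   # small pause between requests, to be polite to the site
    scraper(u)
  }) %>%
  mutate(id = row_number()) %>%
  select(id, name, address)

all_pizza

map_dfr(), mutate(), row_number() and select() all come from purrr/dplyr, which library(tidyverse) already attaches.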