I am working with the R programming language.
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results = list()
for (i in 1:391) {
  url_i = paste0(a, i, b)
  s_i = data.frame(scraper(url_i))
  ss_i = data.frame(i, s_i)
  print(ss_i)
  list_results[[i]] <- ss_i
}
final = do.call(rbind.data.frame, list_results)
My Problem: I noticed that after the 60th page, I get the following error:
Error in data.frame(i, s_i) :
arguments imply differing number of rows: 1, 0
In addition: Warning message:
In for (i in seq_along(specs)) { :
closing unused connection
To investigate, I went to the 60th page (
My Question: Is there something I can do differently to get past the 60th page, or is there some internal limitation within YellowPages that is preventing me from scraping further?
Thanks!
CodePudding user response:
This is a limit on the Yellow Pages side that prevents you from continuing past that page. A solution is to assign the return value of scraper to a variable and check its number of rows. If it is 0, break out of the for loop.
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results <- list()
for (i in 1:391) {
  url_i = paste0(a, i, b)
  s <- scraper(url_i, i)
  message(paste("page number:", i, "\trows:", nrow(s)))
  if (nrow(s) > 0L) {
    s_i <- as.data.frame(s)
    ss_i <- data.frame(i, s_i)
  } else {
    message("empty page, bailing out...")
    break
  }
  list_results[[i]] <- ss_i
}
final <- do.call(rbind.data.frame, list_results)
dim(final)
# [1] 2100 3
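Since scraper is not shown in the thread, the pattern can be tried offline with a stand-in. The sketch below is only an illustration: mock_scraper, its column names, and the assumption that pages 1 to 60 each return 35 rows while later pages return zero rows are all hypothetical, chosen to match the 2100 x 3 result above. It first reproduces the original error (data.frame() cannot combine a length-1 i with a zero-row data frame) and then applies the same row-count guard.

```r
# Hypothetical stand-in for scraper(): pages up to last_page return
# per_page rows, later pages return a zero-row data frame (an assumption).
mock_scraper <- function(page, last_page = 60L, per_page = 35L) {
  n <- if (page <= last_page) per_page else 0L
  data.frame(
    name = sprintf("listing_%d_%d", rep(page, n), seq_len(n)),
    address = rep("n/a", n),
    stringsAsFactors = FALSE
  )
}

# Reproduce the error from the question: one value of i against zero rows.
empty <- mock_scraper(61L)
err <- tryCatch(
  data.frame(i = 61L, empty),
  error = function(e) conditionMessage(e)
)
message("error on empty page: ", err)

# Same guard as in the answer: stop at the first page with no rows.
pages <- list()
for (i in 1:391) {
  s <- mock_scraper(i)
  if (nrow(s) == 0L) {
    message("empty page at i = ", i, ", stopping")
    break
  }
  pages[[i]] <- data.frame(page = i, s)
}

final <- do.call(rbind, pages)
dim(final)
```

Breaking on the first empty page assumes that every later page is empty as well. If pages could be empty intermittently, replacing break with next would skip them instead of stopping.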