Here's the context of the problem I'm facing:
I have 202 URLs stored in a vector and I'm trying to scrape information from them using a for loop.
The URLs are basically every product that shows up within this website:
However, that's not the case: when I go to the URLs that hit this problem, they work just fine.
Plus, if that were the case, I should get the error for the same elements of the vector when I run the code again, but the errors seem to happen randomly.
For example:
The first time I ran the code, I got the error on vector[6].
The second time I ran the same snippet, scraping vector[6] worked just fine.
It was also suggested that I use try() or tryCatch() to keep the error from stopping the for loop.
For that purpose, try() worked.
However, I would prefer to avoid the error altogether; otherwise I'll have to run the same snippet several times to scrape every value I need.
Can anyone help me, please?
Why is it happening and what can I do to prevent it?
Here's the code I'm running, if it helps:
for (i in 1:length(standard_ad)) {
  try(
    collectedtitles <- collect(standard_ad[i], '.ui-pdp-title'))
  assign('standard_titles', append(standard_titles, collectedtitles))
}
Here, collect is a function I created:
collect <- function(webpage, section) {
  page <- read_html(webpage)
  value <- html_node(page, section)
  value <- html_text(value)
}
Answer:
Using the link you provided, I scraped the five available pages for that query without getting any error. Could you explain in more detail how you got your error?
library(tidyverse)
library(rvest)

# Scrape one results page; `page` is the offset of the first item on it
get_products <- function(page) {
  cat("Scraping index", page, "\n")
  page <- str_c(
    "https://lista.mercadolivre.com.br/",
    "_Desde_",
    page,
    "_CustId_38356530_NoIndex_True"
  ) %>%
    read_html()
  tibble(
    title = page %>%
      html_elements(".shops__item-title") %>%
      html_text2() %>%
      str_squish(),
    # Turn "<reais> reais con <centavos> centavos" into "<reais>.<centavos>"
    price = page %>%
      html_elements(".ui-search-layout__item") %>%
      html_element(".price-tag-text-sr-only") %>%
      html_text2() %>%
      str_replace_all(" reais con ", ".") %>%
      str_remove_all(" centavos| reais") %>%
      as.numeric(),
    product_link = page %>%
      html_elements(".ui-search-result__content.ui-search-link") %>%
      html_attr("href")
  )
}

# Offsets 1, 49, 97, 145, 193: one per results page
df <- map_dfr(seq(1, 49 * 4, by = 48), get_products)
Next, the polite package scrapes the amount sold from each individual product page. polite is designed to be considerate towards the sites it scrapes, so it is slower than plain rvest but more reliable in certain scenarios. I have scraped 20 pages this way without any issues. Run the previous code and then this one:
library(polite)

sold_amount <- function(product_link) {
  cat("Scraping", product_link, "\n")
  product_link %>%
    bow(force = TRUE) %>%   # introduce the scraper to the host (checks robots.txt)
    scrape() %>%            # fetch the page through the polite session
    html_element(".ui-pdp-subtitle") %>%
    html_text2() %>%
    str_remove_all("[^0-9]") %>%   # keep only the digits
    as.numeric()
}

df <- df %>%
  mutate(sold = map_dbl(product_link, sold_amount))
# A tibble: 20 × 4
title price product_link sold
<chr> <dbl> <chr> <dbl>
1 Fone Superlux Hd661 Para Retorno Baterista Teclado … 433. https://pro… NA
2 Kit Caixas Donner Saga 12 Ativa 250w Passiva 130w… 2290 https://pro… 8
3 Fone Para Gamer Jogar Ps4 Pc Xbox One P2 Celular He… 292 https://pro… NA
4 Pandeiro Contemporânea 10 Polegadas Leve Light Cour… 233. https://pro… 131
5 Caixa De Som Ll Audio Up ! 8 Com Bluetooth Fm Usb E… 658 https://pro… 4
6 Violão De Nylon Náilon Giannini N-14 Natural Série … 464 https://pro… 2
7 Amplificador De Som Receiver Sa20 100w Usb Card Blu… 948 https://pro… 1
8 Amplificador Randall Big Dog 15w Guitarra Mostruári… 434 https://pro… NA
9 Caixa De Som Leacs 10'' Fit 160 Passiva Retorno Mon… 890 https://pro… NA
10 Caixa De Som Amplificada Ll Lx40 Microfone Guitarra… 444 https://pro… 33
11 Violão De Nylon Náilon Giannini N-14 Natural Série … 440 https://pro… 2
12 Microfone Sem Fio Jwl Headset Duplo Uhf U-585 Hh … 700. https://pro… 42
13 Direct Box Single-channel Bypass Waldman - Passivo … 159. https://pro… 1
14 Caixa Acústica Donner Saga 12 Duas Vias Passiva 130… 840. https://pro… NA
15 Mesa De Som Mixer Nca Nanomix Ll Na402r 4 Canais Bi… 359 https://pro… 14
16 Guitarra elétrica Tagima TW Series TW-61 de choupo … 1529 https://www… 154
17 5 Microfone Superlux De Mão Dinâmico C1 Cachimbo … 1598 https://pro… 3
18 Estante Máquina Ferragem Suporte Chimbal Turbo Powe… 378. https://pro… 5
19 Microfone Skypix M58 Para Igreja, Banda, Eventos...… 98 https://pro… NA
20 Teclado Yamaha Psr F52 Com Fonte Bivolt - F 52 977 https://www… 666
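If some requests still fail at random, one option is to retry a failed call a few times before giving up, instead of rerunning the whole loop. The following is only a minimal sketch in base R: `with_retry`, its `max_attempts` and `wait` parameters, and the `flaky` stand-in function are hypothetical names made up for illustration and are not part of rvest or polite.

```r
# Hypothetical helper (not part of rvest or polite): call f(x) up to
# `max_attempts` times, pausing `wait` seconds after each failed attempt.
with_retry <- function(f, x, max_attempts = 3, wait = 1) {
  for (attempt in seq_len(max_attempts)) {
    # Capture the error object instead of letting it stop the loop
    result <- tryCatch(f(x), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed for ", x, ": ",
            conditionMessage(result))
    if (attempt < max_attempts) Sys.sleep(wait)
  }
  NA_real_  # every attempt failed; NA keeps this element's position
}

# Stand-in for a flaky request: fails on the first call, then succeeds.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 2) stop("simulated connection error")
  nchar(x)
}

with_retry(flaky, "abc", wait = 0)  # returns 3 on the second attempt
```

With the code above, this could be plugged in as `map_dbl(product_link, ~ with_retry(sold_amount, .x))`, so a URL that keeps failing yields NA instead of stopping the run.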