How do I avoid "Error in open.connection(x, "rb") : HTTP error 404" when web scraping with rv

Time:01-04

Here's the context of the problem I'm facing:

I have 202 URLs stored in a vector and I'm trying to scrape information from them using a for loop.

The URLs are basically every product that shows up on this website.

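The loop sometimes fails with the HTTP error 404 from the title, which would normally mean that the page does not exist. However, that's not the case: when I go to the URLs that triggered the error, they work just fine.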

Plus, if that were the case, I should get the error for the same elements of the vector every time I rerun the code. Instead, the errors seem to happen at random.

For example:

  • The first time I ran the code, I got the error on vector[6].

  • The second time I ran the same snippet, scraping vector[6] worked just fine.

It was also suggested that I should use try() or tryCatch() to keep the error from stopping the for loop.

And for that purpose, try() worked.

However, it would be preferable to avoid the error altogether; otherwise, I'll have to run the same snippet several times in order to scrape every value I need.

Can anyone help me, please?

Why is it happening and what can I do to prevent it?

Here's the code I'm running, if it helps:

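standard_titles <- character(0)  # start from an empty result vector
for (i in seq_along(standard_ad)) {
  # try() keeps the loop running when a request fails
  collectedtitles <- try(collect(standard_ad[i], '.ui-pdp-title'))
  # Record NA for a failed URL instead of re-appending the previous title
  if (inherits(collectedtitles, "try-error")) collectedtitles <- NA_character_
  standard_titles <- append(standard_titles, collectedtitles)
}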

'collect' being a function I created:

library(rvest)

# Read a page and return the text of the first node matching a CSS selector
collect <- function(webpage, section) {
  page <- read_html(webpage)
  value <- html_node(page, section)
  html_text(value)
}

CodePudding user response:

From the link you provided, I scraped the five available pages for that query without getting any errors. Could you explain in more detail how you got your error?

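library(tidyverse)
library(rvest)

# Scrape one results page; `page` is the offset placed after "_Desde_" in the URL
get_products <- function(page) {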
  cat("Scraping index", page, "\n")
  page <- str_c(
    "https://lista.mercadolivre.com.br/",
    "_Desde_",
    page,
    "_CustId_38356530_NoIndex_True"
  ) %>%
    read_html()
  
  tibble(
    title = page %>%
      html_elements(".shops__item-title") %>%
      html_text2() %>%
      str_squish(),
    price = page %>%
      html_elements(".ui-search-layout__item") %>%
      html_element(".price-tag-text-sr-only") %>%
      html_text2() %>%
      str_replace_all(" reais con ", ".") %>%
      str_remove_all(" centavos| reais") %>%
      as.numeric(), 
    product_link = page %>% 
      html_elements(".ui-search-result__content.ui-search-link") %>% 
      html_attr("href")
  )
}

df <- map_dfr(seq(1, 49 * 4, by = 48), get_products)
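
As a quick check of the pagination in the call above, the offsets passed to get_products can be printed directly. This is only a sketch in base R; reading the step of 48 as "48 items per results page" is my inference from the `by = 48` argument, not something the site documents here.

```r
# Offsets handed to get_products(): one value per results page
offsets <- seq(1, 49 * 4, by = 48)
print(offsets)
# [1]   1  49  97 145 193

# Five offsets, matching the five available pages mentioned above
length(offsets)
```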

The amount sold can be scraped from the individual product pages with the polite package. polite is designed to be considerate towards the sites it scrapes, so it will be slower than rvest but more reliable in certain scenarios. I scraped 20 pages successfully without any issues. Run the previous code and then this one:

library(polite) 

sold_amount <- function(product_link) {
  cat("Scraping", product_link, "\n")
  product_link %>% 
    bow(force = TRUE) %>% 
    scrape() %>%  
    html_element(".ui-pdp-subtitle") %>%  
    html_text2() %>%  
    str_remove_all("[^0-9]") %>% 
    as.numeric()
}

df <- df %>%  
  mutate(sold = map_dbl(product_link, sold_amount))

# A tibble: 20 × 4
   title                                                price product_link  sold
   <chr>                                                <dbl> <chr>        <dbl>
 1 Fone Superlux Hd661 Para Retorno Baterista Teclado …  433. https://pro…    NA
 2 Kit Caixas Donner Saga 12 Ativa 250w   Passiva 130w… 2290  https://pro…     8
 3 Fone Para Gamer Jogar Ps4 Pc Xbox One P2 Celular He…  292  https://pro…    NA
 4 Pandeiro Contemporânea 10 Polegadas Leve Light Cour…  233. https://pro…   131
 5 Caixa De Som Ll Audio Up ! 8 Com Bluetooth Fm Usb E…  658  https://pro…     4
 6 Violão De Nylon Náilon Giannini N-14 Natural Série …  464  https://pro…     2
 7 Amplificador De Som Receiver Sa20 100w Usb Card Blu…  948  https://pro…     1
 8 Amplificador Randall Big Dog 15w Guitarra Mostruári…  434  https://pro…    NA
 9 Caixa De Som Leacs 10'' Fit 160 Passiva Retorno Mon…  890  https://pro…    NA
10 Caixa De Som Amplificada Ll Lx40 Microfone Guitarra…  444  https://pro…    33
11 Violão De Nylon Náilon Giannini N-14 Natural Série …  440  https://pro…     2
12 Microfone Sem Fio Jwl Headset Duplo Uhf U-585 Hh   …  700. https://pro…    42
13 Direct Box Single-channel Bypass Waldman - Passivo …  159. https://pro…     1
14 Caixa Acústica Donner Saga 12 Duas Vias Passiva 130…  840. https://pro…    NA
15 Mesa De Som Mixer Nca Nanomix Ll Na402r 4 Canais Bi…  359  https://pro…    14
16 Guitarra elétrica Tagima TW Series TW-61 de choupo … 1529  https://www…   154
17 5 Microfone Superlux De Mão Dinâmico C1   Cachimbo … 1598  https://pro…     3
18 Estante Máquina Ferragem Suporte Chimbal Turbo Powe…  378. https://pro…     5
19 Microfone Skypix M58 Para Igreja, Banda, Eventos...…   98  https://pro…    NA
20 Teclado Yamaha Psr F52 Com Fonte Bivolt - F 52        977  https://www…   666
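
If the 404s still show up at random, one option is to retry a failed request a few times before giving up, instead of rerunning the whole loop by hand. The following is a minimal sketch in base R, not something taken from rvest or polite: `with_retry` and its arguments `max_tries`, `base_wait` and `otherwise` are hypothetical names, and it assumes the errors are transient (for example, the site throttling rapid requests), which I could not confirm.

```r
# Hypothetical helper: call f(...) up to max_tries times, waiting longer
# after each failure, and return `otherwise` if every attempt fails.
with_retry <- function(f, ..., max_tries = 3, base_wait = 1, otherwise = NA) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = f(...)),
      error = function(e) list(ok = FALSE, value = conditionMessage(e))
    )
    if (result$ok) return(result$value)
    message("Attempt ", attempt, " failed: ", result$value)
    # Exponential backoff: base_wait, 2 * base_wait, 4 * base_wait, ...
    if (attempt < max_tries) Sys.sleep(base_wait * 2^(attempt - 1))
  }
  otherwise
}

# Demo without network access: a stand-in function that fails twice
# with a 404-style error and succeeds on the third call.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP error 404")
  toupper(x)
}

demo <- with_retry(flaky, "ok", base_wait = 0)
demo
# [1] "OK"
```

With the question's own function, a call such as with_retry(collect, standard_ad[i], '.ui-pdp-title', otherwise = NA_character_) would record NA for a URL that keeps failing rather than stopping the loop.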