Scrape pictures using rvest

Time:10-19

I'm trying to scrape a picture using rvest, with this code:

library(rvest)

url <- "https://fr.wikipedia.org/wiki/Robert_Jardillier"
webpage <- html_session(url)
link.titles <- webpage %>% html_nodes(".noarchive .image img")

img.url <- link.titles %>% html_attr("src")

download.file(img.url, "test.png", mode = "wb")

But when I try to download the image, I get the following message:

trying URL '//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg'
Error in download.file(img.url, "test.png", mode = "wb") : 
  cannot open URL '//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg'
In addition: Warning message:
In download.file(img.url, "test.png", mode = "wb") :
  URL '//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg': status was 'URL using bad/illegal format or missing URL'

Any help :) ?

CodePudding user response:

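The src attribute is a protocol-relative URL (it starts with "//"), so download.file() has no scheme to work with. Prepend one:

download.file(paste0("https:", img.url), "test.png", mode = "wb")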
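If the scraped vector might mix protocol-relative and complete URLs, a small helper can add the scheme only where it is missing. This is a minimal sketch in base R; the function name add_scheme and its scheme argument are hypothetical and not part of rvest, and the URL is the one from the error message above.

```r
# Hypothetical helper (not part of rvest): prepend a scheme to
# protocol-relative URLs ("//host/path") and leave other URLs untouched.
add_scheme <- function(urls, scheme = "https") {
  ifelse(startsWith(urls, "//"), paste0(scheme, ":", urls), urls)
}

# The src value reported in the error message of the question.
img.url <- "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg"

full.url <- add_scheme(img.url)

# Uncomment to download (needs network access):
# download.file(full.url, "test.jpg", mode = "wb")
```

The file is a JPEG, so "test.jpg" is probably a better destination name than "test.png". If the selector matches several images, img.url is a vector, and you would pick one element (for example full.url[1]) before calling download.file().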

CodePudding user response:

This worked for me.

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
})

url <- "https://fr.wikipedia.org/wiki/Robert_Jardillier"
page <- read_html(url)

page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  grep("Robert_Jardillier.*\\.jpg", ., value = TRUE) %>%
  unique() %>%
  basename() %>%
  paste0(url, "#/media/", .) %>%
  download.file(destfile = "test.jpg")