I wanna do web scraping in R and download zip files but I get this error

Time:11-27

I want to download all the zip files from a web page automatically and save them under their own names in a specific folder, but I'm new to web scraping. How can I fix my code?

This is the error:

Error in rawToChar(out) : 
  embedded nul in string: '<!DOCTYPE html>\n<html dir="rtl" lang="fa-IR">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="UTF-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<link rel="shortcut icon" href="http://forum.konkur.in/favicon.ico">\n<!--[if IE]>\n        <script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>\n        <script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>\n    <![endif]--><meta name="keywords" content=" ع©ظ†ع©ظ\210ط± , ط³ظ\210ط§ظ„ط§طھ ع©ظ†ع©ظ\210ط± ,ط¢ط²ظ…ظ\210ظ†ظ‡ط§غŒ ط¢ط²ظ…ط§غŒط´غŒ,ط¯ط§ظ†ظ„ظ\210ط¯ ع©طھط§ط¨ ط¯ط§ظ†ط´ع¯ط§ظ‡غŒ, ط§ط®ط¨ط§ط± ع©ظ†ع©ظ\210ط±,ط¢ط²ظ…ظ\210ظ†ظ‡ط§غŒ ط¹ظ„ظ\210ظ… ظ¾ط²ط´ع©غŒ,ط³ظ\210ط§ظ„ط§طھ ع©ظ†ع©ظ\210ط± ط§ط±ط´ط¯, ط³ظ\210ط§ظ„ط§طھ ع©ظ†ع©ظ\210ط± ط¯ع©طھط±غŒ">\n</head>\n<body>\n<h1>\n<title>ط¯ط§ظ†ظ„ظ\210ط¯ ط³ظ\210ط§ظ„ط§طھ ظ\210 ظ¾ط§ط³ط® ع©ظ†ع©ظ\210ط± ط§ط±ط´ط¯
In addition: There were 32 warnings (use warnings() to see them)

I read questions about this topic but couldn't fix my code.

This is my code:

library(tidyverse)
library(rvest)
library(stringr)

page = read_html("http://konkur.in/5850/دانلود-رایگان-سوالات-و-پاسخ-ارشد-92.html")
links1 = page %>% html_nodes(".text-single a") %>% map(html_attr, "href")
links = links[c(5:134)]

get_pdf = function(link){
  zip_page=read_html(link)
  zip = zip_page%>% html_nodes(".cont-donwload a")%>% map(html_attr, "href")
  Sys.sleep(1)  
  download.file(paste0(zip_page,zip[1]),zip[1])
  return(zip)
}

zip_files = sapply(links, FUN=get_pdf)

CodePudding user response:

Here is a way.
If no error is trapped, each element the function returns is the return value of download.file. From help("download.file"):

Value
An (invisible) integer code, 0 for success and non-zero for failure.

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
})

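# download_path defaults to NULL so the is.null() check also works when the argument is omitted
get_pdf <- function(link, download_path = NULL){
  if(is.null(download_path)) {
    download_path <- tempdir()
  }
  if(!dir.exists(download_path)) {
    dir.create(download_path, recursive = TRUE)
  }
  
  # read the intermediate page and collect the zip links on it
  zip_page <- read_html(link)
  zip <- zip_page %>% 
    html_elements(".cont-donwload a") %>% 
    html_attr("href") %>%
    lapply(\(x) {
      # save each file under its own name in download_path
      dest <- file.path(download_path, basename(x))
      Sys.sleep(1)  
      # mode = "wb" keeps binary files intact on Windows;
      # on error, return the condition instead of stopping
      tryCatch(download.file(x, dest, mode = "wb"),
               error = function(e) e
      )
    })
  zip
}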

page <- read_html("http://konkur.in/5850/دانلود-رایگان-سوالات-و-پاسخ-ارشد-92.html")
links <- page %>%
  html_elements(".text-single a") %>%
  html_attr("href") %>%
  grep("\\.html", ., value = TRUE)

dest_path <- tempdir()
zip_ok <- lapply(links, FUN = get_pdf, download_path = dest_path)
zip_ok <- unlist(zip_ok)
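
To see which downloads worked, it may help to inspect the results before flattening them, since unlist() also flattens any trapped error objects into their components. The following is only a sketch: the results list is made up to mimic what get_pdf returns for one page (0 from download.file on success, a condition object from tryCatch on error), and the names is_ok and failed_msgs are hypothetical.

```r
# Hypothetical results, shaped like the list get_pdf() returns for one page:
# download.file() yields 0 on success; tryCatch() yields a condition on error.
results <- list(
  0L,
  simpleError("cannot open URL"),  # made-up failure, for illustration only
  0L
)

# TRUE where the download reported success (status code 0), FALSE otherwise.
is_ok <- vapply(
  results,
  function(r) is.numeric(r) && length(r) == 1L && r == 0,
  logical(1)
)

# Describe each failure: the error message for a trapped condition,
# or the non-zero status code returned by download.file().
failed_msgs <- vapply(
  results[!is_ok],
  function(r) {
    if (inherits(r, "condition")) conditionMessage(r) else paste("status", r)
  },
  character(1)
)
```

With the real output, the same checks could be applied to each element of the list returned by lapply(links, ...) instead of to the made-up results.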