I want to automatically download all the zip files from a web page and save them under their own names in a specific folder, but I'm new to web scraping. How can I fix my code?
This is the error:
Error in rawToChar(out) :
embedded nul in string: '<!DOCTYPE html>\n<html dir="rtl" lang="fa-IR">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="UTF-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<link rel="shortcut icon" href="http://forum.konkur.in/favicon.ico">\n ... (the remainder of the string is the page's Persian <meta> keywords and <title>, rendered unreadable by an encoding mismatch)
In addition: There were 32 warnings (use warnings() to see them)
I have read other questions about this topic but still couldn't fix my code.
This is my code:
library(tidyverse)
library(rvest)
library(stringr)
page = read_html("http://konkur.in/5850/دانلود-رایگان-سوالات-و-پاسخ-ارشد-92.html")
links1 = page %>% html_nodes(".text-single a") %>% map(html_attr, "href")
links = links[c(5:134)]
get_pdf = function(link){
  zip_page = read_html(link)
  zip = zip_page %>% html_nodes(".cont-donwload a") %>% map(html_attr, "href")
  Sys.sleep(1)
  download.file(paste0(zip_page, zip[1]), zip[1])
  return(zip)
}
zip_files = sapply(links, FUN=get_pdf)
CodePudding user response:
Here is a way.
If there is no trappable error, the return value of the function is, from help("download.file"):
Value
An (invisible) integer code, 0 for success and non-zero for failure.
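That integer code is what you want to inspect afterwards. As a minimal sketch (the URL below is a placeholder, not one of the forum's zip links), you can wrap download.file so that a caught error and a non-zero code are both reported as failure:

```r
# Sketch: wrap download.file and reduce its result to TRUE/FALSE.
# An error raised during the download is treated as a non-zero code.
safe_download <- function(url, dest) {
  status <- tryCatch(download.file(url, dest, mode = "wb"),
                     error = function(e) 1L)
  status == 0L   # TRUE only when download.file reported success (code 0)
}
```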
suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
})

get_pdf <- function(link, download_path){
  if(is.null(download_path)) {
    download_path <- tempdir()
  }
  if(!dir.exists(download_path)) {
    dir.create(download_path)
  }
  zip_page <- read_html(link)
  zip <- zip_page %>%
    html_elements(".cont-donwload a") %>%
    html_attr("href") %>%
    lapply(\(x) {
      dest <- file.path(download_path, basename(x))
      Sys.sleep(1)
      # mode = "wb" keeps binary files such as zips intact on Windows
      tryCatch(download.file(x, dest, mode = "wb"),
               error = function(e) e
      )
    })
  zip
}
page <- read_html("http://konkur.in/5850/دانلود-رایگان-سوالات-و-پاسخ-ارشد-92.html")
links <- page %>%
  html_elements(".text-single a") %>%
  html_attr("href") %>%
  grep("\\.html", ., value = TRUE)

dest_path <- tempdir()
zip_ok <- lapply(links, FUN = get_pdf, download_path = dest_path)
zip_ok <- unlist(zip_ok)
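Because each element of the list returned by get_pdf is either download.file's integer code or a condition object caught by tryCatch, you can tally the successes before flattening. A sketch with stand-in data (the real list would come from the lapply call above):

```r
# Sketch: classify each download result. Assumes `results` is a list whose
# elements are either an integer code from download.file or a caught error.
results <- list(0L, 1L, simpleError("404 Not Found"))   # stand-in data

# success = a numeric code equal to 0; error objects count as failures
ok <- vapply(results, function(r) is.numeric(r) && r == 0L, logical(1))
sum(ok)   # number of files that downloaded successfully
```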