Cannot access data on the web: HTTP status was '403 Forbidden'

Here is my simple code:

url1 <- 'https://www.sec.gov/Archives/edgar/data/0001336528/0001172661-21-001865.txt'
data1 <- readLines(url1)

The result is this error:

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/0001336528/0001172661-21-001865.txt': HTTP status was '403 Forbidden'

I tried many approaches and concluded that the site rejects my request when it comes from R (with this or any other code). Occasionally I got no error and the code worked fine, but usually it fails. I can always save the .txt file directly from the browser (I cannot save it to my PC using R) and then import it from the local file.

For example, I save the page as a .txt file and then run:

data1 <- readLines("Persh01.txt")

Because it worked sometimes, I also wrote a loop that retries until it succeeds. That did the job, but after I switched to a different PC it no longer seems to work.

data1 <- NA
data1 <- try(readLines(url1))
while (inherits(data1, "try-error")) {
  data1 <- try(readLines(url1))
}

Could someone help me? Thanks.

Answer:

You need to pass a couple of headers to the server before it accepts your request. In this case, you need an appropriate User-Agent string and a Connection header set to "keep-alive" to prevent the 403 error.

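library(httr)

url1 <- 'https://www.sec.gov/Archives/edgar/data/0001336528/0001172661-21-001865.txt'
# Browser-like User-Agent string to send instead of the default one
UA <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0"

# Send the request with the two extra headers
res   <- GET(url1, add_headers(`Connection` = "keep-alive", `User-Agent` = UA))
# Extract the body as text (UTF-8 is what httr falls back to anyway) and split it into lines
data1 <- strsplit(content(res, as = "text", encoding = "UTF-8"), "\n")[[1]]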

head(data1, 10) 

#>  [1] "<SEC-DOCUMENT>0001172661-21-001865.txt : 20210816"   
#>  [2] "<SEC-HEADER>0001172661-21-001865.hdr.sgml : 20210816"
#>  [3] "<ACCEPTANCE-DATETIME>20210816163055"                 
#>  [4] "ACCESSION NUMBER:\t\t0001172661-21-001865"             
#>  [5] "CONFORMED SUBMISSION TYPE:\t13F-HR"                   
#>  [6] "PUBLIC DOCUMENT COUNT:\t\t2"                           
#>  [7] "CONFORMED PERIOD OF REPORT:\t20210630"                
#>  [8] "FILED AS OF DATE:\t\t20210816"                         
#>  [9] "DATE AS OF CHANGE:\t\t20210816"                        
#> [10] "EFFECTIVENESS DATE:\t\t20210816"
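
If you would rather keep the original readLines() call and avoid an extra package, base R lets you change the User-Agent it sends through the HTTPUserAgent option. The sketch below is only an illustration: read_lines_with_ua is a hypothetical helper name, and it sets the User-Agent alone, so whether that is enough for this server without the Connection header is an assumption you would need to check.

```r
# Hypothetical helper (not part of the code above): read lines from a URL
# or file path while temporarily overriding the User-Agent that base R sends.
read_lines_with_ua <- function(path_or_url, ua) {
  # options() returns the previous values, so they can be restored on exit
  old <- options(HTTPUserAgent = ua)
  on.exit(options(old), add = TRUE)
  readLines(path_or_url, warn = FALSE)
}

# Usage (not run here, since it needs network access):
# data1 <- read_lines_with_ua(url1, UA)
```

The option is restored when the function exits, so other downloads in the same session keep R's default User-Agent.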

Note that the site's robots.txt file disallows web crawling and indexing of this part of the site, so you need to check that you are not violating the site's usage policy.
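
If you still want a retry loop like the one in the question, it is gentler on the server to cap the number of attempts and pause between them instead of looping without limit. A minimal sketch, assuming a hypothetical helper named retry with illustrative defaults (5 attempts, 1 second pause):

```r
# Hypothetical helper: call fetch() up to max_tries times, sleeping `wait`
# seconds after each failed attempt, and stop with an error if all fail.
retry <- function(fetch, max_tries = 5, wait = 1) {
  stopifnot(max_tries >= 1)
  for (i in seq_len(max_tries)) {
    result <- try(fetch(), silent = TRUE)
    if (!inherits(result, "try-error")) {
      return(result)
    }
    # Pause before the next attempt, but not after the last one
    if (i < max_tries) Sys.sleep(wait)
  }
  stop("all ", max_tries, " attempts failed; last error: ",
       conditionMessage(attr(result, "condition")))
}

# Usage (not run here, since it needs network access):
# data1 <- retry(function() readLines(url1))
```

Unlike the unbounded while loop, this gives up with the last error message once the attempts are used up, which makes a persistent 403 visible instead of spinning forever.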