I am trying to programmatically download the two zip files on this page:
https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/siren
The two zip files are actually hosted on separate pages, but the hrefs that point to those pages are inside this page. So, what I want to do is:
- get the links to the pages where each of the two zip files resides (they are on a public Google Drive)
- download the two zip files to my computer
(Yes, I know I can download them manually, but there are more pages I need to download from, so I would like to automate this process.)
Unfortunately, I can't even get the first step going. I start by loading the page into rvest and then try to get the element div.flip-entry-info, but this yields no results. I believe this is because it is part of an iframe inside this page. So, how do I access the elements that contain the hrefs that point to the actual location of these files?
For the second step, I need to find a way to download the data from Google Drive.
For example, one of these two zip files is available at: https://drive.google.com/file/d/1BFN_1n-5EZ3rLrqrqWsAsBR9exjXuUKF/view.
But I have absolutely no clue as to how to download the file from there. The 'inspect' option in Chrome doesn't work on this page, and SelectorGadget doesn't reveal anything useful either.
Can anyone help me to download these files through R? I am totally stuck.
CodePudding user response:
We can get the links inside the iframe by first reading the iframe's src attribute and then scraping the page that src points to.
You can refer to the tutorial here:
https://github.com/yusuzech/r-web-scraping-cheat-sheet/blob/master/README.md#rvest7.2
library(rvest)
library(magrittr)
# get the src of the iframe embedded in the page
link <- 'https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/siren' %>%
  read_html() %>%
  html_nodes("iframe") %>%
  html_attr("src")
# get links of both the files
link %>%
  read_html() %>%
  html_nodes(".flip-entry-info") %>%
  html_nodes('a') %>%
  html_attr('href')
[1] "https://drive.google.com/file/d/1cio3RzjDO6e78PKdEFSPgdw4tCJ7_VUi/view?usp=drive_web"
[2] "https://drive.google.com/file/d/1BFN_1n-5EZ3rLrqrqWsAsBR9exjXuUKF/view?usp=drive_web"
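Both links printed above have the same shape: the file id sits between /file/d/ and /view. As a rough sketch (not part of the original answer), the id can be pulled out with a regular expression, which is handy for naming the downloaded files. The helper names drive_file_id and drive_direct_url are made up for illustration, and the uc?export=download URL is an assumption that tends to hold only for small, publicly shared files.

```r
# Extract the file id from Drive links of the form .../file/d/<id>/view...
# Returns NA for strings that do not match that shape.
drive_file_id <- function(url) {
  pattern <- "^.*/file/d/([A-Za-z0-9_-]+).*$"
  ifelse(grepl(pattern, url), sub(pattern, "\\1", url), NA_character_)
}

# Build a direct-download URL from the id.
# Assumption: this URL pattern is not guaranteed for every file.
drive_direct_url <- function(url) {
  id <- drive_file_id(url)
  ifelse(is.na(id), NA_character_,
         paste0("https://drive.google.com/uc?export=download&id=", id))
}

links <- c(
  "https://drive.google.com/file/d/1cio3RzjDO6e78PKdEFSPgdw4tCJ7_VUi/view?usp=drive_web",
  "https://drive.google.com/file/d/1BFN_1n-5EZ3rLrqrqWsAsBR9exjXuUKF/view?usp=drive_web"
)
ids <- drive_file_id(links)
```

The ids can then be used, for example, as file names when saving each archive.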
To download the files we can use the googledrive package.
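library(googledrive)
# for public files, calling drive_deauth() first skips the Google sign-in prompt
temp <- tempfile(fileext = ".zip")
# as_id() accepts the full Drive URL and extracts the file id from it
drive_download(
  as_id("https://drive.google.com/file/d/1cio3RzjDO6e78PKdEFSPgdw4tCJ7_VUi/view?usp=drive_web"),
  path = temp,
  overwrite = TRUE
)
# unzip() returns the paths of the extracted files
out <- unzip(temp, exdir = tempdir())
df <- read.csv(out, sep = ",")
str(df)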
'data.frame': 44 obs. of 45 variables:
$ ï..: int 1 2 3 4 5 6 7 8 9 10 ...
$ X1 : int 0 1 1 1 1 1 1 1 1 1 ...
$ X2 : int 1 0 0 0 0 0 0 0 0 0 ...
$ X3 : int 1 0 0 0 0 1 0 0 0 0 ...
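The odd first column name (ï..) in the str() output is most likely a UTF-8 byte-order mark at the start of the CSV, and since the question mentions more pages to process, the download step will probably need to run in a loop. The following is a minimal sketch of both ideas, not a tested pipeline: read_extracted_csvs and fetch_drive_zips are hypothetical helper names, and it assumes every archive contains CSV files and is publicly shared.

```r
# Read every CSV among the paths returned by unzip().
# fileEncoding = "UTF-8-BOM" drops a leading byte-order mark if present.
read_extracted_csvs <- function(paths) {
  csv_paths <- paths[grepl("\\.csv$", paths, ignore.case = TRUE)]
  tables <- lapply(csv_paths, read.csv, fileEncoding = "UTF-8-BOM")
  names(tables) <- basename(csv_paths)
  tables
}

# Download each Drive link to a temporary zip, extract it, and read its CSVs.
# Needs the googledrive package and network access; defined here, not run.
fetch_drive_zips <- function(links) {
  googledrive::drive_deauth()  # public files: skip the sign-in prompt
  lapply(links, function(link) {
    zip_path <- tempfile(fileext = ".zip")
    googledrive::drive_download(googledrive::as_id(link),
                                path = zip_path, overwrite = TRUE)
    # extract into a fresh directory so archives do not overwrite each other
    read_extracted_csvs(unzip(zip_path, exdir = tempfile("unzipped_")))
  })
}
```

Calling fetch_drive_zips() on the vector of hrefs scraped earlier should return one list of data frames per archive. Extracting each zip into its own directory avoids name clashes when several archives contain files with the same name.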