Saving variables to a dataframe/list with a nested loop

Time:09-21

I'm currently working on a project to scrape information from a website's source code. This is the code I have so far:

require(dplyr)
require(tidyverse)
require(stringi)
require(stringr)
require(rvest)
require(purrr)
library(data.table)

datalist = list()   
# Looping through all pages on the website
for (page in 1:91){


    # Constructing the URL to download the html source code from
    WebUrl <- paste0("https://www.apple.com/newsroom/archive/?page=",
            page)


    
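    download.file(WebUrl, 
            destfile = paste0("tempdir/Source code", page, ".txt"))

    # Parsing the downloaded source code so the nodes can be extracted below
    webpages <- read_html(paste0("tempdir/Source code", page, ".txt"))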

    # Grabbing the relevant node from the source code and converting it to a df
    webpages_df <- webpages %>% 
       html_nodes("a") %>%
       map(html_attrs) %>%
       map_df(~as.list(.))

    # Removing NA values from the "aria-label" column where the relevant string is and 
    # renaming the column
    headlines <- as.data.frame(webpages_df$`aria-label`) %>%
       filter(!is.na(webpages_df$`aria-label`)) %>%
       setnames(old = "webpages_df$`aria-label`", new = "Strings")

    # Removing the not relevant strings
    # Regex is matching for any word with a number, comma and four digits behind. E.g 
    # September 1, 2021
    headlines <- headlines %>%
       filter(grepl("([A-Z]\\w+\\s[0-9][0-9]?,+\\s[0-9][0-9][0-9][0-9]?)", Strings))

    # Looping over the rows to extract the different variables and store them
    # Each variable is created with regex to extract the relevant information
    # The goal for the loop is to extract the values from a source file for a given 
    # node with relevant information

    for (r in 1:nrow(headlines)){
       dates <- stri_extract_all(headlines$Strings, regex = "([A-Z]\\w+\\s[0-9][0-9]?,+\\s[0-9][0-9][0-9][0-9]?)")
       category <- stri_extract_all(headlines$Strings, regex = "([A-Z][A-Z][A-Z][A-Z][A-Z]\\s\\w+|[A-Z][A-Z][A-Z]\\w+)")
       titles <- str_remove_all(headlines$Strings, pattern = "([A-Z]\\w+\\s[0-9][0-9]?,+\\s[0-9][0-9][0-9][0-9]) (-\\s[A-Z]?\\w+.[A-Z]+..)" )
       article.url <- webpages_df %>%
          filter(grepl( pattern = "(/[a-z]?\\w[0-9]+/[0-9]+/[a-z]?\\w+.\\w+[a-z]?)", href))
       article.url <- paste0("https://www.apple.com", article.url$href)

       tempmatrix <- matrix(c(dates, category, titles, article.url), ncol = 4)

       datalist[[r]] <- rbind(tempmatrix, datalist)
    }

}

This works to download the source code of every page to the set directory, but I can't seem to get the nested loop to work. My goal is to loop through each source code file, create the variables date, category, titles and urls, and append them to a list outside the loop, so that I can later convert it to a structured dataframe with the columns listed above.

While this code block does not work, I can get it to work without the nested loop and tempmatrix/datalist[[r]]. The result of that is only the last file's information, in the structure I want.

Would greatly appreciate input/tips on how to solve my issue at hand. I'm a novice in R so my code is probably inefficient.

CodePudding user response:

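Try building up the rows for one page in your inner loop, then storing that result in the list once per iteration of your outer loop, indexed by page rather than by r. In your version, datalist[[r]] is overwritten whenever a later page reuses the same r, and rbind is applied to the list itself instead of to the rows collected so far. Let me know if this doesn't work.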

for (page in 1:91){
   intermediate <- c()
   ...
   for (r in 1:nrow(headlines)){
       ...
       intermediate <- rbind(tempmatrix, intermediate)
   }
   datalist[[page]] <- intermediate
}
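
The same pattern can be sketched end to end without any downloads. This is only a minimal illustration under stated assumptions: the sample strings, the "date - category - title" layout, the date_pattern regex and the parse_page helper are all hypothetical stand-ins, not the real aria-label values from the site. It stores one data frame per page in the list and binds them at the end. Because regexpr and regmatches are vectorized over the whole character vector, this sketch needs no inner loop over rows.

```r
# Hypothetical sample strings standing in for the aria-label values of two pages
pages <- list(
  c("September 1, 2021 - UPDATE - First headline",
    "September 3, 2021 - PRESS RELEASE - Second headline"),
  c("August 30, 2021 - UPDATE - Third headline")
)

# Assumed date layout: month name, day, comma, four-digit year
date_pattern <- "[A-Z][a-z]+ [0-9]{1,2}, [0-9]{4}"

# Hypothetical helper: turn all strings of one page into a data frame.
# Assumes every string contains a date; regmatches() drops non-matching
# elements, which would make the columns differ in length.
parse_page <- function(strings, page) {
  data.frame(
    page = page,
    date = regmatches(strings, regexpr(date_pattern, strings)),
    text = sub(paste0("^", date_pattern, " - "), "", strings),
    stringsAsFactors = FALSE
  )
}

# One list slot per page, so later pages never overwrite earlier ones
datalist <- vector("list", length(pages))
for (page in seq_along(pages)) {
  datalist[[page]] <- parse_page(pages[[page]], page)
}

# Bind all per-page data frames into a single data frame
result <- do.call(rbind, datalist)
print(result)
```

Binding once after the loop, rather than calling rbind on every iteration, also avoids repeatedly copying the growing result.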