Web crawling in R through multiple URLs


I'm working on a web crawling project where I'd like to start at a main URL here: https://law.justia.com/codes/

Ultimately, I'd like to end up with a list of the URLs whose pages contain the actual state code text. For example, if you go to the webpage above, you can navigate to

Montana > 2021 Montana Code > Title 1 > General Provisions > Part 1 > 1-1-101 > 

and then you land on a page that does not contain any further links to statute sections and instead has actual statute text. I'd like to collect the URL for this page as well as for all the other terminal pages.

I've started with the following code

library(rJava)
library(rvest)
library(purrr)
library(Rcrawler)

page <- LinkExtractor(url = "https://law.justia.com/codes/")

page$InternalLinks

new_links <- list()
for(i in 1:9){
  output <- LinkExtractor(url = page$InternalLinks[[i]])
  new_links[[i]] <- output
}

This results in new_links, a list that contains one LinkExtractor result for each of the first 9 URLs (as a test I started with just 9) along with the internal links each one contains. So: 9 lists, each made up of three lists.

And that's where I'm at. I'm not sure where to go from here. I assume it will involve a loop of some kind, but I'm struggling to write something that doesn't result in a list of lists of lists of lists...

I'm also not sure yet how I will differentiate the terminal URLs from the URLs that still need to be searched for further URLs.
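
Something like this untested sketch is what I'm picturing for the next step: flatten the InternalLinks out of new_links into a single vector of candidate URLs, then test each one for being a terminal page. Here I'm guessing that listing pages use a div with class codes-listing for their sub-links and that terminal pages don't; I haven't confirmed the class name.

# Untested sketch: pull InternalLinks out of each LinkExtractor result
# and flatten them into a single character vector for the next crawl pass
candidate_urls <- unique(unlist(purrr::map(new_links, "InternalLinks")))

# Guessed terminal-page test: a page with no further listing links
# ("div.codes-listing" is an assumption about the site's markup)
is_terminal <- function(url) {
  listing <- read_html(url) %>% html_elements("div.codes-listing a")
  length(listing) == 0
}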

CodePudding user response:

Here is a way to scrape all the links. It doesn't need rJava or Rcrawler; it only uses rvest to scrape the web. I also load the packages dplyr and stringr.

The code is not complicated: there is one function that scrapes the links at each level, and it is applied repeatedly until the pages with the statute text are reached, each time feeding it the links obtained in the previous step.

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
  library(stringr)
})

scrape_law.justia.com_codes <- function(link, class, entry_point = FALSE) {
  div_class <- paste("div", class, sep = ".")
  # the entry page has several listing divs; deeper pages only need the first one
  if(entry_point) {
    HTML_FUN <- html_elements
  } else {
    HTML_FUN <- html_element
  }
  link %>%
    read_html() %>%
    HTML_FUN(div_class) %>%
    html_elements("a") %>%
    html_attr("href") %>%
    grep("codes", ., value = TRUE) %>%
    # keep what comes after "/codes/" (the `group` argument needs stringr >= 1.5.0)
    str_extract(".*(/codes/)(.*)", group = 2) %>%
    # main_page_link is a global defined right below
    paste0(main_page_link, .)
}

main_page_link <- "https://law.justia.com/codes/"

# first, scrape the states' links
states_list <- scrape_law.justia.com_codes(main_page_link, "block", TRUE)
# the first two links returned are not state code listings, so drop them
states_list <- states_list[-(1:2)]

# now the years
years_list <- states_list %>%
  lapply(scrape_law.justia.com_codes, class = "wrapper")
years_list <- unlist(years_list)

# then the titles
titles_list <- years_list %>%
  lapply(scrape_law.justia.com_codes, class = "codes-listing")
titles_list <- unlist(titles_list)

# then the chapters
chapters_list <- titles_list %>%
  lapply(scrape_law.justia.com_codes, class = "codes-listing")
chapters_list <- unlist(chapters_list)

# then the sections
sections_list <- chapters_list %>%
  lapply(scrape_law.justia.com_codes, class = "codes-listing")
sections_list <- unlist(sections_list)

# finally the texts
text_list <- sections_list %>%
  sapply(\(x) {
    x %>%
      read_html() %>%
      html_element("div.block") %>%
      html_elements("p") %>%
      html_text(trim = TRUE)
  })

# Note that this is a named list.
# Each list member is named after the link
# both ways below return the same text
text_list[["https://law.justia.com/codes/alaska/2021/title-1/chapter-05/section-01-05-006/"]]
text_list[[1]]
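
If you also want the terminal URLs in a flat table next to the statute text, rather than only as the names of text_list, a small follow-up step along these lines should do (a sketch, assuming text_list stayed a named list as noted above):

# Sketch: one row per section, URL next to the collapsed statute text
sections_df <- dplyr::tibble(
  url  = names(text_list),
  text = vapply(text_list, paste, character(1), collapse = "\n")
)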

CodePudding user response:

That website is not as structured as it first appears. For example, some chapters have articles before the sections. You can try the following code, which I tested on the first few sections of the state of Alabama:

library(rvest)

scrape <- function(URL) {
  
  label <- list()
  text <- list()
  
  page <- read_html(paste0(URL, "/codes"))
  links <- page %>% html_nodes("a") %>% html_attr("href")
  
  # State links
  state_links <- grep('^[/]codes', links, value=TRUE)
  
  get_link <- function(URL, x){
    url <- paste0(URL, x) 
    page <- read_html(url)
    return(page %>% html_nodes("a") %>% html_attr("href"))
  }
    
  i <- 1
  for(s in state_links){
    links <- get_link(URL, s)
    
    # Year links
    regex <- paste(s, '[0-9]{4}', sep="")
    year_links <- grep(regex, links, value=TRUE)
    
    for(y in year_links) {
      links <- get_link(URL, y)
      
      # Title links
      regex <- paste0(y, "title-")
      title_links <- grep(regex, links, value=TRUE)
      
      for(t in title_links) {
        links <- get_link(URL, t)
        
        # Chapter links
        regex <- paste0(t, "chapter-")
        chapter_links <- grep(regex, links, value=TRUE)
        
        for(ch in chapter_links) {
          links <- get_link(URL, ch)
          
          # Article links (optional)
          regex <- paste(ch, 'article-', sep="")
          article_links <- grep(regex, links, value=TRUE)

          if(length(article_links)>0){
            # Section links, collected across all articles of the chapter
            section_links <- c()
            for(ar in article_links) {
              links <- get_link(URL, ar)
              
              regex <- paste(ar, 'section-', sep="")
              section_links <- c(section_links, grep(regex, links, value=TRUE))
            }
          }
          else {
            # Section links (without articles)
            regex <- paste(ch, 'section-', sep="")
            section_links <- grep(regex, links, value=TRUE)
          }
          
          for(se in section_links) {
            url <- paste0(URL, se) 
            page <- read_html(url)
            
            label[[i]] <- paste(se, sep=" - ")
            text[[i]] <- page %>% html_nodes("p") %>% html_text()
            i <- i + 1
            cat(paste("\nScraping:", se, "\n"))
          }
        }
      }
    }
  }
  return(list(label=label, text=text))
}

url <- "https://law.justia.com"
result <- scrape(url)
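
For downstream work you can flatten the returned label/text lists into a data frame, for instance (a sketch based on the result object returned above):

# Sketch: one row per scraped section, paragraphs collapsed to one string
result_df <- data.frame(
  section = unlist(result$label),
  text = vapply(result$text, paste, character(1), collapse = "\n"),
  stringsAsFactors = FALSE
)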