R Scraping for image details from several thousand pages

Time:12-13

I am trying to write an R script that scrapes details about pictures from a website.

What I need is:

  • Image name (1.jpg)
  • Image caption ("A recruit demonstrates the proper use of a CO2 portable extinguisher to put out a small outside fire.")
  • Photo credit ("Photo courtesy of: James Fortner")

There are over 16,000 files, and thankfully the URL goes "...asp?photo=1", then 2, 3, 4, and so on, so there is a base URL which doesn't change; only the image number at the end does. I would like the script to loop either over a set range (I tell it where to start) or until it reaches a page which doesn't exist.

Using the code below, I can get the caption of the photo, but only one line. I would also like to get the photo credit, which is on a separate line; there are three line breaks between the main caption and the photo credit. I'd be fine if the generated table had two or three blank columns to account for the line breaks, as I can delete them later.

library(rvest)
library(dplyr)

link = "http://fallschurchvfd.org/photovideo.asp?photo=1"
page = read_html(link)

caption = page %>% html_nodes(".text7 i") %>% html_text()

info = data.frame(caption, stringsAsFactors = FALSE)
write.csv(info, "photos.csv")

CodePudding user response:

For the images, you can use the command-line tool curl. For example, to grab images 1.jpg through 100.jpg:

curl -O "http://fallschurchvfd.org/photos/[1-100].jpg"
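
If you would rather stay in R, the same download could be sketched with base R's download.file. This is only a rough sketch, assuming the images live at /photos/<n>.jpg as in the curl example; image_urls and download_images are hypothetical helper names, not part of any package.

```r
# Build the image URLs; the /photos/<n>.jpg pattern is assumed from the curl example
image_urls <- function(ids, base = "http://fallschurchvfd.org/photos/") {
  paste0(base, ids, ".jpg")
}

# Download each image into `dest_dir`, reporting and skipping any that fail
download_images <- function(ids, dest_dir = "photos") {
  dir.create(dest_dir, showWarnings = FALSE)
  for (u in image_urls(ids)) {
    dest <- file.path(dest_dir, basename(u))
    tryCatch(
      download.file(u, dest, mode = "wb", quiet = TRUE),
      error = function(e) message("Skipped ", u)
    )
  }
}

print(image_urls(1:3))
# download_images(1:100)  # uncomment to actually fetch the files
```

The download call is left commented out so that the URLs can be checked before any requests are sent.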

For the R code, if you grab the whole .text7 section, you can then split it into caption and photo credit:

library(stringr)  # str_split() comes from stringr

extractedtext <- page %>% html_nodes(".text7") %>% html_text()
parts <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]]
caption <- parts[1]
credit <- parts[3]
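
To see what the split does, here is a minimal base-R sketch on a made-up string. The sample text and its layout (caption, an empty piece, then the credit) are assumptions modeled on the caption and credit quoted in the question; the real page's whitespace may differ, so it is worth printing extractedtext before relying on positions 1 and 3.

```r
# Hypothetical sample of what html_text() might return for .text7
# (assumed layout: caption, an empty piece, then the credit)
sep <- "\r\n\t\t\t\t"
sample_text <- paste0(
  "A recruit demonstrates the proper use of a CO2 portable extinguisher.",
  sep, sep,
  "Photo courtesy of: James Fortner"
)

# fixed = TRUE treats the separator as a literal string, not a regex
parts <- strsplit(sample_text, sep, fixed = TRUE)[[1]]

caption <- trimws(parts[1])
credit  <- trimws(parts[3])

print(caption)
print(credit)
```

If the credit turns up at a different position on the real pages, picking the piece that starts with "Photo courtesy of" may be more robust than a fixed index.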

As a loop:

library(rvest)
library(dplyr)
library(stringr)  # needed for str_split()

# Pre-allocate one row per page; pages that fail keep NA
df <- data.frame(id = 1:20,
                 caption = NA_character_,
                 credit = NA_character_)
for (i in 1:20) {
  cat(i, " ")
  link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
  tryCatch({
    page <- read_html(link)
    extractedtext <- page %>% html_nodes(".text7") %>% html_text()
    parts <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]]
    df$caption[i] <- parts[1]
    df$credit[i] <- parts[3]
  },
  error = function(e) cat("ERROR :", conditionMessage(e), "\n"))
}

CodePudding user response:

Scraping with rvest and tidyverse

library(tidyverse)
library(rvest)

get_picture <- function(page_number) {
  cat("Scraping page", page_number, "\n")

  page <- str_c("http://fallschurchvfd.org/photovideo.asp?photo=", page_number) %>%
    read_html()

  # Split the .text7 text once; piece 1 is the caption, piece 3 the credit
  parts <- page %>%
    html_element(".text7") %>%
    html_text() %>%
    str_split(pattern = "\r\n\t\t\t\t") %>%
    unlist()

  tibble(
    image_name = page %>%
      html_element(".text7 img") %>%
      html_attr("src"),
    caption = nth(parts, 1),
    credit = nth(parts, 3)
  )
}

# Scrape pages 1 to 50; pages that fail return an empty tibble and are dropped
df <- map_dfr(1:50, possibly(get_picture, otherwise = tibble()))

# A tibble: 42 × 3
   image_name     caption                                   credit
   <chr>          <chr>                                     <chr> 
 1 /photos/1.jpg  Recruit Clay Hamric demonstrates the use… James…
 2 /photos/2.jpg  A recruit demonstrates the proper use of… James…
 3 /photos/3.jpg  Recruit Paul Melnick demonstrates the pr… James…
 4 /photos/4.jpg  Rescue 104                                James…
 5 /photos/5.jpg  Rescue 104                                James…
 6 /photos/6.jpg  Rescue 104                                James…
 7 /photos/15.jpg Truck 106 operates a ladder pipe from Wi… Jim O…
 8 /photos/16.jpg Truck 106 operates a ladder pipe as heav… Jim O…
 9 /photos/17.jpg Heavy fire vents from the roof area of t… Jim O…
10 /photos/18.jpg Arlington County Fire and Rescue Associa… James…
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
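
The output above jumps from 6.jpg to 15.jpg, so some photo numbers apparently have no usable page. A loop that stops at the first failure would therefore stop far too early. One possible workaround, sketched below in base R, is to stop only after a run of consecutive failures. The names scrape_until_gap, fetch_one, max_misses and fake_fetch are hypothetical, and the stand-in fetcher only imitates the gap seen above rather than contacting the site.

```r
# scrape_until_gap: call fetch_one(i) for i = start, start + 1, ...
# and stop after `max_misses` consecutive failures or after `max_pages` pages.
# `fetch_one` is a hypothetical function that returns a one-row data frame,
# or signals an error when the page is missing.
scrape_until_gap <- function(fetch_one, start = 1, max_misses = 20,
                             max_pages = 20000, delay = 0) {
  results <- list()
  misses <- 0
  i <- start
  while (misses < max_misses && i < start + max_pages) {
    row <- tryCatch(fetch_one(i), error = function(e) NULL)
    if (is.null(row) || nrow(row) == 0) {
      misses <- misses + 1
    } else {
      misses <- 0  # a successful page resets the failure counter
      results[[length(results) + 1]] <- row
    }
    i <- i + 1
    Sys.sleep(delay)  # pause between requests to go easy on the server
  }
  if (length(results) == 0) return(data.frame())
  do.call(rbind, results)
}

# Stand-in fetcher for demonstration: pretends pages 7-14 and 21+ are missing
fake_fetch <- function(i) {
  if ((i >= 7 && i <= 14) || i > 20) stop("page not found")
  data.frame(id = i, image_name = paste0("/photos/", i, ".jpg"),
             stringsAsFactors = FALSE)
}

demo <- scrape_until_gap(fake_fetch, max_misses = 10)
print(demo$id)
```

With the real site, get_picture could be passed as fetch_one, probably with delay set to a second or so; a sensible value for max_misses depends on how long the longest gap in the numbering is, which is not known in advance.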