Reading HTML into an R data frame using rvest

Time:10-01

I am trying to scrape data from https://homicides.news.baltimoresun.com/recent/ using rvest and put information on victims into a data table or frame.

What I have so far is:

library(rvest)

html <- read_html(x = "https://homicides.news.baltimoresun.com/recent/")
html_node(html, ".recentvictims") %>% 
    html_children() %>% 
    head() %>% 
    html_text2()

which gives me a character vector with one string per row, but I can't find a way to put this into a data frame:

[1] "Date & time\nVictim name\nAddress\nAge\nGender\nRace"
[2] "09/26/2022 7:15 p.m.\n\n1900 Griffis Ave\n—\nMale\nUnknown"
[3] "09/21/2022 1:45 p.m.\nKelly Logan\n2100 Kloman St\n53\nFemale\nBlack"
[4] "09/20/2022 9:00 a.m.\nDelon Bushrod\n2800 Bookert Dr\n24\nMale\nBlack"
[5] "09/19/2022 8:06 p.m.\nTerry Gordon\n1600 N Wolfe St\n53\nMale\nBlack"
[6] "09/16/2022 9:43 a.m.\nDelanie McCloud\n100 Wilmott Court\n37\nMale\nBlack"

I've also tried selecting the HTML elements under ".recentvictims":

html %>% 
    html_element(".recentvictims") %>% 
    html_children()

which gives me:

[1] <div >\n <div >\n <b>Date & time\n </div>\n ...
[2] <div >\n <div >\n <a href="/victim/4597/">\n ...
[3] <div >\n <div >\n <a href="/victim/4595/">\n ...

I want to grab all the info in the elements with the classes "lfrow even" and "lfrow odd".

Any suggestions? Thank you

Answer:

To get your output into a data frame, I added as.data.frame() to the end of your first piece of code. That creates a data frame with a single column named "." in which each row holds all of its fields in one string, separated by line breaks (\n). I then used the tidyr function separate() to split that column into one column per field. For the column names, I used strsplit() to split the first row, which holds the headers, into a character vector. (strsplit() returns a list, so [[1]] extracts its first element, which is the required vector of column names.) Note that head() keeps only the first six rows.

library(rvest)
library(tidyr)
library(dplyr)

html <- read_html(x = "https://homicides.news.baltimoresun.com/recent/")

data <- html_node(html, ".recentvictims") %>% 
  html_children() %>% 
  head() %>%        # first six children only (header + five victims); remove to process all of them
  html_text2() %>% 
  as.data.frame()   # one column, named "."

want <- data %>%
  filter(row_number() > 1) %>% # first row has column names
  separate(col = ".", sep = "\\n", into = strsplit(data[1, 1], "\\n")[[1]])
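If you would rather avoid the extra packages, the same split can be sketched in base R. This is only a sketch: it works on a few strings copied from the output in the question instead of the live page, the names rows, parts and victims are made up for the example, and it assumes every row splits into the same number of fields as the header. Treating a non-numeric age such as the dash as missing is also an assumption about the data.

```r
# Sample rows copied from the output shown in the question (header row first).
rows <- c(
  "Date & time\nVictim name\nAddress\nAge\nGender\nRace",
  "09/26/2022 7:15 p.m.\n\n1900 Griffis Ave\n—\nMale\nUnknown",
  "09/21/2022 1:45 p.m.\nKelly Logan\n2100 Kloman St\n53\nFemale\nBlack"
)

# Split every string on the line break; fixed = TRUE matches "\n" literally.
# An empty field between two line breaks is kept as "".
parts <- strsplit(rows, "\n", fixed = TRUE)

# Stack the data rows (everything but the header) into a character matrix,
# convert it to a data frame, and take the column names from the header row.
victims <- as.data.frame(do.call(rbind, parts[-1]), stringsAsFactors = FALSE)
names(victims) <- parts[[1]]

# Assumption: any age that is not a number (such as the dash) means "unknown",
# so it is converted to NA.
victims$Age <- suppressWarnings(as.integer(victims$Age))

print(victims)
```

One caveat: strsplit() drops a trailing empty field, so a row whose last value is blank would come back one field short and would need padding before rbind().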
