I am trying to scrape data from https://homicides.news.baltimoresun.com/recent/ using rvest and put the information on victims into a data frame.
What I have so far is:
html <- read_html(x = "https://homicides.news.baltimoresun.com/recent/")
html_node(html, ".recentvictims") %>%
  html_children() %>%
  head() %>%
  html_text2()
which gives me a list of the information, but I can't find a way to put this into a data frame.
[1] "Date & time\nVictim name\nAddress\nAge\nGender\nRace"
[2] "09/26/2022 7:15 p.m.\n\n1900 Griffis Ave\n—\nMale\nUnknown"
[3] "09/21/2022 1:45 p.m.\nKelly Logan\n2100 Kloman St\n53\nFemale\nBlack"
[4] "09/20/2022 9:00 a.m.\nDelon Bushrod\n2800 Bookert Dr\n24\nMale\nBlack"
[5] "09/19/2022 8:06 p.m.\nTerry Gordon\n1600 N Wolfe St\n53\nMale\nBlack"
[6] "09/16/2022 9:43 a.m.\nDelanie McCloud\n100 Wilmott Court\n37\nMale\nBlack"
I've also tried selecting the HTML elements under ".recentvictims":
minimal_html(html) %>%
html_element(".recentvictims")
which gives me:
[1] <div >\n <div >\n <b>Date & time\n </div>\n ...
[2] <div >\n <div >\n <a href="/victim/4597/">\n ...
[3] <div >\n <div >\n <a href="/victim/4595/">\n ...
I want to grab all the info under classes "lfrow even" and "lfrow odd"
Any suggestions? Thank you
CodePudding user response:
To get your output into a data frame, I added as.data.frame() to your first piece of code. This creates a data frame with a single column named "." in which each row holds all of one victim's fields separated by line breaks (\n). I then used the tidyr function separate() to split that column into separate columns. To get the column names, I used strsplit() to split the first row of data into a character vector. (strsplit() returns a list, so [[1]] extracts its first element, which is the required vector of column names.)
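To make the [[1]] step concrete, here is a small base-R sketch that uses the header string from the question's output. The variable names header_text, pieces and col_names are illustrative only and do not appear in the original code.

```r
# Header row as returned by html_text2() in the question's output
header_text <- "Date & time\nVictim name\nAddress\nAge\nGender\nRace"

# strsplit() always returns a list, with one element per input string
pieces <- strsplit(header_text, "\n")

# [[1]] pulls out the character vector for the first (and only) input string
col_names <- pieces[[1]]
print(col_names)
```

Without [[1]], separate() would receive a list rather than the character vector it expects for its into argument.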
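library(rvest)
library(tidyr)
library(dplyr)

html <- read_html(x = "https://homicides.news.baltimoresun.com/recent/")

# Text of each row; as.data.frame() yields a single column named "."
data <- html_node(html, ".recentvictims") %>%
  html_children() %>%
  head() %>%  # keeps only the first six rows; drop this line to keep them all
  html_text2() %>%
  as.data.frame()

# Split the single column on line breaks, using the first row as column names
want <- data %>%
  filter(row_number() > 1) %>%  # first row has column names
  separate(col = ".", sep = "\\n", into = strsplit(data[1, 1], "\\n")[[1]])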
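If you want to try the splitting step without hitting the website, the following is a minimal offline sketch. It assumes the row text looks like the sample output in the question, and it stores the text in a column called raw (a hypothetical name chosen here to avoid the automatic "." column). It only needs tidyr.

```r
library(tidyr)

# Sample rows copied from the question's output (header row first)
rows <- c(
  "Date & time\nVictim name\nAddress\nAge\nGender\nRace",
  "09/21/2022 1:45 p.m.\nKelly Logan\n2100 Kloman St\n53\nFemale\nBlack",
  "09/20/2022 9:00 a.m.\nDelon Bushrod\n2800 Bookert Dr\n24\nMale\nBlack"
)

# "raw" is a hypothetical column name, used instead of the automatic "."
raw_df <- data.frame(raw = rows[-1], stringsAsFactors = FALSE)

# Column names come from the header row
col_names <- strsplit(rows[1], "\n")[[1]]

# Split each row on line breaks into one column per field
victims <- separate(raw_df, col = "raw", into = col_names, sep = "\n")
print(victims)
```

All resulting columns are character, including Age. Passing convert = TRUE to separate() asks tidyr to guess column types, though that is a design choice rather than something the original answer does.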