Issue extracting Billboard data from Billboard Charts in R

This is probably quite easy if you know HTML (I do not). This website shows how to extract data from billboard.com for a specific week. For example:


hot100page <- 'https://www.billboard.com/charts/hot-100/2022-03-19/'

hot100 <- xml2::read_html(hot100page)

# Extract rank of the song
rank <- hot100 %>% 
  rvest::html_nodes('body') %>% 
  xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>% 
  rvest::html_text()

When I run this, the rank is NULL and the list with hot100 does not have any information. Can you please assist?

CodePudding user response：

We can get the rank with rvest alone by selecting on CSS classes:

library(rvest)
hot100page <- 'https://www.billboard.com/charts/hot-100/2022-03-19/'
hot100 <- rvest::read_html(hot100page)
hot100 %>%
  html_nodes('.o-chart-results-list-row-container') %>%
  html_nodes('.a-font-primary-bold-l') %>%
  html_text2()
  [1] "1"   "1"   "1"   "60"  "1"   "1"   "60"  "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19" 
 [26] "20"  "21"  "22"  "23"  "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44" 
 [51] "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69" 
 [76] "70"  "71"  "72"  "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94" 
[101] "95"  "96"  "97"  "98"  "99"  "100"
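
The output above has 106 values rather than 100: six extra strings that share the same class appear before "2". One way to keep a single value per row is to call html_element() (singular) on the row set, which returns only the first match inside each row. Below is a minimal offline sketch on a made-up HTML fragment. The markup and the class names row and num are assumptions for illustration, not Billboard's actual page, so whether the first match in a real row is the rank would need to be checked against the live HTML.

```r
library(rvest)

# Hypothetical markup: the first row carries extra stats in the same class as the rank
page <- minimal_html('
  <div class="row"><span class="num">1</span><span class="num">1</span><span class="num">60</span></div>
  <div class="row"><span class="num">2</span></div>
  <div class="row"><span class="num">3</span></div>
')

rows <- html_elements(page, ".row")

# html_elements() returns every match in every row, so the extra stats leak in
all_values <- rows %>% html_elements(".num") %>% html_text2()

# html_element() returns exactly one node per row: the first match
first_values <- rows %>% html_element(".num") %>% html_text2()
```

Here all_values has five entries while first_values has three, one per row.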

CodePudding user response：

I would build a tibble from the rows, pairing each title with its rank, as follows. Your XPath doesn't appear to match anything on the page, and the class it specifies does not appear to exist in the returned HTML. You can cleanly extract the rank from the data-detail-target attribute of each listing row; the title can come from the elements whose multi-valued class attribute includes c-title:

library(dplyr)
library(rvest)

hot100page <- "https://www.billboard.com/charts/hot-100/2022-03-19/"  
hot100 <- read_html(hot100page) 
rows <- hot100 %>% html_elements(".chart-results-list .o-chart-results-list-row")

result <- tibble(
  rank = rows %>% html_attr("data-detail-target") %>% as.integer(),
  title = rows %>% html_elements(".c-title") %>% html_text(trim = TRUE)
)
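
Two points from this answer can be checked offline: a selector that matches nothing returns an empty node set rather than an error, and the tibble only builds if both columns have the same length. Below is a rough sketch on a made-up fragment. The markup and the class name row are assumptions for illustration; only c-title, data-detail-target and the XPath are taken from the thread.

```r
library(rvest)
library(tibble)

# Hypothetical markup loosely modelled on the selectors used above
page <- minimal_html('
  <ul>
    <li class="row" data-detail-target="1"><h3 class="c-title"> Song A </h3></li>
    <li class="row" data-detail-target="2"><h3 class="c-title"> Song B </h3></li>
  </ul>
')

# The XPath from the question matches nothing here, giving an empty node set
no_match <- html_elements(
  page,
  xpath = "//span[contains(@class, 'chart-element__rank__number')]"
)
n_matches <- length(no_match)

rows <- html_elements(page, ".row")

# html_element() (singular) yields one title per row, so the columns stay aligned
result <- tibble(
  rank  = rows %>% html_attr("data-detail-target") %>% as.integer(),
  title = rows %>% html_element(".c-title") %>% html_text(trim = TRUE)
)
```

Checking length() on a node set before extracting text is a quick way to see whether a selector matches the HTML that was actually returned.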