How to extract webpage data with a node and class in rvest

I am performing webscraping on a site and have been able to get basic data, but I now need to collect data from a more complicated part of the page.

I am using rvest to pull data from the AAA gas prices website:

https://gasprices.aaa.com/

I am now trying to pull county-level data, which is only displayed on the map (when you hover your cursor over an individual county). I need the gas prices for individual counties in different states. For example, if you click on Maine to go to the Maine page (https://gasprices.aaa.com/?state=ME), I need to scrape the price for Aroostook (the northernmost county on the map).

I have been able to use rvest to extract the data for the metro areas (lower on the page), using html_nodes and the node "td". However, the markup for the map is more complex. Instead of a simple "td" node, the developer tools (in Chrome) show <td >$4.928</td> on the line with the price ($4.928 is the current price in Aroostook as of the date of this post). I cannot work out how to select that element with rvest.

I have read that the class can be used, and others have proposed passing a CSS selector to rvest, but I am unfamiliar with how to do so. Pulling the metro-area numbers was straightforward, but the county-level prices embedded in the map do not seem as accessible.

Is there a way to extract this county-level data in R? Can the approach be repeated for all the counties and states I need? Do I need a CSS selector, and if so, how do I find it and write it properly for rvest?

CodePudding user response：

It looks like the information you are looking for is stored in the "index.php" file that gets downloaded when the web page loads. The current link for Maine is "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=89346&ver=5.9.3".
I am not sure what the r=89346 value is for; it may be a timestamp, a tracking ID, or a temporary token (to prevent web scraping). I suspect this URL will change, so you may need to use the browser's developer tools to obtain the current URL.
Also, map_id identifies the state, but I don't know the rationale behind the numbering: Florida is 1, NC is 35 and Maine is 21.
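To reuse the same request for several states, the URL can be assembled from its query parameters. The following is only a sketch: build_map_url and map_ids are hypothetical names, the lookup contains just the three map_id values mentioned above, and the r value still has to be copied from the browser because its meaning is unknown.

```r
# Hypothetical lookup: only the three map_id values reported above are known.
map_ids <- c(FL = 1, NC = 35, ME = 21)

# Hypothetical helper: build the data URL for one state.
# `r` is the changing query value copied from the browser's developer tools;
# `ver` defaults to the value seen in the URL above.
build_map_url <- function(state, r, ver = "5.9.3") {
  if (!state %in% names(map_ids)) {
    stop("No known map_id for state: ", state)
  }
  sprintf(
    "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=%d&r=%s&ver=%s",
    as.integer(map_ids[[state]]), r, ver
  )
}

maine_url <- build_map_url("ME", r = "89346")
```

Other states would need their map_id looked up in the developer tools and added to map_ids first.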

Download this file, then extract the JSON data and convert.

library(dplyr)

# Read the index.php response and collapse it into a single character string
# (the r= value changes over time; copy the current URL from the browser's developer tools)
index_php <- readLines("https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=19770&ver=5.9.3")
index_php <- paste(index_php, collapse = " ")

# Extract the JSON object that starts at {"st1": and ends at the first "}}", then parse it
jsondata <- stringr::str_extract(index_php, "\\{\"st1\":.*?\\}\\}")
data <- jsonlite::fromJSON(jsondata)

# Combine the per-county entries into one data frame
answer <- bind_rows(data)

      id name         shortname link  comment image color_map color_map_over
   <int> <chr>        <chr>     <chr> <chr>   <chr> <chr>     <chr>         
 1     1 Androscoggin ""        ""    $4.964  ""    #ca3338   #ca3338       
 2     2 Aroostook    ""        ""    $4.928  ""    #dd7a7a   #dd7a7a       
 3     3 Cumberland   ""        ""    $4.944  ""    #ca3338   #ca3338       
 4     4 Franklin     ""        ""    $4.936  ""    #dd7a7a   #dd7a7a       
 5     5 Hancock      ""        ""    $4.900  ""    #01b5da   #01b5da       
 6     6 Kennebec     ""        ""    $4.955  ""    #ca3338   #ca3338

There are some extra columns that need removing; I leave that as an exercise for the reader.
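One possible way to do that cleanup is sketched below, using base R only so that it runs without extra packages. It assumes the price always sits in the comment column as text with a leading "$", as in the output above. The answer data frame here is a two-row stand-in copied from that output, and county_prices is a hypothetical name.

```r
# Two rows copied from the output above, standing in for the real `answer`.
answer <- data.frame(
  id = 1:2,
  name = c("Androscoggin", "Aroostook"),
  shortname = "",
  link = "",
  comment = c("$4.964", "$4.928"),
  image = "",
  color_map = c("#ca3338", "#dd7a7a"),
  color_map_over = c("#ca3338", "#dd7a7a"),
  stringsAsFactors = FALSE
)

# Keep only the county name and the price, dropping the "$" and converting to numeric.
county_prices <- data.frame(
  county = answer$name,
  price = as.numeric(sub("$", "", answer$comment, fixed = TRUE)),
  stringsAsFactors = FALSE
)

# Look up a single county.
aroostook_price <- county_prices$price[county_prices$county == "Aroostook"]
```

With the real answer data frame, the first assignment is skipped and the remaining lines apply unchanged.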