How to extract webpage data with a node and class in rvest

Time:07-08

I am performing webscraping on a site and have been able to get basic data, but I now need to collect data from a more complicated part of the page.

I am using rvest to pull data from the AAA gas prices website:

https://gasprices.aaa.com/

I am now trying to pull county-level data, which is only displayed on the map (if you hover your cursor over an individual county). I need to get the gas prices for individual counties in different states. For example, if you click on Maine to go to the Maine page (https://gasprices.aaa.com/?state=ME), I need to scrape the price for Aroostook (the northernmost county on the map).

I have been able to use rvest to extract the data for the metro areas (lower on the page), using html_nodes and the node "td". However, the markup for the map is more complex. Instead of a simple "td" node, the developer tools (in Chrome) give <td >$4.928</td> on the line with the price ($4.928 is the current price in Aroostook as of the date of this post). I cannot work out how to target that element with rvest to extract it.

I have read that the class can be used, and others have proposed using a CSS selector to designate the element within rvest, but I am unfamiliar with how to do so. Pulling the metro-area numbers was straightforward; however, the county-level prices embedded within the map do not seem as accessible.

Is there a way to extract this county-level data so that I can scrape it in R? Can the approach then be repeated for all the counties and states I need? Do I need the CSS selector, and if so, how do I find it and write it properly for rvest to use?

CodePudding user response:

It looks like the information you are looking for is stored in the "index.php" file that gets downloaded when the web page loads. The current link for Maine is "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=89346&ver=5.9.3".
I am not sure what the r=89346 value is for; it may be a timestamp, a tracking id, or a temporary token (to prevent web scraping). I suspect this URL will change, so you may need to use the browser's developer tools to obtain the current URL.
Also, map_id refers to the state, but I don't know the rationale behind the numbering: Florida is 1, NC is 35 and Maine is 21.
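
To make the map_id pattern concrete, here is a minimal sketch of a helper that assembles the URL for a given state. Note that build_map_url is a hypothetical name, the three ids are only the ones quoted above, and the r and ver defaults are simply copied from the Maine URL, so they are assumptions that may well be stale.

```r
# Hypothetical helper: assemble the data URL for one state's map.
# The r and ver defaults are copied from the Maine URL quoted above and
# are assumptions; check the browser's developer tools for current values.
build_map_url <- function(map_id, r = 89346, ver = "5.9.3") {
  sprintf(
    "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=%d&r=%d&ver=%s",
    map_id, r, ver
  )
}

# The only state ids known from this answer
state_ids <- c(FL = 1L, ME = 21L, NC = 35L)

# One URL per state, named by state abbreviation
urls <- vapply(state_ids, build_map_url, character(1))
urls[["ME"]]
```

Each element of urls could then be passed through the same read, extract and convert steps, provided the r value is still accepted by the site.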

Download this file, then extract the JSON data and convert it to a data frame.

library(dplyr)

# Read the index.php response and collapse it into a single character string
index_php <- readLines("https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=19770&ver=5.9.3")
index_php <- paste(index_php, collapse = " ")

# Extract the JSON object that starts at "st1" and convert it to a list
jsondata <- stringr::str_extract(index_php, "\\{\"st1\":.+?\\}\\}")
data <- jsonlite::fromJSON(jsondata)

# Combine the per-county entries into one data frame
answer <- bind_rows(data)

      id name         shortname link  comment image color_map color_map_over
   <int> <chr>        <chr>     <chr> <chr>   <chr> <chr>     <chr>         
 1     1 Androscoggin ""        ""    $4.964  ""    #ca3338   #ca3338       
 2     2 Aroostook    ""        ""    $4.928  ""    #dd7a7a   #dd7a7a       
 3     3 Cumberland   ""        ""    $4.944  ""    #ca3338   #ca3338       
 4     4 Franklin     ""        ""    $4.936  ""    #dd7a7a   #dd7a7a       
 5     5 Hancock      ""        ""    $4.900  ""    #01b5da   #01b5da       
 6     6 Kennebec     ""        ""    $4.955  ""    #ca3338   #ca3338 

There are some extra columns that need to be removed, which I leave as an exercise for the reader.
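
For that cleanup step, one possible approach is sketched below in base R. To keep it runnable without a network call, it uses a small hand-built stand-in for answer (three rows copied from the output above); the prices data frame and its column names are illustrative choices, not part of the original answer.

```r
# Stand-in for `answer`, using three rows copied from the output above
answer <- data.frame(
  id = 1:3,
  name = c("Androscoggin", "Aroostook", "Cumberland"),
  shortname = "",
  link = "",
  comment = c("$4.964", "$4.928", "$4.944"),
  image = "",
  color_map = c("#ca3338", "#dd7a7a", "#ca3338"),
  color_map_over = c("#ca3338", "#dd7a7a", "#ca3338"),
  stringsAsFactors = FALSE
)

# Keep only the county name and the price; the price is stored in the
# `comment` column as text with a leading dollar sign, so strip the sign
# and convert to numeric
prices <- data.frame(
  county = answer$name,
  price = as.numeric(sub("^\\$", "", answer$comment)),
  stringsAsFactors = FALSE
)

# Look up a single county
prices$price[prices$county == "Aroostook"]
```

The same two-column shape makes it easy to stack results from several states later, for example by adding a state column before combining them.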
