I am performing webscraping on a site and have been able to get basic data, but I now need to collect data from a more complicated part of the page.
I am using rvest to pull data from the AAA gas prices website:
I am now trying to pull county-level data, which is only displayed on the map (if you hover your cursor over an individual county). I need to get the gas prices for individual counties in different states. For example, if you click on Maine to go to the Maine page (https://gasprices.aaa.com/?state=ME), I need to scrape the price for Aroostook (the northernmost county on the map).
I have been able to use rvest to extract the data for the metro areas (lower on the page), using html_nodes
and the node "td". However, the code for the map is more complex. Instead of a simple "td" node, the developer tools (in Chrome) show <td >$4.928</td>
on the line with the price ($4.928 is the current price in Aroostook, as of the date of this post). I cannot work out how to identify that element with rvest in order to extract it.
I have read that the class can be used, and others have proposed using a CSS selector to designate it within rvest, but I am unfamiliar with how to do so. Pulling the metro-area numbers was straightforward, but the county-level prices embedded within the map do not seem as accessible.
Is there a way to extract this county-level data so that I can scrape it in R? And can this then be repeated for all the counties/states from which I must select? Do I need a CSS selector, and if so, how do I find it and write it properly for rvest to use?
Answer:
It looks like the information you are looking for is stored in the "index.php" file that gets downloaded when the web page loads.
The current link for Maine is "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=89346&ver=5.9.3".
I am not sure what the r=89346 value is for; it may be a timestamp, a tracking ID, a temporary token (to prevent web scraping), etc. I suspect this URL will change, so you may need to use the browser's developer tools to obtain the current URL.
Also, map_id refers to the state, but I don't know the rationale: Florida is 1, NC is 35 and Maine is 21.
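Since only map_id seems to differ between states, the URL can be assembled from its parts. This is a rough sketch: build_map_url is a hypothetical helper name, the three IDs are just the ones I happened to look up, and the defaults for r and ver are copied from the link above and may well be stale by the time you run it.

```r
# Hypothetical helper: assemble the map-data URL for one state.
# The r and ver defaults come from the Maine link above and are assumptions;
# check the browser's developer tools for the current values.
build_map_url <- function(map_id, r = 89346, ver = "5.9.3") {
  sprintf(
    "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=%d&r=%d&ver=%s",
    as.integer(map_id), as.integer(r), ver
  )
}

# The only state IDs I know; the others would have to be looked up the same way
state_ids <- c(FL = 1, ME = 21, NC = 35)

# Named character vector with one URL per state
urls <- vapply(state_ids, build_map_url, character(1))
```

Each element of urls could then be passed to readLines() in place of the hard-coded link, assuming the server accepts the same r value for every state, which I have not verified.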
Download this file, then extract the JSON data and convert.
library(dplyr)
# read the index.php response and collapse it into a single character string
# (the r= value in this URL may change)
index_php <- readLines("https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=19770&ver=5.9.3")
index_php <- paste(index_php, collapse = " ")
# extract the JSON object that starts at "st1" (lazy match up to the first "}}")
jsondata <- stringr::str_extract(index_php, "\\{\"st1\":.+?\\}\\}")
# parse the JSON into a named list with one element per county
data <- jsonlite::fromJSON(jsondata)
# create a data frame with one row per county
answer <- bind_rows(data)
     id name         shortname link  comment image color_map color_map_over
  <int> <chr>        <chr>     <chr> <chr>   <chr> <chr>     <chr>
1     1 Androscoggin ""        ""    $4.964  ""    #ca3338   #ca3338
2     2 Aroostook    ""        ""    $4.928  ""    #dd7a7a   #dd7a7a
3     3 Cumberland   ""        ""    $4.944  ""    #ca3338   #ca3338
4     4 Franklin     ""        ""    $4.936  ""    #dd7a7a   #dd7a7a
5     5 Hancock      ""        ""    $4.900  ""    #01b5da   #01b5da
6     6 Kennebec     ""        ""    $4.955  ""    #ca3338   #ca3338
There are some extra columns that need removing (the price itself is in the comment column), which I leave as an exercise for the reader.
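For what it's worth, here is one possible way to do that cleanup, shown on a small inline string so it runs without hitting the site. The sample_js text is made up to mimic the shape of the real response (which has more fields per county), and the county and price column names are my own choice.

```r
library(dplyr)

# Made-up sample shaped like the real response; an assumption for illustration only
sample_js <- paste0(
  'var map_data = {"st1":{"id":1,"name":"Androscoggin","comment":"$4.964"},',
  '"st2":{"id":2,"name":"Aroostook","comment":"$4.928"}}; var other = 1;'
)

# Same extraction as above: lazy match from {"st1": up to the first "}}"
jsondata <- stringr::str_extract(sample_js, "\\{\"st1\":.+?\\}\\}")
counties <- bind_rows(jsonlite::fromJSON(jsondata))

# Keep only the county name and the price, stripping the leading "$"
prices <- counties %>%
  transmute(
    county = name,
    price  = as.numeric(sub("^\\$", "", comment))
  )
```

With the real data, the same transmute() call applied to answer should leave just the two useful columns, assuming every comment value is a dollar amount like the ones shown.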