I am scraping OpenFDA (https://open.fda.gov/apis). I know my particular query has 6974 hits, organized into 100 hits per page (the API's maximum download per request). I am trying to use R (rvest, jsonlite, purrr, tidyverse, httr) to download all of this data.
I checked the API with curl in the terminal and downloaded a couple of pages to see the pattern.
I've tried a few lines of code, but I can only get 100 entries to download at a time. The code below seems to work decently, but it only pulls 100 entries, i.e. one page. To skip past the first 100 (which I can pull down separately and merge later), here is the code I have used:
# Request a single page of results (limit=100) from the drug/label endpoint
url_json <- "https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=100&skip=6973"
raw_json <- httr::GET(url_json, httr::accept_json())
data <- httr::content(raw_json, "text")
my_content_from_json <- jsonlite::fromJSON(data)
dplyr::glimpse(my_content_from_json)
dataframe1 <- my_content_from_json$results   # the records themselves live under $results
tibble::view(dataframe1)
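As a side note, the total hit count can also be read from the response itself rather than hard-coded; this is a rough sketch assuming the response carries the usual meta$results$total field:

# Rough sketch: read the total hit count from the response metadata
# (assumes the usual meta$results$total field is present)
total_hits <- my_content_from_json$meta$results$total
page_size  <- 100
n_pages    <- ceiling(total_hits / page_size)
n_pages    # number of skip/limit requests needed to cover all hits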
SOLUTION below in the responses. Thanks!
CodePudding user response:
From the comments:
It looks like the API parameters skip and limit work better than the search_after parameter. They allow pulling down 1,000 entries at a time, according to the documentation (open.fda.gov/apis/query-parameters). To supply these parameters in the query string, an example URL would be

https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=1000&skip=0

after which you can loop to get the remaining entries with skip=1000, skip=2000, etc., as you've done above.
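For completeness, here is a minimal looping sketch along those lines. It assumes httr, jsonlite, purrr and dplyr are installed, that the response reports the overall hit count in meta$results$total, and uses the same placeholder API key as above; treat it as a starting point rather than a finished script.

library(httr)
library(jsonlite)
library(purrr)

base_url  <- "https://api.fda.gov/drug/label.json"
page_size <- 1000   # per the query-parameters documentation

# Fetch one page of results starting at a given offset
fetch_page <- function(skip) {
  resp <- GET(base_url,
              query = list(api_key = "YOULLHAVETOGETAKEY",
                           search  = "grapefruit",
                           limit   = page_size,
                           skip    = skip),
              accept_json())
  stop_for_status(resp)
  Sys.sleep(0.25)  # small pause to stay well within the API's rate limits
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

first_page <- fetch_page(0)
total_hits <- first_page$meta$results$total           # e.g. 6974
skips      <- seq(0, total_hits - 1, by = page_size)  # 0, 1000, 2000, ...

# One data frame of records per request; the label records are heavily nested,
# so combine them afterwards as suits your analysis, e.g. dplyr::bind_rows(pages)
pages <- map(skips, ~ fetch_page(.x)$results)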