I have this loop in R that is scraping Reddit comments from an API incrementally on an hourly basis (e.g. all comments containing a certain keyword between now and 1 hour ago, 1 hour ago and 2 hours ago, 2 hours ago and 3 hours ago, etc.):
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000)
{tryCatch({
{
url_i <- paste0(part1, i + 1, part2, i, part3)
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i)))
}
}, error = function(e){})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
In this loop, I added a line of code that prints which iteration the loop is currently on, and the cumulative number of results that the loop has scraped as of that iteration. I also wrapped the body in tryCatch so that, in the worst case, the loop simply skips any iteration that produces an error - however, I was not anticipating that happening very often.
However, I am noticing that this loop is producing errors and skipping iterations far more often than I had expected (in the example below, the left column is the iteration number and the right column is the cumulative result count).
My guess was that between certain times the API simply might not have recorded any comments containing the keyword, and thus nothing new would be added to the list, e.g.:
  iteration_number cumulative_results
1            13432               5673
2            13433               5673
3            13434               5673
But in my case, can someone please help me understand why this loop is producing so many errors and skipping so many iterations?
Thank you!
CodePudding user response:
Your issue is almost certainly caused by rate limits imposed on your requests by the Pushshift API. When you scrape like this, the server may track how many requests a client has made within a certain time interval and choose to return an error code (HTTP 429) instead of the requested data. This is called rate limiting and is a way for web sites to limit abuse, charge customers for usage, or both.
Per this discussion about Pushshift on Reddit, it does look like Pushshift imposes a rate limit of 120 requests per minute (see also their /meta endpoint).
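As a quick sanity check, you can query the /meta endpoint yourself from R. A minimal sketch follows; the field name server_ratelimit_per_minute is an assumption, so inspect the returned list if it differs:
library(jsonlite)
# Query Pushshift's /meta endpoint to see the current server-side limits.
meta <- fromJSON("https://api.pushshift.io/meta")
# The field name below is an assumption; run str(meta) to see what is actually returned.
meta$server_ratelimit_per_minute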
I was able to confirm that a script like yours will run into rate limiting by changing this line and re-running your script:
}, error = function(e){})
to:
}, error = function(e){ message(e) })
After a while, I got output like:
HTTP error 429
The solution is to slow yourself down in order to stay within this limit. A straightforward way to do this is to add a call to Sys.sleep(1) inside your for loop, where 1 is the number of seconds to pause execution.
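If the 120-requests-per-minute figure is accurate, you can work out the minimum spacing between requests and see that a 1-second pause leaves comfortable headroom (a back-of-the-envelope sketch):
# Assuming the reported limit of ~120 requests per minute, the minimum
# average spacing between requests is:
requests_per_minute <- 120
min_delay_seconds <- 60 / requests_per_minute   # 0.5 seconds
# A Sys.sleep(1) pause caps you at 60 requests per minute, safely under the limit.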
I modified your script as follows:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000)
{tryCatch({
{
Sys.sleep(1) # Changed. Change the value as needed.
url_i <- paste0(part1, i + 1, part2, i, part3)
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i)))
}
}, error = function(e){
message(e) # Changed. Prints to the console on error.
})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Note that you may have to use a value larger than 1, and your script will of course take longer to run.
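If you would rather back off only when you are actually being throttled, another option is to make the request with the httr package and inspect the status code. Here is a rough sketch, not a drop-in solution: httr is assumed to be installed, the helper name fetch_page is made up, and Pushshift may or may not send a Retry-After header.
library(httr)
library(jsonlite)
# Hypothetical helper: fetch one page of results, backing off once on HTTP 429.
fetch_page <- function(url) {
  resp <- GET(url)
  if (status_code(resp) == 429) {
    # Use the Retry-After header if the server provides one, otherwise wait 5 seconds.
    wait <- headers(resp)[["retry-after"]]
    wait <- if (is.null(wait)) 5 else suppressWarnings(as.numeric(wait))
    if (is.na(wait)) wait <- 5
    Sys.sleep(wait)
    resp <- GET(url)
  }
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}
You could then call fetch_page(url_i) in place of fromJSON(url_i) inside the loop.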
CodePudding user response:
As the error seems to occur randomly, depending on API availability, you could retry failed requests, with a maxattempt limit:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
maxattempt <- 3
for (i in 1:1000)
{
attempt <- 1
r_i <- NULL
while( is.null(r_i) && attempt <= maxattempt ) {
if (attempt > 1) { print(paste("retry: i =", i)) }
attempt <- attempt + 1
url_i <- paste0(part1, i + 1, part2, i, part3)
try({
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i))) })
}
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Result:
[1] 1 250
[1] 2 498
[1] 3 748
[1] 4 997
[1] 5 1247
[1] 6 1497
[1] 7 1747
[1] 8 1997
[1] 9 2247
[1] 10 2497
...
[1] 416 101527
[1] 417 101776
Error in open.connection(con, "rb") :
Timeout was reached: [api.pushshift.io] Operation timed out after 21678399 milliseconds with 0 out of 0 bytes received
[1] "retry: i = 418"
[1] 418 102026
[1] 419 102276
...
In the above example, a retry occurred for i = 418.
Note that for i in 1:500, the results already take about 250 MB, meaning that you could expect roughly 25 GB for i in 1:50000: do you have enough RAM?
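If memory turns out to be a problem, one option is to flush the accumulated results to disk in chunks instead of keeping everything in one list. A rough sketch follows; the chunk size of 1000 and the file naming scheme are arbitrary choices:
# Sketch: write results to disk every 1000 iterations instead of holding
# everything in memory.
results <- list()
chunk_size <- 1000
for (i in 1:50000) {
  # ... fetch r_i as in the loop above, then append it:
  # results[[length(results) + 1]] <- r_i
  if (i %% chunk_size == 0 && length(results) > 0) {
    chunk <- do.call(rbind.data.frame, results)
    saveRDS(chunk, sprintf("chunk_%05d.RDS", i %/% chunk_size))
    results <- list()  # free the memory used by this chunk
  }
}
# Afterwards, read the chunk files back one at a time with readRDS()
# and combine or process them as needed.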