For a project of mine I need to unshorten a long list of URLs (around 300k).
I attach the code below:
library(httr)    # packages used below (loaded in my main session)
library(stringr)

# This function extracts the URLs from a tweet and checks whether they
# redirect back to Twitter or to an external website.
exturl <- function(tweet){
  text_tw <- tweet
  locunshort2 <- NA # starting value is NA
  indtext <- which(substr(unlist(strsplit(text_tw, " ")), 1, 4) == "http")
  if (length(indtext) > 0){
    for (i in indtext){
      skip_to_next <- FALSE
      url <- unlist(strsplit(text_tw, " "))[i]
      tryCatch({
        r <- httr::HEAD(url, timeout(30)) # follow the link
        print(r$url)
        if (str_detect(r$url, "twitter") == FALSE) { # check whether the URL merely redirects back to Twitter
          locunshort2 <- TRUE # the resolved URL does not contain "twitter", so set it to TRUE
        }
      }, error = function(e) {
        skip_to_next <- TRUE
        locunshort2 <- 'Error' # set it to 'Error' if the URL could not be opened
      })
      if (skip_to_next) { next }
    }
  }
  if (is.na(locunshort2)){ # if the variable is neither TRUE nor 'Error', it is FALSE (i.e. all links redirected to Twitter)
    locunshort2 <- FALSE
  }
  return(locunshort2)
}
I tried to use the parLapply function as follows:
library(parallel)

numCores <- detectCores()
cl <- makeCluster(numCores)
df$url <- unlist(parLapply(cl, df$text, exturl))
stopCluster(cl)
The df$url column is just an initialised empty column, whereas df$text contains the tweets I am analysing (raw, unprocessed strings of text).
I have two issues with this code:

1. parLapply does not seem to work as intended: if I run it on my first 10 observations I only get FALSE values, whereas with a plain lapply I also get many TRUE values.

2. Even if the point above gets fixed, the process with parLapply is very slow. For a similar project in Python I managed to open multiple threads (I could process 1000 observations at a time through the ThreadPool function). Since I am forced to use R, I was wondering whether a similar alternative exists (I looked for one without success).
Thank you very much for your support.
P.S.: I am working on Windows.
UPDATE
tweet is supposed to be a string of text, usually with a URL inside it.
Example:
"Ukraine-Russia talks end and delegations will return home for consultations, officials say, as explosions are heard near Kyiv. Follow live updates:
https://cnn.it/3ps1tBJ"
The function exturl finds the position of the URL and unshortens it; if there are multiple URLs it unshortens them one at a time. The end goal is a single returned value: TRUE if a URL redirects towards an external website, FALSE if it just points at another tweet. (I am fairly confident about exturl itself, and it works with lapply as I said; however, with parLapply something stops working.)
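For instance, with the example tweet above I would expect something like this (assuming the cnn.it short link resolves to a cnn.com page rather than back to Twitter):

exturl("Ukraine-Russia talks end ... Follow live updates: https://cnn.it/3ps1tBJ")
# [1] "https://www.cnn.com/..."   (printed by the function)
# [1] TRUE                        (the resolved URL does not contain "twitter")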
The dataset itself has just one column, containing these strings.
UPDATE 2:
I tried a much simpler version of the exturl function that just extracts a list of unshortened URLs, and I even ran it on a data frame with a single observation (so parLapply is technically not even needed). No success; I am convinced something in parLapply is going wrong.
Below is the simplified exturl:
ext_url <- function(tweet){
  text_tw <- tweet
  full_url <- list() # initialize an empty list to hold the URLs
  indtext <- which(substr(unlist(strsplit(text_tw, " ")), 1, 4) == "http")
  if (length(indtext) > 0){
    for (i in indtext){
      skip_to_next <- FALSE
      url <- unlist(strsplit(text_tw, " "))[i]
      tryCatch({
        r <- httr::HEAD(url, timeout(5)) # follow the link
        full_url <- append(full_url, r$url)
      }, error = function(e) {
        skip_to_next <- TRUE
        full_url <- append(full_url, 'Error')
      })
      if (skip_to_next) { next }
    }
  }
  return(full_url)
}
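One thing I have not been able to rule out is whether the worker processes actually have httr and stringr attached: the parLapply workers are fresh R sessions, so packages loaded in my main session are not automatically available there. A minimal sketch of attaching them on every worker (I have not verified whether this changes the behaviour):

library(parallel)

cl <- makeCluster(detectCores())
clusterEvalQ(cl, { library(httr); library(stringr) }) # attach the packages on every worker
urls <- parLapply(cl, df$text, ext_url)               # list of resolved URLs per tweet
stopCluster(cl)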
CodePudding user response:
Not sure if this is an R problem: doesn't your code throw 300k HTTP requests, unthrottled(?), at the receiving end, from multiple workers (parallelised version) but a single IP? If so, the server might deny what it perhaps only just accepts from the slower lapply version.
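If that is the cause, a rough sketch of a throttled run (the one-second pause is an arbitrary guess, not a documented limit of any service):

exturl_throttled <- function(tweet){
  Sys.sleep(1) # pause before each tweet to spread the requests out
  exturl(tweet)
}
df$url <- unlist(lapply(df$text, exturl_throttled))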
CodePudding user response:
I sort of solved the problem by switching to the foreach package:
df$url <- foreach(d = iter(df$text, by = 'row'), .combine = rbind,
                  .packages = c("dplyr", "stringr", "httr")) %dopar% {
  exturl(d)
}
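For completeness, %dopar% needs a parallel backend registered beforehand; with doParallel the setup would look something like this:

library(foreach)
library(iterators)
library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl) # make the cluster the %dopar% backend
# ... run the foreach loop above ...
stopCluster(cl)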
It is not as fast as what I could achieve with Python, yet it is much faster than a plain lapply. It remains to be understood why parLapply gave me such problems, but as far as I am concerned, problem solved!