Multithreading in R


For a project of mine I need to unshorten a long list of URLs (300k of them).

My code is attached below:

library(httr)    # HEAD(), timeout()
library(stringr) # str_detect()

# This function extracts every URL from a tweet and checks whether it redirects
# back to Twitter or to an external website.
exturl <- function(tweet){
  text_tw <- tweet
  locunshort2 <- NA # starting value is NA
  words <- unlist(strsplit(text_tw, " "))
  indtext <- which(substr(words, 1, 4) == "http")
  if (length(indtext) > 0){
    for (i in indtext){
      skip_to_next <- FALSE
      url <- words[i]
      tryCatch({
        r <- httr::HEAD(url, timeout(30)) # follow the link
        print(r$url)
        if (!str_detect(r$url, "twitter")) { # check whether the URL merely redirects back to Twitter
          locunshort2 <- TRUE # the link points away from Twitter
        }
      }, error = function(e) {
        # <<- so the assignments survive outside the handler's own environment
        skip_to_next <<- TRUE
        locunshort2 <<- 'Error' # we could not open the URL
      })

      if (skip_to_next) { next }
    }
  }
  if (is.na(locunshort2)){ # neither TRUE nor 'Error': every link redirected back to Twitter
    locunshort2 <- FALSE
  }
  return(locunshort2)
}

I tried to use the parLapply function as follows:

library(parallel) # detectCores(), makeCluster(), parLapply()

numCores <- detectCores()
cl <- makeCluster(numCores)
df$url <- unlist(parLapply(cl, df$text, exturl))
stopCluster(cl)

The df$url column is just an initialized empty column, whereas df$text contains the tweets I am analyzing (raw, unprocessed strings of text).

I have two issues with this code:

  1. parLapply does not seem to work as intended: if I run it on my first 10 observations I only get FALSE values, whereas with a standard lapply I also get many TRUE values.

  2. Even if we manage to fix the point above, the process with parLapply is very slow. For a similar project in Python I managed to open multiple threads (I could process 1000 observations at a time through the ThreadPool function). Since I am forced to use R, I was wondering whether a similar alternative exists (I looked for one without success).

Thank you very much for your support.

P.S.: I am working on Windows

UPDATE

tweet is supposed to be a string of text, usually with a URL somewhere inside it.

Example:

"Ukraine-Russia talks end and delegations will return home for consultations, officials say, as explosions are heard near Kyiv. Follow live updates:
https://cnn.it/3ps1tBJ"

The function exturl finds the position of the URL and unshortens it; if there are multiple URLs it unshortens them one at a time. The end goal is a single return value: TRUE if a URL redirects towards an external website, FALSE if it only points to another tweet. (I am fairly confident exturl itself works, and it does work with lapply as I said, but with parLapply something breaks.)
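
For instance, on the example above the call would look roughly like this (expecting TRUE, assuming the cnn.it short link resolves to cnn.com, i.e. outside Twitter):

exturl("Follow live updates: https://cnn.it/3ps1tBJ")
# expected: TRUE, since the unshortened URL does not contain "twitter"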

The dataset itself has just one column, containing these strings.

UPDATE 2:

I tried to run a much simpler version of the exturl function that just extracts a list of unshortened URLs, and I even ran it on a data frame with just one observation (so technically parLapply isn't even needed). No success; I am convinced something in parLapply is going wrong.

Below is the simplified function, ext_url:

ext_url <- function(tweet){
  text_tw <- tweet
  full_url <- list() # initialize an empty list to hold the unshortened URLs
  words <- unlist(strsplit(text_tw, " "))
  indtext <- which(substr(words, 1, 4) == "http")
  if (length(indtext) > 0){
    for (i in indtext){
      skip_to_next <- FALSE
      url <- words[i]
      tryCatch({
        r <- httr::HEAD(url, timeout(5)) # follow the link
        full_url <- append(full_url, r$url)
      }, error = function(e) {
        # <<- so the assignments survive outside the handler's own environment
        skip_to_next <<- TRUE
        full_url <<- append(full_url, 'Error')
      })

      if (skip_to_next) { next }
    }
  }
  return(full_url)
}
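
One thing I have not been able to rule out: as far as I understand, each worker started by makeCluster is a fresh R session, so the packages exturl relies on (httr, stringr) would also have to be loaded on every worker. A minimal sketch of that setup, assuming clusterEvalQ is the right tool for it:

library(parallel)

cl <- makeCluster(detectCores())

# load the packages used inside exturl on every worker session
clusterEvalQ(cl, {
  library(httr)
  library(stringr)
})

df$url <- unlist(parLapply(cl, df$text, exturl))
stopCluster(cl)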

CodePudding user response:

Not sure this is an R problem: doesn't your code fire 300k HTTP requests, unthrottled, at the receiving end, from multiple threads (in the parallelized version) but a single IP? If so, the server might deny what it only just accepts from the slower (lapply) version.
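
If so, a crude way to test that hypothesis would be to pace the requests, e.g. with a short pause before each tweet is processed. A hypothetical wrapper, only to illustrate the idea:

# hypothetical wrapper: sleep briefly before each tweet to stay under a rate limit
exturl_throttled <- function(tweet, delay = 0.2) {
  Sys.sleep(delay)
  exturl(tweet)
}

df$url <- unlist(lapply(df$text, exturl_throttled))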

CodePudding user response:

I sort of solved the problem by switching to the foreach package:

df$url <- foreach(d = iter(df$text, by = 'row'), .combine = rbind,
                  .packages = c("dplyr", "stringr", "httr")) %dopar% {
  exturl(d)
}
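
Note that %dopar% only runs in parallel once a backend has been registered; otherwise foreach falls back to sequential execution with a warning. A typical setup with doParallel (assuming that is the backend used) looks like this:

library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# ... run the foreach loop above ...

stopCluster(cl)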

It is not as fast as what I could achieve in Python, but it is much faster than a simple lapply. Why parLapply gave me such problems remains to be understood; however, as far as I am concerned, problem solved!
