Home > other >  Correctly Using "ClusterEval" in R?
Correctly Using "ClusterEval" in R?

Time:12-16

I am working with the R programming language.

I have this dataset that records exam results ( 1 = pass, 0 = fail) for a set of students at different times:

library(data.table)
library(doParallel)

# Generate some sample data
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)

# Create a data frame from the sample data
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

# Generate some additional columns for each record
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

# Convert the data frame to a data.table
my_data = setDT(my_data)

# Create a cluster with 4 workers
cl = makeCluster(4)

I have this function that tracks the number of times each student failed an exam given the student failed the previous exam, passed an exam given that the student passed the previous exam, passed an exam given that the student failed the previous exam and failed an exam given that the student passed the previous exam. Here is the function:

my_function <- function(i) {
    # Use tryCatch to handle the case where there are no rows in the start_i data frame
    tryCatch({
        start_i = my_data[my_data$id == i,]
        pairs_i =  data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
        frame_i =  as.data.frame(table(pairs_i))
        frame_i$i = i
        return(frame_i)
    }, error = function(err) {
        # Return an empty data frame if there are no rows in the start_i data frame
        return(data.frame())
    })
}

Now, I would like to try and run this function on my data in parallel - that is, I would like to assign data belonging to different students to different cores within my computer, in an effort to accelerate the time required to perform this function. Here is my attempt:

# Export the data frames and the my_function to the workers on the cluster
clusterExport(cl, c("my_data", "my_function", "data.table"))

# Assign each worker a different subset of the data to work on
clusterSetRNGStream(cl)
n = nrow(my_data)
chunks = rep(1:4, each = n / 4)
my_data = my_data[chunks == 1,]

# Evaluate the code on the cluster (final_out is the final result)
final_out = parLapply(cl, unique(my_data$id), my_function)

# alternate version
final_out = clusterApply(cl, unique(my_data$id), my_function)


# Stop the cluster when finished
stopCluster(cl)

The code seems to have run without errors - but I am not sure if I have done everything correctly.

Can someone please comment on this?

Thanks!

CodePudding user response:

So far as I can tell, the approach you've taken does what you expect. I am doubtful that the cluster is giving you any real speed improvement over other alternative methods. For example, if you use a dplyr pipeline, you could do it pretty easily:

out <- my_data %>% 
  arrange(id, exam_number) %>% 
  group_by(id) %>% 
  mutate(prev_exam = lag(results)) %>% 
  group_by(id, results, prev_exam) %>% 
  tally() %>% 
  na.omit()

On my machine, macOS 12.6, 3.6 GHz intel i9, 128GB RAM, the dplyr pipeline is about 3.5 times faster than the parallel approach. As @jblood94 said in his comment, the considerable resources in communication make the cluster solution pretty inefficient. Maybe there is an even better datatable solution.

  • Related