R: Comparing Cumulative Averages


I am working with the R programming language.

I have a dataset of students (id) who took an exam (var_1: 1 = pass, 0 = fail) in different years:

library(dplyr)

my_data = data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 5, 5),
  year  = c(2010, 2011, 2012, 2013, 2011, 2012, 2013, 2013,
            2014, 2015, 2015, 2016, 2017, 2018, 2019),
  var_1 = c(1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1)
)

My Question: I want to create a dataset that contains each student's cumulative average exam score up to (but not including) their most recent exam, along with the result of that most recent exam (e.g. so that in the future we can try to predict the result of the next exam from the cumulative average).

I wrote the following R code to do this (in a very inefficient way):

# create a grouped rank variable (per ID) to track the order of exams 
my_data = my_data %>% arrange(id, year) %>%
    group_by(id) %>% 
    mutate(rank = rank(-year))

# separate the most recent exams (int_file_1) and all other exams (int_file_2) 
int_file_1 = my_data[my_data$rank == 1,]
int_file_2 = my_data[my_data$rank != 1,]

# for each student, calculate cumulative averages
int_file_3 = int_file_2 %>% group_by(id) %>% mutate(Cumulative_Mean = cummean(var_1))


# for each student, select the row corresponding to last exam (this is actually second last exam in whole dataset)
int_file_4 = int_file_3 %>% 
   group_by(id) %>% 
   slice_min(order_by = rank)


# join previous files together
join = merge(x = int_file_4, y = int_file_1, by = "id", all.x = TRUE)

# drop unnecessary columns and rename columns
colnames(join)[1] <- "id"
colnames(join)[2] <- "second_most_recent_year"
colnames(join)[3] <- "second_most_recent_exam"
colnames(join)[5] <- "cumulative_exam_score_up_to_second_most_recent_exam"
colnames(join)[7] <- "most_recent_exam_score"

join = join[,c(1,2,3,5,7)]

I think this code is giving me the intended results:

 head(join)
  id second_most_recent_year second_most_recent_exam cumulative_exam_score_up_to_second_most_recent_exam most_recent_exam_score
1  1                    2012                       0                                           0.6666667                      1
2  2                    2012                       0                                           0.5000000                      0

But can someone please show me a more efficient way to do this?

Thanks!

Note: My code is designed to ignore students who have only taken 1 exam.
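
This exclusion happens implicitly: a student with a single exam only ever has rank == 1, so they never enter int_file_2 and never reach the final join. For what it's worth, an explicit version of the same filter would be something like:

my_data %>%
  group_by(id) %>%
  filter(n() > 1)   # keep students with at least 2 exams (drops id 3)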

CodePudding user response:

We could do this in a single chain as well

library(dplyr)
library(tidyr)
my_data %>%
  group_by(id) %>%
  mutate(rank = rank(-year)) %>%
  # grp is 1 for the most recent exam, 2 for all earlier exams
  group_by(grp = 1 + (rank != 1), .add = TRUE) %>%
  mutate(Cumulative_Mean = cummean(var_1)) %>%
  slice_min(order_by = rank) %>%
  ungroup() %>%
  mutate(grp = c("most_recent", "second_most_recent")[grp]) %>%
  select(-var_1) %>%
  pivot_wider(names_from = grp,
              values_from = c(year, rank, Cumulative_Mean),
              names_glue = "{grp}_{.value}") %>%
  drop_na() %>%
  select(id, starts_with('second'), everything())

-output

# A tibble: 4 × 7
     id second_most_recent_year second_most_recent_rank second_most_recent_Cumulative_Mean most_recent_year most_recent_rank most_recent_Cumulative_Mean
  <dbl>                   <dbl>                   <dbl>                              <dbl>            <dbl>            <dbl>                       <dbl>
1     1                    2012                       2                              0.667             2013                1                           1
2     2                    2012                       2                              0.5               2013                1                           0
3     4                    2014                       2                              1                 2015                1                           0
4     5                    2018                       2                              0.75              2019                1                           1
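
To unpack the grp = 1 + (rank != 1) step: the logical comparison is coerced to 0/1, so the most recent exam (rank 1) gets grp 1 and every earlier exam gets grp 2, which is later used to index the label vector. A quick illustration:

rank <- c(1, 2, 3, 4)
1 + (rank != 1)
# [1] 1 2 2 2
c("most_recent", "second_most_recent")[1 + (rank != 1)]
# [1] "most_recent"        "second_most_recent" "second_most_recent" "second_most_recent"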

CodePudding user response:

Here's another option:

my_data %>%
  arrange(id, year) %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  mutate(
    is_last = row_number() == n(),
    is_second_last = row_number() == (n() - 1)
  ) %>%
  summarize(
    second_last_score = var_1[is_second_last],
    second_last_year = year[is_second_last],
    mean_score_through_second_last = mean(var_1[!is_last]),
    last_score = var_1[is_last],
    .groups = "drop"
  )
# # A tibble: 4 × 5
#      id second_last_score second_last_year mean_score_through_second_last last_score
#   <dbl>             <dbl>            <dbl>                          <dbl>      <dbl>
# 1     1                 0             2012                          0.667          1
# 2     2                 0             2012                          0.5            0
# 3     4                 1             2014                          1              0
# 4     5                 1             2018                          0.75           1

Conceptually, I'd say "cumulative mean" is a bad framing here. The point of a cumulative mean is to give the mean of the first observation, and then the first and second, and then the first and second and third, ... but you don't want all of those means. You only want a single mean of all but the last observation. So I do that directly by flagging the last observation and using mean(var_1[!is_last]).
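
To make the difference concrete, here's a small sketch with student 1's pre-final scores: cummean() returns the whole running sequence, while a plain mean() of the non-final scores gives the single number the summary actually needs.

library(dplyr)

x <- c(1, 1, 0)  # student 1's var_1 values, excluding the most recent exam
cummean(x)       # 1.0000000 1.0000000 0.6666667 -- the full running sequence
mean(x)          # 0.6666667 -- the one value used per student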
