Creating column that takes mean of X for specific group id when date1

in (R) for an event study I'm trying to create a column that calculates the mean of ccu_avg for a specific combination of appid and Eventdate1. One appid has multiple events so it has to be divided by both appid and Eventdate1. The difficult thing here is that I want it to calculate the mean only up until the event date since after the event happened the estimation period stops

The new column should look like est_ccu_avg: picture of the dataset below for explanation

https://i.stack.imgur.com/ZPquW.png

Could someone help me figure the code for this out? I've been trying for hours and can't seem to get it to work.

I've now been trying things like this but without success:

study <- study %>%
  mutate(est_ccu_avg=
                mean(study[unique(study$appid) | study$Eventdate1 > 
study$datefinal, "ccu_avg"])
            
            )

Result of dput head:

structure(list(appid = c("105600", "105600", "105600", "105600", 
"105600", "105600"), name = c("Terraria", "Terraria", "Terraria", 
"Terraria", "Terraria", "Terraria"), ccu_avg = c(26825, 29058, 
37842, 37525, 26484, 24377), ccu_min = c(21176, 21620, 28954, 
32880, 19648, 19118), ccu_max = c(35827, 41322, 50012, 44071, 
33241, 32060), pos_max = c(356186, 356363, 356508, 356712, 356921, 
357092), neg_max = c(6756, 6756, 6758, 6768, 6766, 6768), Maj_Upt = 
c(0, 
0, 0, 0, 0, 0), Min_Upt = c(0, 0, 0, 0, 0, 0), Hotfix = c(0, 
0, 0, 0, 0, 0), Bugfix = c(0, 0, 0, 0, 0, 0), Balance = c(0, 
0, 0, 0, 0, 0), ExpBranch = c(0, 0, 0, 0, 0, 0), Promo = c(0, 
1, 0, 0, 0, 0), Ev_Out = c(0, 0, 0, 0, 0, 0), Ev_In = c(0, 0, 
0, 0, 0, 0), isfree = c(0, 0, 0, 0, 0, 0), developers1 = c("Re- 
Logic", 
"Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic"), 
publishers1 = c("Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic", 
"Re-Logic", "Re-Logic"), metascore = c(83, 83, 83, 83, 83, 
83), singleplayer = c(1, 1, 1, 1, 1, 1), multiplayer = c(1, 
1, 1, 1, 1, 1), coop = c(1, 1, 1, 1, 1, 1), mmo = c(0, 0, 
0, 0, 0, 0), indie = c(1, 1, 1, 1, 1, 1), single_player_gen = c(0, 
0, 0, 0, 0, 0), adventure = c(1, 1, 1, 1, 1, 1), casual = c(0, 
0, 0, 0, 0, 0), strategy = c(0, 0, 0, 0, 0, 0), rpg = c(1, 
1, 1, 1, 1, 1), simulation = c(0, 0, 0, 0, 0, 0), multi_player_gen = 
c(0, 
0, 0, 0, 0, 0), shooter = c(0, 0, 0, 0, 0, 0), platformer = c(0, 
0, 0, 0, 0, 0), ea_min = c(0, 0, 0, 0, 0, 0), ea_max = c(0, 
0, 0, 0, 0, 0), scifi = c(0, 0, 0, 0, 0, 0), sports = c(0, 
0, 0, 0, 0, 0), racing = c(0, 0, 0, 0, 0, 0), inappurchase = c(0, 
0, 0, 0, 0, 0), workshop = c(0, 0, 0, 0, 0, 0), f_release_date = 
c("May 16, 2011", 
"May 16, 2011", "May 16, 2011", "May 16, 2011", "May 16, 2011", 
"May 16, 2011"), l_release_date = c("May 16, 2011", "May 16, 2011", 
"May 16, 2011", "May 16, 2011", "May 16, 2011", "May 16, 2011"
), datefinal = structure(c(18942, 18943, 18944, 18945, 18946, 
18947), class = "Date"), Eventdate = c("", "", "", "", "", 
""), Eventdate1 = structure(c(18949, 18949, 18949, 18949, 
18949, 18949), class = "Date"), est_ccu_avg = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

CodePudding user response：

I figured it out, there probably is an easier way but this is how I did it:

# first make a list with only the rows where eventdate > datefinal to only 
include estimation period.
estmeans <- study[study$Eventdate1 > study$datefinal,]

# calculate means per appid and eventdate
studymeans <- aggregate(estmeans$ccu_avg, list(estmeans$appid, 
estmeans$Eventdate1), mean)

# change the names for merging 
names(studymeans)[1] <- 'appid'
names(studymeans)[2] <- 'Eventdate1'
names(studymeans)[3] <- 'est_ccu_avg'

# merge the dataframes, it creates 2 new columns, delete the empty one.
studynew <- merge(study, studymeans, by=c("appid", "Eventdate1"))
studynew$est_ccu_avg.x <- NULL

CodePudding user response：

You can leverage the special .BY, to refer to the grouping variable, when using data.table

library(data.table)
setDT(df)[, mean(ccu_avg[datefinal<=.BY$Eventdate1]), by=.(appid, Eventdate1)]

The equivalent in dplyr is cur_group().

df %>%
  group_by(appid,Eventdate1) %>%
  summarize(res = mean(ccu_avg[datefinal<=cur_group()$Eventdate1))