R Dataframe Rolling Measures by Group-CodePudding

I have a dataframe looks like below (the true data has many more people):

Year   Player    Club
2005   Phelan    Chicago Fire 
2007   Phelan    Boston Pant
2008   Phelan    Boston Pant
2010   Phelan    Chicago Fire  
2002   John      New York Jet
2006   John      New York Jet
2007   John      Atlanta Elephant
2009   John      Los Angeles Eagle

I want to calculate a player level measure (count) for each row (year) that captures the weighted number of club that a person experienced up to that point. The formula is (length of the experience 1/total years up to that point)^2 (length of the experience 2/total years up to that point)^2 ......

Below is the ideal output for Phelan. For example, "count" for his first row is 1 as it is his first year in the data and (1/1)^2=1. For his second row, which includes three years (2005, 2006, 2007) up to this point, count=(1/3)^2 (2/3)=0.56 (assuming in 2006, which is missing data, Phelan also stayed in Chicago Fire). For his third row, count=(2/4)^2 (2/4)^2=0.5. For his fourth row, count=(3/6)^2 (3/6)^2=0.5 (assuming in 2009, which is missing data, Phelan also stayed in Boston Pant).

Year   Player    Club            Count
2005   Phelan    Chicago Fire    1
2007   Phelan    Boston Pant     0.56
2008   Phelan    Boston Pant     0.5
2010   Phelan    Chicago Fire    0.5

CodePudding user response：

This is a bit convoluted but I think it does what you want.

Using data.table:

library(data.table)
library(zoo)          # for na.locf(...)
##
expand.df <- setDT(df)[, .(Year=min(Year):max(Year)), by=.(Player)]
expand.df[df, Club:=i.Club, on=.(Player, Year)]
expand.df[, Club:=na.locf(Club)]
expand.df[, cuml.exp:=1:.N, by=.(Player)]
expand.df <- expand.df[expand.df[, .(Player, cuml.exp)], on=.(Player, cuml.exp <= cuml.exp)]
expand.df <- expand.df[, .(Year=max(Year), club.exp=sum(sapply(unique(Club), \(x) sum(Club==x)^2))), by=.(Player, cuml.exp)]
expand.df[, score:=club.exp/cuml.exp^2]
result <- expand.df[df, on=.(Player, Year), nomatch=NULL]
result[, .(Player, Year, Club, cuml.exp, club.exp, score)]
##    Player Year              Club cuml.exp club.exp     score
## 1: Phelan 2005      Chicago Fire        1        1 1.0000000
## 2: Phelan 2007       Boston Pant        3        5 0.5555556
## 3: Phelan 2008       Boston Pant        4        8 0.5000000
## 4: Phelan 2010      Chicago Fire        6       18 0.5000000
## 5:   John 2002      New York Jet        1        1 1.0000000
## 6:   John 2006      New York Jet        5       25 1.0000000
## 7:   John 2007  Atlanta Elephant        6       26 0.7222222
## 8:   John 2009 Los Angeles Eagle        8       30 0.4687500

So this expands your df to include one row per year per player, then joins back the clubs for the appropriate years, then fills the gaps per your description. Then we calculate cumulative years of experience for each player.

The next bit is the convoluted part: we need to expand further so that for each combination of player and cuml.exp we have all the rows up to that point. The join on=.(Player, cuml.exp <= cuml.exp) does that. Then we can count the number of instances of each club by player and cuml.exp to get the numerator of your score.

Then we calculate the scores, drop the extra years and the extra columns.

Note that this assumes you've got R 4.1 . If not, replcae \(x)... with function(x)....