I have a dataframe looks like below (the true data has many more people):
Year Player Club
2005 Phelan Chicago Fire
2007 Phelan Boston Pant
2008 Phelan Boston Pant
2010 Phelan Chicago Fire
2002 John New York Jet
2006 John New York Jet
2007 John Atlanta Elephant
2009 John Los Angeles Eagle
I want to calculate a player level measure (count) for each row (year) that captures the weighted number of club that a person experienced up to that point. The formula is (length of the experience 1/total years up to that point)^2 (length of the experience 2/total years up to that point)^2 ......
Below is the ideal output for Phelan. For example, "count" for his first row is 1 as it is his first year in the data and (1/1)^2=1. For his second row, which includes three years (2005, 2006, 2007) up to this point, count=(1/3)^2 (2/3)=0.56 (assuming in 2006, which is missing data, Phelan also stayed in Chicago Fire). For his third row, count=(2/4)^2 (2/4)^2=0.5. For his fourth row, count=(3/6)^2 (3/6)^2=0.5 (assuming in 2009, which is missing data, Phelan also stayed in Boston Pant).
Year Player Club Count
2005 Phelan Chicago Fire 1
2007 Phelan Boston Pant 0.56
2008 Phelan Boston Pant 0.5
2010 Phelan Chicago Fire 0.5
CodePudding user response:
This is a bit convoluted but I think it does what you want.
Using data.table
:
library(data.table)
library(zoo) # for na.locf(...)
##
expand.df <- setDT(df)[, .(Year=min(Year):max(Year)), by=.(Player)]
expand.df[df, Club:=i.Club, on=.(Player, Year)]
expand.df[, Club:=na.locf(Club)]
expand.df[, cuml.exp:=1:.N, by=.(Player)]
expand.df <- expand.df[expand.df[, .(Player, cuml.exp)], on=.(Player, cuml.exp <= cuml.exp)]
expand.df <- expand.df[, .(Year=max(Year), club.exp=sum(sapply(unique(Club), \(x) sum(Club==x)^2))), by=.(Player, cuml.exp)]
expand.df[, score:=club.exp/cuml.exp^2]
result <- expand.df[df, on=.(Player, Year), nomatch=NULL]
result[, .(Player, Year, Club, cuml.exp, club.exp, score)]
## Player Year Club cuml.exp club.exp score
## 1: Phelan 2005 Chicago Fire 1 1 1.0000000
## 2: Phelan 2007 Boston Pant 3 5 0.5555556
## 3: Phelan 2008 Boston Pant 4 8 0.5000000
## 4: Phelan 2010 Chicago Fire 6 18 0.5000000
## 5: John 2002 New York Jet 1 1 1.0000000
## 6: John 2006 New York Jet 5 25 1.0000000
## 7: John 2007 Atlanta Elephant 6 26 0.7222222
## 8: John 2009 Los Angeles Eagle 8 30 0.4687500
So this expands your df
to include one row per year per player, then joins back the clubs for the appropriate years, then fills the gaps per your description. Then we calculate cumulative years of experience for each player.
The next bit is the convoluted part: we need to expand further so that for each combination of player and cuml.exp we have all the rows up to that point. The join on=.(Player, cuml.exp <= cuml.exp)
does that. Then we can count the number of instances of each club by player and cuml.exp to get the numerator of your score.
Then we calculate the scores, drop the extra years and the extra columns.
Note that this assumes you've got R 4.1 . If not, replcae \(x)...
with function(x)...
.