I have a dataframe looks like below (the true data has many more people):
Year Player Club
2005 Phelan Chicago Fire
2007 Phelan Boston Pant
2008 Phelan Boston Pant
2010 Phelan Chicago Fire
2002 John New York Jet
2006 John New York Jet
2007 John Atlanta Elephant
2009 John Los Angeles Eagle
I want to calculate a player level measure (count) for each player that captures the weighted number of club that a person experienced. The formula is (length of the experience 1/total years)^2 (length of the experience 2/total years)^2 ......
Below is the ideal output for the sample data. For example, "count" for Phelan equals to (2/6)^2 (3/6)^2 (1/6)^2=0.389 (assuming in 2006, which is missing data, Phelan stayed in Chicago Fire. And assuming in 2009, which is missing data, Phelan stayed in Boston Pant).
Player Count
Phelan 0.389
John 0.469
CodePudding user response:
Here, I cut out repeat Player-Club yrs. This could give wrong results if a player's data ends with repeats of the same club, but I can't immediately think of a better approach.
df %>%
group_by(Player) %>%
filter(Club != lag(Club, default = "")) %>%
mutate(yrs = coalesce(Year - lag(Year), 1)) %>%
summarize(Count = sum((yrs / sum(yrs))^2))
A tibble: 2 × 2
Player Count
<chr> <dbl>
1 John 0.469
2 Phelan 0.389
CodePudding user response:
Here's another tidyverse
option (though a little more verbose than @JonSpring's nice answer), where I fill in the series for the missing years and assume that in the missing years the player stays at the same place. Then, I create unique groups for the consecutive values for each Player, then get the number of observations, and finally apply the formula.
library(tidyverse)
df %>%
group_by(Player) %>%
complete(., Year = full_seq(Year, period = 1)) %>%
fill(Club, .direction = "down") %>%
group_by(Player, grp = with(rle(Club), rep(seq_along(lengths), lengths))) %>%
summarise(club_sum = sum(n())) %>%
summarise(count = sum((club_sum / sum(club_sum)) ^ 2))
Output
Player count
<chr> <dbl>
1 John 0.469
2 Phelan 0.389