Home > Blockchain >  R Dataframe Weighted Averages by Group
R Dataframe Weighted Averages by Group

Time:05-06

I have a dataframe looks like below (the true data has many more people):

Year   Player    Club
2005   Phelan    Chicago Fire 
2007   Phelan    Boston Pant
2008   Phelan    Boston Pant
2010   Phelan    Chicago Fire  
2002   John      New York Jet
2006   John      New York Jet
2007   John      Atlanta Elephant
2009   John      Los Angeles Eagle

I want to calculate a player level measure (count) for each player that captures the weighted number of club that a person experienced. The formula is (length of the experience 1/total years)^2 (length of the experience 2/total years)^2 ......

Below is the ideal output for the sample data. For example, "count" for Phelan equals to (2/6)^2 (3/6)^2 (1/6)^2=0.389 (assuming in 2006, which is missing data, Phelan stayed in Chicago Fire. And assuming in 2009, which is missing data, Phelan stayed in Boston Pant).

Player    Count
Phelan    0.389
John      0.469

CodePudding user response:

Here, I cut out repeat Player-Club yrs. This could give wrong results if a player's data ends with repeats of the same club, but I can't immediately think of a better approach.

df %>%
  group_by(Player) %>%
  filter(Club != lag(Club, default = "")) %>%
  mutate(yrs = coalesce(Year - lag(Year), 1)) %>%
  summarize(Count = sum((yrs / sum(yrs))^2))


 A tibble: 2 × 2
  Player Count
  <chr>  <dbl>
1 John   0.469
2 Phelan 0.389

CodePudding user response:

Here's another tidyverse option (though a little more verbose than @JonSpring's nice answer), where I fill in the series for the missing years and assume that in the missing years the player stays at the same place. Then, I create unique groups for the consecutive values for each Player, then get the number of observations, and finally apply the formula.

library(tidyverse)

df %>%
  group_by(Player) %>%
  complete(., Year = full_seq(Year, period = 1)) %>%
  fill(Club, .direction = "down") %>%
  group_by(Player, grp = with(rle(Club), rep(seq_along(lengths), lengths))) %>%
  summarise(club_sum = sum(n())) %>%
  summarise(count = sum((club_sum / sum(club_sum)) ^ 2))

Output

  Player count
  <chr>  <dbl>
1 John   0.469
2 Phelan 0.389
  • Related