R Dataframe Cross-Column Group-Level Summary-CodePudding

I have a dataframe like below (the real data has many more people and locations):

Year   Player    Location
2005   Phelan    Chicago 
2007   Phelan    Boston 
2008   Phelan    Boston 
2010   Phelan    Chicago  
2011   Phelan    Boston  
        
2002   John      New York 
2006   John      New York 
2007   John      Boston 
2009   John      Chicago

I want to calculate a location level measure but still consider the player-level information. Specifically, I want to calculate a weighted measure for each location that describes how many different previous locations lead to the current location.

For example, for Chicago, Phelan came here in 2005 and then left. He returned in 2010 after staying in Boston. John came to Chicago in 2009 after staying in New York and Boston. Noting that a person can leave and return to a same place (like Chicago for Phelan) and for such cases, I only want to consider the most recent stay to avoid double/multiple counting.

Before Phelan's most recent stay at Chicago, he stayed in Chicago for 2 years (2005-2006, assuming 2006 Phelan stayed in Chicago), Boston for 3 years (2007-2009, assuming 2009 Phelan stayed in Boston). Before John's most recent stay at Chicago, he stayed in New York for 5 years (2002-2006), Boston for 2 years (2007-2008). For Chicago, based on Phelan and John's previous experience in Chicago, it accumulates 2 3 5 2=12 years of previous experience. Among these 12 years, 2 years are from Chicago, 5 years are from Boston, and 5 years are from New York. We can then calculate the measure for Chicago by (2/12)^2 (5/12)^2 (5/12)^2=0.375. This number is the weighted measure for Chicago that describes how many different previous locations lead to Chicago.

For Boston, before Phelan's most recent stay at Boston, he stayed in Chicago for 3 years (2005-2008,2009), Boston for 3 years (2007-2009). Before John's most recent stay at Boston, he stayed in New York for 5 years (2002-2006). We can then calculate the measure for New York by (3/11)^2 (3/11)^2 (5/11)^2=0.355.

Below is the sample output:

Location       Weighted Measure
Chicago        0.375
Boston         0.355
New York       NA

CodePudding user response：

Here is a rather verbose approach using tidyr, dplyr and data.table::rleid

Create a more complete version of df, called df_complete that fills in any missing (assumed years), and tags the final visit for each player in each location

library(tidyr); library(dplyr); library(data.table)

df_complete = df %>% 
  group_by(Player) %>%
  complete(Year = seq(min(Year), max(Year),1)) %>%
  fill(Location) %>% 
  mutate(tag = data.table::rleid(Location)) %>%
  group_by(Player,Location) %>%
  mutate(tag=max(tag)==tag) %>% 
  ungroup()

Next, for each unique location in the original frame, df, we use lapply() to:

restrict, by player, to rows where there is experience prior to coming to that location
count the years by the prior locations
estimate the weighted measure (wt), equal to NA if no experience
return a single row frame

We wrap the above step in a do.call(rbind,...) call to return the result

do.call(
  rbind,
  lapply(unique(df$Location), \(loc) {
  ct = df_complete %>% 
    group_by(Player) %>% 
    mutate(minyear = min(Year[Location==loc & tag])) %>% 
    ungroup() %>% 
    filter(is.finite(minyear),Year<minyear) %>% 
    count(Location) %>% 
    pull(n)

  wt = if_else(length(ct)>0, sum((ct/sum(ct))^2), as.double(NA))
           
  data.frame(Locations = loc,"Weighted Measure" = wt)
}))

Output:

  Locations Weighted.Measure
1   Chicago        0.3750000
2    Boston        0.3553719
3  New York               NA

CodePudding user response：

Here's an answer that expands to all player-years, computes how many years were spent at each previous location prior to that year, filters to only years when there was a move, and then computes the score:

library(tidyverse)

df <- tribble(~Year,   ~Player,    ~Location,
              2005,   "Phelan",    "Chicago", 
              2007,   "Phelan",    "Boston", 
              2008,   "Phelan",    "Boston", 
              2010,   "Phelan",    "Chicago",  
              2011,   "Phelan",    "Boston" , 
              2002,   "John",    "New York", 
              2006,   "John",    "New York", 
              2007,   "John",    "Boston", 
              2009,   "John",    "Chicago")

locations <- unique(df$Location)

df %>% 
  group_by(Player) %>% 
  complete(Year = full_seq(Year, 1)) %>% 
  fill(Location) %>% 
  arrange(Player, Year) %>% 
  add_column(!!!set_names(rep(NA_real_, length(locations)), locations)) %>% 
  mutate(across(-c(Year, Location), ~lag(cumsum(Location == cur_column()))),
         move = Location != lag(Location)) %>% 
  filter(move) %>% 
  group_by(Location, Player) %>% 
  slice_tail(n = 1) %>% 
  pivot_longer(all_of(locations)) %>% 
  group_by(Location, name) %>% 
  summarise(yrs = sum(value)) %>% 
  summarise(score = sum(map_dbl(yrs, ~.^2))/ sum(yrs)^2)

#> # A tibble: 2 × 2
#>   Location score
#>   <chr>    <dbl>
#> 1 Boston   0.355
#> 2 Chicago  0.375

^{Created on 2022-05-26 by the reprex package (v2.0.1)}