I have a dataframe like below (the real data has many more people and locations):
Year Player Location
2005 Phelan Chicago
2007 Phelan Boston
2008 Phelan Boston
2010 Phelan Chicago
2011 Phelan Boston
2002 John New York
2006 John New York
2007 John Boston
2009 John Chicago
I want to calculate a location level measure but still consider the player-level information. Specifically, I want to calculate a weighted measure for each location that describes how many different previous locations lead to the current location.
For example, for Chicago, Phelan came here in 2005 and then left. He returned in 2010 after staying in Boston. John came to Chicago in 2009 after staying in New York and Boston. Noting that a person can leave and return to a same place (like Chicago for Phelan) and for such cases, I only want to consider the most recent stay to avoid double/multiple counting.
Before Phelan's most recent stay at Chicago, he stayed in Chicago for 2 years (2005-2006, assuming 2006 Phelan stayed in Chicago), Boston for 3 years (2007-2009, assuming 2009 Phelan stayed in Boston). Before John's most recent stay at Chicago, he stayed in New York for 5 years (2002-2006), Boston for 2 years (2007-2008). For Chicago, based on Phelan and John's previous experience in Chicago, it accumulates 2 3 5 2=12 years of previous experience. Among these 12 years, 2 years are from Chicago, 5 years are from Boston, and 5 years are from New York. We can then calculate the measure for Chicago by (2/12)^2 (5/12)^2 (5/12)^2=0.375. This number is the weighted measure for Chicago that describes how many different previous locations lead to Chicago.
For Boston, before Phelan's most recent stay at Boston, he stayed in Chicago for 3 years (2005-2008,2009), Boston for 3 years (2007-2009). Before John's most recent stay at Boston, he stayed in New York for 5 years (2002-2006). We can then calculate the measure for New York by (3/11)^2 (3/11)^2 (5/11)^2=0.355.
Below is the sample output:
Location Weighted Measure
Chicago 0.375
Boston 0.355
New York NA
CodePudding user response:
Here is a rather verbose approach using tidyr
, dplyr
and data.table::rleid
- Create a more complete version of df, called
df_complete
that fills in any missing (assumed years), and tags the final visit for each player in each location
library(tidyr); library(dplyr); library(data.table)
df_complete = df %>%
group_by(Player) %>%
complete(Year = seq(min(Year), max(Year),1)) %>%
fill(Location) %>%
mutate(tag = data.table::rleid(Location)) %>%
group_by(Player,Location) %>%
mutate(tag=max(tag)==tag) %>%
ungroup()
- Next, for each unique location in the original frame,
df
, we uselapply()
to:
- restrict, by player, to rows where there is experience prior to coming to that location
- count the years by the prior locations
- estimate the weighted measure (
wt
), equal toNA
if no experience - return a single row frame
We wrap the above step in a do.call(rbind,...)
call to return the result
do.call(
rbind,
lapply(unique(df$Location), \(loc) {
ct = df_complete %>%
group_by(Player) %>%
mutate(minyear = min(Year[Location==loc & tag])) %>%
ungroup() %>%
filter(is.finite(minyear),Year<minyear) %>%
count(Location) %>%
pull(n)
wt = if_else(length(ct)>0, sum((ct/sum(ct))^2), as.double(NA))
data.frame(Locations = loc,"Weighted Measure" = wt)
}))
Output:
Locations Weighted.Measure
1 Chicago 0.3750000
2 Boston 0.3553719
3 New York NA
CodePudding user response:
Here's an answer that expands to all player-years, computes how many years were spent at each previous location prior to that year, filters to only years when there was a move, and then computes the score:
library(tidyverse)
df <- tribble(~Year, ~Player, ~Location,
2005, "Phelan", "Chicago",
2007, "Phelan", "Boston",
2008, "Phelan", "Boston",
2010, "Phelan", "Chicago",
2011, "Phelan", "Boston" ,
2002, "John", "New York",
2006, "John", "New York",
2007, "John", "Boston",
2009, "John", "Chicago")
locations <- unique(df$Location)
df %>%
group_by(Player) %>%
complete(Year = full_seq(Year, 1)) %>%
fill(Location) %>%
arrange(Player, Year) %>%
add_column(!!!set_names(rep(NA_real_, length(locations)), locations)) %>%
mutate(across(-c(Year, Location), ~lag(cumsum(Location == cur_column()))),
move = Location != lag(Location)) %>%
filter(move) %>%
group_by(Location, Player) %>%
slice_tail(n = 1) %>%
pivot_longer(all_of(locations)) %>%
group_by(Location, name) %>%
summarise(yrs = sum(value)) %>%
summarise(score = sum(map_dbl(yrs, ~.^2))/ sum(yrs)^2)
#> # A tibble: 2 × 2
#> Location score
#> <chr> <dbl>
#> 1 Boston 0.355
#> 2 Chicago 0.375
Created on 2022-05-26 by the reprex package (v2.0.1)