Home > Software design >  How to do a year to year comparison using R, dplyr?
How to do a year to year comparison using R, dplyr?

Time:04-04

I'm trying to do a year-to-year comparison for each player in a dataframe so I can plot Year 1 vs Year 2 in a scatterplot on the x and y axis' respectively. I think I need to use a for loop with a nested if_else statement but I feel like that may be overcomplicating things.

Here is what I have:

| player |season|average|
| ------ | ---- |-------|
| A.Lee  | 2019 |  45   |
| A.Lee  | 2020 |  40   |
| A.Lee  | 2021 |  46   |
| A.Cole | 2020 |  39   |
| A.Cole | 2021 |  42   |
    
Here are the desired results:
    
| player |  Y1  |   Y2    |
| ------ | ---- | ------- |
| A.Lee  |  45  |   40    |
| A.Lee  |  40  |   46    |
| A.Cole |  39  |   42    |

CodePudding user response:

Grouped by 'player', get the lead values of 'average' as 'Y1' and 'average' as 'Y2' within summarise and remove the NA elements (by default the default option in lead/lag are NA)

library(dplyr)
df %>% 
   arrange(player, desc(season)) %>%
   group_by(player) %>% 
   summarise(Y1 = lead(average), Y2 = average, .groups = 'drop') %>%
   na.omit

-output

# A tibble: 3 × 3
  player    Y1    Y2
  <chr>  <int> <int>
1 A.Cole    39    42
2 A.Lee     40    46
3 A.Lee     45    40

data

df <- structure(list(player = c("A.Lee", "A.Lee", "A.Lee", "A.Cole", 
"A.Cole"), season = c(2019, 2020, 2021, 2020, 2021), average = c(45L, 
40L, 46L, 39L, 42L)), row.names = c(NA, -5L), class = "data.frame")

CodePudding user response:

Here is a different approach using mutate(across... with lead function and filter and select:

library(dplyr)

df %>%
  mutate(across(c(season, average), lead, .names = "{.col}1")) %>% 
  mutate(diff = abs(season1 - season), .keep="unused") %>% 
  filter(diff ==1) %>% 
  dplyr::select(player, Y1 = average, Y2 = average1)
 
  player Y1 Y2
1  A.Lee 45 40
2  A.Lee 40 46
3 A.Cole 39 42
  • Related