Home > Net >  Using dplyr in R to check whether observations are present in the next time period in panel data
Using dplyr in R to check whether observations are present in the next time period in panel data

Time:12-06

I have a dataset which looks something like this:

ID Year Set Value
1 2020 A 20
1 2020 B 30
1 2021 C 28
1 2021 D 21
1 2022 E 28
1 2022 F 21
2 2020 A 20
2 2020 B 30
2 2021 C 28
2 2021 D 21
3 2020 A 20
3 2020 B 30

So it's a panel dataset, which has people (indexed by ID), observed over multiple years, with different values of different things. I want to create a variable which takes on a value of TRUE if the person is present in the following year (like 1 for both years & 2 for 2020), and FALSE if it isn't (like 3 in 2020 and 2 in 2021). And there are a lot more than three years for some of them, but only one year for some. I feel like this should be easy using either dplyr or data.table, but I can't wrap my head around it. Any help would be most welcome!

CodePudding user response:

Here is one potential solution:

library(tidyverse)

df <- tibble::tribble(
        ~ID, ~Year, ~Set, ~Value,
         1L, 2020L,  "A",    20L,
         1L, 2020L,  "B",    30L,
         1L, 2021L,  "C",    28L,
         1L, 2021L,  "D",    21L,
         1L, 2022L,  "E",    28L,
         1L, 2022L,  "F",    21L,
         2L, 2020L,  "A",    20L,
         2L, 2020L,  "B",    30L,
         2L, 2021L,  "C",    28L,
         2L, 2021L,  "D",    21L,
         3L, 2020L,  "A",    20L,
         3L, 2020L,  "B",    30L
        )

df %>%
  group_by(ID) %>%
  mutate(present_in_following_year = Year != max(Year))
#> # A tibble: 12 × 5
#> # Groups:   ID [3]
#>       ID  Year Set   Value present_in_following_year
#>    <int> <int> <chr> <int> <lgl>                    
#>  1     1  2020 A        20 TRUE                     
#>  2     1  2020 B        30 TRUE                     
#>  3     1  2021 C        28 TRUE                     
#>  4     1  2021 D        21 TRUE                     
#>  5     1  2022 E        28 FALSE                    
#>  6     1  2022 F        21 FALSE                    
#>  7     2  2020 A        20 TRUE                     
#>  8     2  2020 B        30 TRUE                     
#>  9     2  2021 C        28 FALSE                    
#> 10     2  2021 D        21 FALSE                    
#> 11     3  2020 A        20 FALSE                    
#> 12     3  2020 B        30 FALSE

Created on 2022-12-06 with reprex v2.0.2

Does that solve your problem, or have I misunderstood the question?

CodePudding user response:

With data.table:

library(data.table)

unique(dt[,.(ID,Year)])[,.(Year,next_year=fifelse(Year==shift(Year,-1)-1,T,F,na = F)),ID][
                        dt,on=.(ID,Year)]

#       ID  Year next_year    Set Value
#    <int> <int>    <lgcl> <char> <int>
#1:      1  2020      TRUE      A    20
#2:      1  2020      TRUE      B    30
#3:      1  2021      TRUE      C    28
#4:      1  2021      TRUE      D    21
#5:      1  2022     FALSE      E    28
#6:      1  2022     FALSE      F    21
#7:      2  2020      TRUE      A    20
#8:      2  2020      TRUE      B    30
#9:      2  2021     FALSE      C    28
#10:     2  2021     FALSE      D    21
#11:     3  2020     FALSE      A    20
#12:     3  2020     FALSE      B    30

Data:

dt <- setDT(read.table(text='
ID  Year    Set     Value
1   2020    A   20
1   2020    B   30
1   2021    C   28
1   2021    D   21
1   2022    E   28
1   2022    F   21
2   2020    A   20
2   2020    B   30
2   2021    C   28
2   2021    D   21
3   2020    A   20
3   2020    B   30',header=T))

CodePudding user response:

Options that don't rely on shifting/lagging or the order of data:

dplyr

library(dplyr)
panel %>%
  group_by(ID) %>%
  mutate(next_year = sapply(Year, function(Y) (Y 1) %in% Year)) %>%
  ungroup()
# # A tibble: 12 x 5
#       ID  Year Set   Value next_year
#    <int> <int> <chr> <int> <lgl>    
#  1     1  2020 A        20 TRUE     
#  2     1  2020 B        30 TRUE     
#  3     1  2021 C        28 TRUE     
#  4     1  2021 D        21 TRUE     
#  5     1  2022 E        28 FALSE    
#  6     1  2022 F        21 FALSE    
#  7     2  2020 A        20 TRUE     
#  8     2  2020 B        30 TRUE     
#  9     2  2021 C        28 FALSE    
# 10     2  2021 D        21 FALSE    
# 11     3  2020 A        20 FALSE    
# 12     3  2020 B        30 FALSE    

base R

panel$next_year <- ave(panel$Year, panel$ID,
                       FUN = function(Y) sapply(Y, function(z) (z 1) %in% Y)) > 0
## same results

data.table

library(data.table)
setDT(panel)[, next_year := sapply(Year, function(Y) (Y 1) %in% Year),
             by = .(ID)]
  • Related