I have a dataset which looks something like this:
ID | Year | Set | Value |
---|---|---|---|
1 | 2020 | A | 20 |
1 | 2020 | B | 30 |
1 | 2021 | C | 28 |
1 | 2021 | D | 21 |
1 | 2022 | E | 28 |
1 | 2022 | F | 21 |
2 | 2020 | A | 20 |
2 | 2020 | B | 30 |
2 | 2021 | C | 28 |
2 | 2021 | D | 21 |
3 | 2020 | A | 20 |
3 | 2020 | B | 30 |
So it's a panel dataset, which has people (indexed by ID), observed over multiple years, with different values of different things. I want to create a variable which takes on a value of TRUE if the person is present in the following year (like 1 for both years & 2 for 2020), and FALSE if it isn't (like 3 in 2020 and 2 in 2021). And there are a lot more than three years for some of them, but only one year for some. I feel like this should be easy using either dplyr or data.table, but I can't wrap my head around it. Any help would be most welcome!
CodePudding user response:
Here is one potential solution:
library(tidyverse)
df <- tibble::tribble(
~ID, ~Year, ~Set, ~Value,
1L, 2020L, "A", 20L,
1L, 2020L, "B", 30L,
1L, 2021L, "C", 28L,
1L, 2021L, "D", 21L,
1L, 2022L, "E", 28L,
1L, 2022L, "F", 21L,
2L, 2020L, "A", 20L,
2L, 2020L, "B", 30L,
2L, 2021L, "C", 28L,
2L, 2021L, "D", 21L,
3L, 2020L, "A", 20L,
3L, 2020L, "B", 30L
)
df %>%
group_by(ID) %>%
mutate(present_in_following_year = Year != max(Year))
#> # A tibble: 12 × 5
#> # Groups: ID [3]
#> ID Year Set Value present_in_following_year
#> <int> <int> <chr> <int> <lgl>
#> 1 1 2020 A 20 TRUE
#> 2 1 2020 B 30 TRUE
#> 3 1 2021 C 28 TRUE
#> 4 1 2021 D 21 TRUE
#> 5 1 2022 E 28 FALSE
#> 6 1 2022 F 21 FALSE
#> 7 2 2020 A 20 TRUE
#> 8 2 2020 B 30 TRUE
#> 9 2 2021 C 28 FALSE
#> 10 2 2021 D 21 FALSE
#> 11 3 2020 A 20 FALSE
#> 12 3 2020 B 30 FALSE
Created on 2022-12-06 with reprex v2.0.2
Does that solve your problem, or have I misunderstood the question?
CodePudding user response:
With data.table
:
library(data.table)
unique(dt[,.(ID,Year)])[,.(Year,next_year=fifelse(Year==shift(Year,-1)-1,T,F,na = F)),ID][
dt,on=.(ID,Year)]
# ID Year next_year Set Value
# <int> <int> <lgcl> <char> <int>
#1: 1 2020 TRUE A 20
#2: 1 2020 TRUE B 30
#3: 1 2021 TRUE C 28
#4: 1 2021 TRUE D 21
#5: 1 2022 FALSE E 28
#6: 1 2022 FALSE F 21
#7: 2 2020 TRUE A 20
#8: 2 2020 TRUE B 30
#9: 2 2021 FALSE C 28
#10: 2 2021 FALSE D 21
#11: 3 2020 FALSE A 20
#12: 3 2020 FALSE B 30
Data:
dt <- setDT(read.table(text='
ID Year Set Value
1 2020 A 20
1 2020 B 30
1 2021 C 28
1 2021 D 21
1 2022 E 28
1 2022 F 21
2 2020 A 20
2 2020 B 30
2 2021 C 28
2 2021 D 21
3 2020 A 20
3 2020 B 30',header=T))
CodePudding user response:
Options that don't rely on shifting/lagging or the order of data:
dplyr
library(dplyr)
panel %>%
group_by(ID) %>%
mutate(next_year = sapply(Year, function(Y) (Y 1) %in% Year)) %>%
ungroup()
# # A tibble: 12 x 5
# ID Year Set Value next_year
# <int> <int> <chr> <int> <lgl>
# 1 1 2020 A 20 TRUE
# 2 1 2020 B 30 TRUE
# 3 1 2021 C 28 TRUE
# 4 1 2021 D 21 TRUE
# 5 1 2022 E 28 FALSE
# 6 1 2022 F 21 FALSE
# 7 2 2020 A 20 TRUE
# 8 2 2020 B 30 TRUE
# 9 2 2021 C 28 FALSE
# 10 2 2021 D 21 FALSE
# 11 3 2020 A 20 FALSE
# 12 3 2020 B 30 FALSE
base R
panel$next_year <- ave(panel$Year, panel$ID,
FUN = function(Y) sapply(Y, function(z) (z 1) %in% Y)) > 0
## same results
data.table
library(data.table)
setDT(panel)[, next_year := sapply(Year, function(Y) (Y 1) %in% Year),
by = .(ID)]