R - manipulating time series data

I have a time-series dataset in long format with yearly values over 30 years for more than 200,000 study units. Every unit starts as 'healthy==1' and can transition to three other classes: 'exposed==2', 'infected==3' and 'recover==4'. Some units remain 'healthy' throughout the time series.

I would like to reshape the dataset so that it keeps all 30 years for each unit but collapses the classes to only 'healthy==1' and 'infected==3'. That is, 'exposed==2' would be classified as 'healthy==1', and from the first time a 'healthy' unit becomes 'infected==3' it stays infected for the remainder of the time series, even if it later changes to 'recover==4' or changes state again (gets infected and recovers again). Healthy units that never transition to another class remain classified as healthy throughout the time series.

I am kinda stumped on how to code this in R; any ideas would be greatly appreciated.

Example of the dataset for two units: one remains healthy throughout the time series and the other has multiple transitions.

            UID annual_change_val year
1      control1                 1 1990
4      control1                 1 1991
5      control1                 1 1992
7      control1                 1 1993
9      control1                 1 1994
12     control1                 1 1995
13     control1                 1 1996
16     control1                 1 1997
18     control1                 1 1998
20     control1                 1 1999
22     control1                 1 2000
24     control1                 1 2001
26     control1                 1 2002
28     control1                 1 2003
30     control1                 1 2004
31     control1                 1 2005
33     control1                 1 2006
35     control1                 1 2007
38     control1                 1 2008
40     control1                 1 2009
42     control1                 1 2010
44     control1                 1 2011
46     control1                 1 2012
48     control1                 1 2013
50     control1                 1 2014
52     control1                 1 2015
53     control1                 1 2016
55     control1                 1 2017
57     control1                 1 2018
59     control1                 1 2019
61     control1                 1 2020
2  control64167                 1 1990
3  control64167                 1 1991
6  control64167                 1 1992
8  control64167                 2 1993
10 control64167                 2 1994
11 control64167                 2 1995
14 control64167                 2 1996
15 control64167                 2 1997
17 control64167                 3 1998
19 control64167                 3 1999
21 control64167                 4 2000
23 control64167                 4 2001
25 control64167                 4 2002
27 control64167                 4 2003
29 control64167                 3 2004
32 control64167                 4 2005
34 control64167                 4 2006
36 control64167                 4 2007
37 control64167                 4 2008
39 control64167                 4 2009
41 control64167                 4 2010
43 control64167                 4 2011
45 control64167                 4 2012
47 control64167                 4 2013
49 control64167                 4 2014
51 control64167                 4 2015
54 control64167                 4 2016
56 control64167                 4 2017
58 control64167                 4 2018
60 control64167                 4 2019
62 control64167                 4 2020

CodePudding user response：

If for some reason you only want to use base R,

df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3

The first line means: take the annual_change_val column from ($) dataframe df, subset it ([) so that you're only left with values equal to 2, and re-assign (<-) to those a value of 1 instead. Similarly for the second line.
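
Note that recoding alone lets a unit drop back to 1 if its raw value returns to 1 or 2 after an infection. Below is a minimal base R sketch of the "stays infected" rule, assuming the rows can be sorted by year within each UID. The toy data frame and the status column name are made up for illustration:

```r
# Toy data (hypothetical values): "a" stays healthy, "b" is first infected in 1992
df <- data.frame(
  UID = rep(c("a", "b"), each = 5),
  year = rep(1990:1994, times = 2),
  annual_change_val = c(1, 1, 2, 1, 1,
                        1, 2, 3, 4, 1)
)

# Sort so that cummax() walks through each unit in chronological order
df <- df[order(df$UID, df$year), ]

# Collapse to two classes: 1/2 -> 1 (healthy), 3/4 -> 3 (infected)
df$status <- ifelse(df$annual_change_val %in% c(1, 2), 1, 3)

# Running maximum within each UID: once status reaches 3 it never drops back
df$status <- ave(df$status, df$UID, FUN = cummax)
```

In this toy example unit "b" returns to a raw value of 1 in 1994, but its status stays at 3 from 1992 onwards, while unit "a" stays at 1 throughout.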

CodePudding user response：

Update, based on comment/clarification.

Here, I recode the values as before and then create a temporary variable called first_inf, which holds the first year in which the UID was "infected" (status == 3), or NA for units that are never infected. I then set the status to 3 for every later year within that UID.

library(dplyr)

d %>%
  mutate(status = if_else(annual_change_val %in% c(1, 2), 1, 3)) %>%
  group_by(UID) %>%
  # first infected year per unit; NA when the unit is never infected
  mutate(first_inf = if (any(status == 3)) min(year[status == 3]) else NA_integer_,
         status = if_else(!is.na(first_inf) & year > first_inf, 3, status)) %>%
  ungroup() %>%
  select(!first_inf)

You can simply change the values from 2 to 1, and from 4 to 3, as Andrea mentioned in the comments. If d is your data, then

library(dplyr)
d %>% mutate(status = if_else(annual_change_val %in% c(1,2),1,3))

library(data.table)
setDT(d)[, status:=fifelse(annual_change_val %in% c(1,2),1,3)]
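
With more than 200,000 units, the grouped "stays infected" step may be worth doing in data.table as well. The following is only a sketch on made-up toy data. It assumes that sorting by UID and year is acceptable, and it uses a running maximum (cummax) so that a unit stays at 3 after its first infected year:

```r
library(data.table)

# Toy data (hypothetical values): "a" stays healthy, "b" is first infected in 1992
d <- data.table(
  UID = rep(c("a", "b"), each = 5),
  year = rep(1990:1994, times = 2),
  annual_change_val = c(1, 1, 2, 1, 1,
                        1, 2, 3, 4, 1)
)

# Chronological order within each unit, so cummax() runs forward in time
setorder(d, UID, year)

# Recode 1/2 -> 1 and 3/4 -> 3, then take the running maximum per UID
d[, status := cummax(fifelse(annual_change_val %in% c(1, 2), 1, 3)), by = UID]
```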