I have the following dataset.
id<-c(1001,1001,1001,1002,1002,1003,1004,1005,1005,1005)
year<-c(2010,2013,2016, 2013,2010,2010,2016,2016,2010,2013)
status<-c(2,2,2,3,4,2,1,1,1,5)
df<-data.frame(id, year, status)
df <- df[order(df$id, df$year), ]
My goal is to create a for-loop with two indices one for id
and the other for year
so that it runs through the id
first and then within each id
it looks at years
in which there was a change in the status
. To record the changes with this loop, I want another variable that shows in which the change happened.
For example, in the dataframe below the variable change
records 0 for id
1001 in all three years. But for 1002, a change in status is recorded with 1 in year 2013. For 1005, status changes twice, in 2013 and 2016, that's why 1 is recorded twice. btw, id is a character variable because the real data I am working on has alpha-numeric ids.
id year status change
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
5 1002 2010 4 0
4 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
9 1005 2010 1 0
10 1005 2013 2 1
8 1005 2016 1 1
The actual dataframe has over 600k observations. Loop takes a lot of time running. I am open to faster solutions too.
My code is below:
df$change<-NA df$id<-as.character(df$id) for(id in unique(df$id)) {
tau<-df$year[df$id==id] if (length(tau)>1) {
for( j in 1:(length(tau)-1)){
if (df$status[df$year==tau[j] & df$id==id] != df$status[df$year==tau[j 1]& df$id==id]) {
df$change[df$year==tau[j] & df$id==id]<-0
df$change[df$year==tau[j 1] & df$id==id]<-1
} else {
df$change[df$year==tau[j] & df$id==id]<-0
df$change[df$year==tau[j 1] & df$id==id]<-0
}}}
CodePudding user response:
You could do:
Base R:
df |>
transform(change = ave(status, id, FUN = \(x)c(0, diff(x))!=0))
In tidyverse:
library(tidyverse)
df %>%
group_by(id) %>%
mutate(change = c(0, diff(status)!=0))
id year status change
<dbl> <dbl> <dbl> <dbl>
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
4 1002 2010 4 0
5 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
8 1005 2010 1 0
9 1005 2013 5 1
10 1005 2016 1 1
CodePudding user response:
Does this yield the correct result?
library(dplyr)
id<-c(1001,1001,1001,1002,1002,1003,1004,1005,1005,1005)
year<-c(2010,2013,2016, 2013,2010,2010,2016,2016,2010,2013)
status<-c(2,2,2,3,4,2,1,1,1,5)
df<-data.frame(id, year, status)
df <- df[order(df$id, df$year), ]
df %>%
group_by(id) %>%
mutate(change = as.numeric(status != lag(status,
default = first(status))))
#> # A tibble: 10 x 4
#> id year status change
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1001 2010 2 0
#> 2 1001 2013 2 0
#> 3 1001 2016 2 0
#> 4 1002 2010 4 0
#> 5 1002 2013 3 1
#> 6 1003 2010 2 0
#> 7 1004 2016 1 0
#> 8 1005 2010 1 0
#> 9 1005 2013 5 1
#> 10 1005 2016 1 1
Note: I put the "NA replacement" in a second mutate since this step does not have to be on the grouped data which is then faster for large datasets
CodePudding user response:
We can use ifelse
with a logical comparison between status
and lag(status)
. The key is the argument default = first(status)
, which eliminates common problems with NAs in the output.
df %>% group_by(id) %>%
mutate(change=ifelse(status==lag(status, default = first(status)), 0, 1))
# A tibble: 10 x 4
# Groups: id [5]
id year status change
<dbl> <dbl> <dbl> <dbl>
1 1001 2010 2 0
2 1001 2013 2 0
3 1001 2016 2 0
4 1002 2010 4 0
5 1002 2013 3 1
6 1003 2010 2 0
7 1004 2016 1 0
8 1005 2010 1 0
9 1005 2013 5 1
10 1005 2016 1 1