Home > Software engineering >  How to use an index within another index to locate a change in a variable - R
How to use an index within another index to locate a change in a variable - R

Time:09-17

I have the following dataset.

id<-c(1001,1001,1001,1002,1002,1003,1004,1005,1005,1005)
year<-c(2010,2013,2016, 2013,2010,2010,2016,2016,2010,2013)
status<-c(2,2,2,3,4,2,1,1,1,5)
df<-data.frame(id, year, status)
df <- df[order(df$id, df$year), ]

My goal is to create a for-loop with two indices one for id and the other for year so that it runs through the id first and then within each id it looks at years in which there was a change in the status. To record the changes with this loop, I want another variable that shows in which the change happened. For example, in the dataframe below the variable change records 0 for id 1001 in all three years. But for 1002, a change in status is recorded with 1 in year 2013. For 1005, status changes twice, in 2013 and 2016, that's why 1 is recorded twice. btw, id is a character variable because the real data I am working on has alpha-numeric ids.

     id year status change
1  1001 2010      2   0
2  1001 2013      2   0
3  1001 2016      2   0
5  1002 2010      4   0
4  1002 2013      3   1
6  1003 2010      2   0
7  1004 2016      1   0
9  1005 2010      1   0
10 1005 2013      2   1
8  1005 2016      1   1

The actual dataframe has over 600k observations. Loop takes a lot of time running. I am open to faster solutions too.

My code is below:

df$change<-NA df$id<-as.character(df$id) for(id in unique(df$id)) {
    tau<-df$year[df$id==id]   if (length(tau)>1) {
    for( j in 1:(length(tau)-1)){ 
      if (df$status[df$year==tau[j] & df$id==id] != df$status[df$year==tau[j 1]& df$id==id]) {
       df$change[df$year==tau[j]    & df$id==id]<-0
       df$change[df$year==tau[j 1]  & df$id==id]<-1
    } else {
       df$change[df$year==tau[j]    & df$id==id]<-0
       df$change[df$year==tau[j 1]  & df$id==id]<-0
    }}}

CodePudding user response:

You could do:

Base R:

df |> 
 transform(change = ave(status, id, FUN = \(x)c(0, diff(x))!=0))

In tidyverse:

library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(change = c(0, diff(status)!=0))

      id  year status change
   <dbl> <dbl>  <dbl>  <dbl>
 1  1001  2010      2      0
 2  1001  2013      2      0
 3  1001  2016      2      0
 4  1002  2010      4      0
 5  1002  2013      3      1
 6  1003  2010      2      0
 7  1004  2016      1      0
 8  1005  2010      1      0
 9  1005  2013      5      1
10  1005  2016      1      1

CodePudding user response:

Does this yield the correct result?

library(dplyr)

id<-c(1001,1001,1001,1002,1002,1003,1004,1005,1005,1005)
year<-c(2010,2013,2016, 2013,2010,2010,2016,2016,2010,2013)
status<-c(2,2,2,3,4,2,1,1,1,5)
df<-data.frame(id, year, status)
df <- df[order(df$id, df$year), ]

df %>%
  group_by(id) %>%
  mutate(change = as.numeric(status != lag(status,
                                           default = first(status))))
#> # A tibble: 10 x 4
#>       id  year status change
#>    <dbl> <dbl>  <dbl>  <dbl>
#>  1  1001  2010      2      0
#>  2  1001  2013      2      0
#>  3  1001  2016      2      0
#>  4  1002  2010      4      0
#>  5  1002  2013      3      1
#>  6  1003  2010      2      0
#>  7  1004  2016      1      0
#>  8  1005  2010      1      0
#>  9  1005  2013      5      1
#> 10  1005  2016      1      1

Note: I put the "NA replacement" in a second mutate since this step does not have to be on the grouped data which is then faster for large datasets

CodePudding user response:

We can use ifelse with a logical comparison between status and lag(status). The key is the argument default = first(status), which eliminates common problems with NAs in the output.

df %>% group_by(id) %>%
mutate(change=ifelse(status==lag(status, default = first(status)), 0, 1))

# A tibble: 10 x 4
# Groups:   id [5]
      id  year status change
   <dbl> <dbl>  <dbl>  <dbl>
 1  1001  2010      2      0
 2  1001  2013      2      0
 3  1001  2016      2      0
 4  1002  2010      4      0
 5  1002  2013      3      1
 6  1003  2010      2      0
 7  1004  2016      1      0
 8  1005  2010      1      0
 9  1005  2013      5      1
10  1005  2016      1      1
  • Related