Home > Net >  Create increment variable based on a change in two columns
Create increment variable based on a change in two columns

Time:03-31

My data looks something like this:

id <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2)
var1 <- c(1,2,2,3,4,4,4,4,1,1,2,2,2,3)
var2 <- c(1,2,2,2,2,3,4,4,1,1,1,2,2,2)

df <- data.frame(id,var1,var2)

Which would look like this:

id  var1  var2
1   1     1
1   2     2
1   2     2
1   3     2
1   4     2
1   4     3
1   4     4
1   4     4
2   1     1 
2   1     1
2   2     1
2   2     2
2   2     2
2   3     2

I would like to create a new column which increments by 1 for each change in either var1 or var2 and resets for each id. So my desired output would be:

id  var1  var2  var3
1   1     1     1
1   2     2     2
1   2     2     2
1   3     2     3
1   4     2     4
1   4     3     5
1   4     4     6
1   4     4     6
2   1     1     1
2   1     1     1
2   2     1     2
2   2     2     3
2   2     2     3
2   3     2     4

I tried multiple things I found here, but nothing does exactly what I am looking for. For example both:

DT <- as.data.table(df, keep.rownames = T)
DT[, var3 := .GRP, by = list(id, var1, var2)][]

or 

indx <- as.character(interaction(df[c("id", "var1", "var2")]))
df$var3<- cumsum(c(TRUE,indx[-1]!=indx[-length(indx)]))

do not reset for a new subject. I appreciate your help.

CodePudding user response:

df %>% 
  group_by(id) %>% 
  mutate(var3 = cumsum(c(0, diff(var1) | diff(var2)) != 0)   1) %>%
  ungroup

# A tibble: 14 x 4
      id  var1  var2  var3
   <dbl> <dbl> <dbl> <dbl>
 1     1     1     1     1
 2     1     2     2     2
 3     1     2     2     2
 4     1     3     2     3
 5     1     4     2     4
 6     1     4     3     5
 7     1     4     4     6
 8     1     4     4     6
 9     2     1     1     1
10     2     1     1     1
11     2     2     1     2
12     2     2     2     3
13     2     2     2     3
14     2     3     2     4

CodePudding user response:

Using cumsum, increment by 1 if:

  • the previous value of var1 is different from the current value
  • the previous value of var2 is different from the current value.
library(tidyverse)

df %>% 
  group_by(id) %>% 
  mutate(var3 = cumsum(lag(var1, default = 0) != var1 | lag(var2, default = 0) != var2))

If you have more than two columns, you can use if_any:

df %>% 
  group_by(id) %>% 
  mutate(var3 = cumsum(if_any(var1:var2, ~ c(0, diff(.x)) != 0))   1)

output

# A tibble: 14 x 4
# Groups:   id [2]
      id  var1  var2  var3
   <dbl> <dbl> <dbl> <int>
 1     1     1     1     1
 2     1     2     2     2
 3     1     2     2     2
 4     1     3     2     3
 5     1     4     2     4
 6     1     4     3     5
 7     1     4     4     6
 8     1     4     4     6
 9     2     1     1     1
10     2     1     1     1
11     2     2     1     2
12     2     2     2     3
13     2     2     2     3
14     2     3     2     4

CodePudding user response:

Using data.table - shift can take multiple columns at once, by default the type = "lag", use Map to check if the corresponding values are not equal to the columns (after grouping by 'id') and then Reduce the logical columns to a single logical vector and get the cumsum

library(data.table)
setDT(df)[, var3 := cumsum(Reduce(`|`, Map(`!=` ,
     shift(.SD, fill = 0), .SD))), id]

-output

> df
       id  var1  var2  var3
    <num> <num> <num> <int>
 1:     1     1     1     1
 2:     1     2     2     2
 3:     1     2     2     2
 4:     1     3     2     3
 5:     1     4     2     4
 6:     1     4     3     5
 7:     1     4     4     6
 8:     1     4     4     6
 9:     2     1     1     1
10:     2     1     1     1
11:     2     2     1     2
12:     2     2     2     3
13:     2     2     2     3
14:     2     3     2     4
  • Related