Home > Enterprise >  Flag only the first time a column value changes rowwise per id in R
Flag only the first time a column value changes rowwise per id in R

Time:05-19

I'll get straight to the point. I found some code on stackoverflow that partially works.

df1 <- read.table(text = "
ID  V1    V2
A   X     SBI123
A   Y     SBI123
A   Y     SBI133
B   A     SBI888
B   A     SBI999
B   B     SBI999
", header = TRUE)

# Rowwise comparison per column
setDT(df1)[, flag_V1 := 0][V1!="", flag_V1 := 1*(rleid(V1)-1 > 0), by = ID]
setDT(df1)[, flag_V2 := 0][V2!="", flag_V2 := 1*(rleid(V2)-1 > 0), by = ID]
df1

# Output
   ID V1     V2 flag_V1 flag_V2
1:  A  X SBI123       0       0
2:  A  Y SBI123       1       0
3:  A  Y SBI133       1       1
4:  B  A SBI888       0       0
5:  B  A SBI999       0       1
6:  B  B SBI999       1       1

So the first time the V1 value for ID 'A' changes the 'flag_V1' is 1 which is correct. What I want is the third line to be 0. I understand that the code compares all the column values to the first row which makes this code correct but I only want the first time the value changes to be flagged. The desired output:

# Desired output
   ID V1     V2 flag_V1 flag_V2
1:  A  X SBI123       0       0
2:  A  Y SBI123       1       0
3:  A  Y SBI133       0       1
4:  B  A SBI888       0       0
5:  B  A SBI999       0       1
6:  B  B SBI999       1       0

I hope this makes sense and that someone can help me out, thanks!

CodePudding user response:

In dplyr, you can use across and lag. Basically, the value is 1 when the previous value is different from the actual value, and 0 otherwise.

library(dplyr)

df1 %>% 
  group_by(ID) %>% 
  mutate(across(V1:V2, ~  (lag(.x, default = first(.x)) != .x), .names = "flag_{col}"))

# A tibble: 6 × 5
# Groups:   ID [2]
  ID    V1    V2     flag_V1 flag_V2
  <chr> <chr> <chr>    <int>   <int>
1 A     X     SBI123       0       0
2 A     Y     SBI123       1       0
3 A     Y     SBI133       0       1
4 B     A     SBI888       0       0
5 B     A     SBI999       0       1
6 B     B     SBI999       1       0

CodePudding user response:

You can just convert the duplicates to 0, i.e. for V1,

library(data.table)

setDT(df1)[, flag_V1 := 0][
           V1!="", flag_V1 := 1*(rleid(V1)-1 > 0), by = ID][, 
                   lapply(.SD, function(i) replace(i, duplicated(i), 0)), by = .(ID, V1)][]

   ID V1     V2 flag_V1
1:  A  X SBI123       0
2:  A  Y SBI123       1
3:  A  Y SBI133       0
4:  B  A SBI888       0
5:  B  A SBI999       0
6:  B  B SBI999       1
  • Related