Remove duplicate among consecutive values within a dataframe in R-CodePudding

I have a datafram such as

COL1 COL2 COL3
G1 1 6
G1 2 6
G1 3 7
G1 4 9
G1 5 9
G1 6 9
G1 7 6
G1 8 6
G1 9 7
G1 10 7
G1 11 7
G1 12 8
G1 13 7

and I would like to remove duplicate of consecutive COL3 values and keep the first.

here I should then get:

COL1 COL2 COL3
G1 1 6
G1 3 7
G1 4 9
G1 7 6
G1 9 7
G1 12 8
G1 13 7

Here is the dput format if it can helps:

structure(list(COL1 = c("G1", "G1", "G1", "G1", "G1", "G1", "G1", 
"G1", "G1", "G1", "G1", "G1", "G1"), COL2 = 1:13, COL3 = c(6L, 
6L, 7L, 9L, 9L, 9L, 6L, 6L, 7L, 7L, 7L, 8L, 7L)), class = "data.frame", row.names = c(NA, 
-13L))

CodePudding user response：

In base R, you can use run-length encoding (rle):

df[nrow(df) - rev(cumsum(rle(rev(df$COL3))$lengths) - 1),]
#>    COL1 COL2 COL3
#> 1    G1    1    6
#> 3    G1    3    7
#> 4    G1    4    9
#> 7    G1    7    6
#> 9    G1    9    7
#> 12   G1   12    8
#> 13   G1   13    7

Or rleid from data.table

df[c(TRUE, diff(data.table::rleid(df$COL3)) == 1),]
#>    COL1 COL2 COL3
#> 1    G1    1    6
#> 3    G1    3    7
#> 4    G1    4    9
#> 7    G1    7    6
#> 9    G1    9    7
#> 12   G1   12    8
#> 13   G1   13    7

Or only keep rows that don't match their own lag:

df[df$COL3 != dplyr::lag(df$COL3, default = pi),]
#>    COL1 COL2 COL3
#> 1    G1    1    6
#> 3    G1    3    7
#> 4    G1    4    9
#> 7    G1    7    6
#> 9    G1    9    7
#> 12   G1   12    8
#> 13   G1   13    7

CodePudding user response：

Another possible solution, based on dplyr:

library(dplyr)

df %>% 
  filter(COL3 != lead(COL3, default = Inf))

#>   COL1 COL2 COL3
#> 1   G1    2    6
#> 2   G1    3    7
#> 3   G1    6    9
#> 4   G1    8    6
#> 5   G1   11    7
#> 6   G1   12    8
#> 7   G1   13    7

CodePudding user response：

using data.table,

temp[COL3!=lag(COL3,default = 0)]

output

   COL1 COL2 COL3
1:   G1    1    6
2:   G1    3    7
3:   G1    4    9
4:   G1    7    6
5:   G1    9    7
6:   G1   12    8
7:   G1   13    7