I have a datafram such as
COL1 COL2 COL3
G1 1 6
G1 2 6
G1 3 7
G1 4 9
G1 5 9
G1 6 9
G1 7 6
G1 8 6
G1 9 7
G1 10 7
G1 11 7
G1 12 8
G1 13 7
and I would like to remove duplicate of consecutive COL3
values and keep the first.
here I should then get:
COL1 COL2 COL3
G1 1 6
G1 3 7
G1 4 9
G1 7 6
G1 9 7
G1 12 8
G1 13 7
Here is the dput format if it can helps:
structure(list(COL1 = c("G1", "G1", "G1", "G1", "G1", "G1", "G1",
"G1", "G1", "G1", "G1", "G1", "G1"), COL2 = 1:13, COL3 = c(6L,
6L, 7L, 9L, 9L, 9L, 6L, 6L, 7L, 7L, 7L, 8L, 7L)), class = "data.frame", row.names = c(NA,
-13L))
CodePudding user response:
In base R, you can use run-length encoding (rle
):
df[nrow(df) - rev(cumsum(rle(rev(df$COL3))$lengths) - 1),]
#> COL1 COL2 COL3
#> 1 G1 1 6
#> 3 G1 3 7
#> 4 G1 4 9
#> 7 G1 7 6
#> 9 G1 9 7
#> 12 G1 12 8
#> 13 G1 13 7
Or rleid
from data.table
df[c(TRUE, diff(data.table::rleid(df$COL3)) == 1),]
#> COL1 COL2 COL3
#> 1 G1 1 6
#> 3 G1 3 7
#> 4 G1 4 9
#> 7 G1 7 6
#> 9 G1 9 7
#> 12 G1 12 8
#> 13 G1 13 7
Or only keep rows that don't match their own lag:
df[df$COL3 != dplyr::lag(df$COL3, default = pi),]
#> COL1 COL2 COL3
#> 1 G1 1 6
#> 3 G1 3 7
#> 4 G1 4 9
#> 7 G1 7 6
#> 9 G1 9 7
#> 12 G1 12 8
#> 13 G1 13 7
CodePudding user response:
Another possible solution, based on dplyr
:
library(dplyr)
df %>%
filter(COL3 != lead(COL3, default = Inf))
#> COL1 COL2 COL3
#> 1 G1 2 6
#> 2 G1 3 7
#> 3 G1 6 9
#> 4 G1 8 6
#> 5 G1 11 7
#> 6 G1 12 8
#> 7 G1 13 7
CodePudding user response:
using data.table
,
temp[COL3!=lag(COL3,default = 0)]
output
COL1 COL2 COL3
1: G1 1 6
2: G1 3 7
3: G1 4 9
4: G1 7 6
5: G1 9 7
6: G1 12 8
7: G1 13 7