In R, how do I keep the first single occurrence of a row based on a repeated value in one column?-CodePudding

I want to keep the row with the first occurrence of a changed value in a column (the last column in the example below). My dataframe is an xts object.

In the example below, I would keep the first row with a 2 in the last column, but not the next two because they are unchanged from the first 2. I'd then keep the next three rows (the sequence 323) because they change each time, and remove the next 4 because they didn't change, and so on. The final data frame would look like to smaller one below the original.

Any help is appreciated!

Original Dataframe

2007-01-31 2.72   4.75        2
2007-02-28 2.82   4.75        2
2007-03-31 2.85   4.75        2
2007-04-30 2.74   4.75        3
2007-05-31 2.46   4.75        2
2007-06-30 2.98   4.75        3
2007-07-31 4.19   4.75        3
2007-08-31 4.55   4.75        3
2007-09-30 4.20   4.75        3
2007-10-31 4.36   4.75        3
2007-11-30 5.75   4.76        4
2007-12-31 5.92   4.76        4
2008-01-31 6.95   4.87        4
2008-02-29 7.67   4.87        4
2008-03-31 8.21   4.90        4
2008-04-30 6.86   4.91        1
2008-05-31 6.53   5.07        1
2008-06-30 7.35   5.08        1
2008-07-31 8.00   5.13        4
2008-08-31 8.36   5.19        4

Final Dataframe

2007-01-31 2.72   4.75        2
2007-04-30 2.74   4.75        3
2007-05-31 2.46   4.75        2
2007-06-30 2.98   4.75        3
2007-11-30 5.75   4.76        4
2008-04-30 6.86   4.91        1
2008-07-31 8.00   5.13        4

CodePudding user response：

You can use data.table::shift to filter, plus the first row, in rbind

library(data.table)
rbind(setDT(dt)[1],dt[v3!=shift(v3)])

Or an equivalent approach using dplyr

library(dplyr)
bind_rows(dt[1,], filter(dt, v3!=lag(v3)))

Output:

         date    v1    v2    v3
       <IDat> <num> <num> <int>
1: 2007-01-31  2.72  4.75     2
2: 2007-04-30  2.74  4.75     3
3: 2007-05-31  2.46  4.75     2
4: 2007-06-30  2.98  4.75     3
5: 2007-11-30  5.75  4.76     4
6: 2008-04-30  6.86  4.91     1
7: 2008-07-31  8.00  5.13     4

CodePudding user response：

Here's another solution using run length encoding rle().

lens <- rle(df$V4)$lengths
df[cumsum(lens) - lens   1,]

Output:

           V1   V2   V3 V4
1  2007-01-31 2.72 4.75  2
4  2007-04-30 2.74 4.75  3
5  2007-05-31 2.46 4.75  2
6  2007-06-30 2.98 4.75  3
11 2007-11-30 5.75 4.76  4
16 2008-04-30 6.86 4.91  1
19 2008-07-31 8.00 5.13  4

CodePudding user response：

DATA

x <- "
2007-01-31 2.72   4.75        2
2007-02-28 2.82   4.75        2
2007-03-31 2.85   4.75        2
2007-04-30 2.74   4.75        3
2007-05-31 2.46   4.75        2
2007-06-30 2.98   4.75        3
2007-07-31 4.19   4.75        3
2007-08-31 4.55   4.75        3
2007-09-30 4.20   4.75        3
2007-10-31 4.36   4.75        3
2007-11-30 5.75   4.76        4
2007-12-31 5.92   4.76        4
2008-01-31 6.95   4.87        4
2008-02-29 7.67   4.87        4
2008-03-31 8.21   4.90        4
2008-04-30 6.86   4.91        1
2008-05-31 6.53   5.07        1
2008-06-30 7.35   5.08        1
2008-07-31 8.00   5.13        4
2008-08-31 8.36   5.19        4
"
df <- read.table(textConnection(x) , header = F)

and use this two lines

df$V5 <- c(1 ,diff(df$V4))
df[abs(df$V5) > 0 ,][1:4]

#>            V1   V2   V3 V4
#> 1  2007-01-31 2.72 4.75  2
#> 4  2007-04-30 2.74 4.75  3
#> 5  2007-05-31 2.46 4.75  2
#> 6  2007-06-30 2.98 4.75  3
#> 11 2007-11-30 5.75 4.76  4
#> 16 2008-04-30 6.86 4.91  1
#> 19 2008-07-31 8.00 5.13  4

^{Created on 2022-06-12 by the reprex package (v2.0.1)}