Track the last change that occurred across rows of a data frame in R

Suppose I have a data frame that tracks changes in value across rows. If the value increases, the change is 'positive'. If the value decreases, the change is 'negative'. Otherwise, the change column says 'no change'.

df = data.frame(sequence = 1:10, value = c(1,1,1, 2, 2,2,2,1,1,1), change = c('no change', 'no change', 'no change', 'positive', 'no change', 'no change', 'no change', 'negative', 'no change','no change'))
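For reference, the change column above does not have to be typed by hand. A minimal base R sketch of the rule described in the question follows. Treating the first row as 'no change' is an assumption that matches the sample data, and the names step and labels are only illustrative.

```r
value <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1)

# Difference from the previous row. The first row has no predecessor,
# so it is given a difference of 0 (assumption: first row is 'no change').
step <- c(0, diff(value))

# sign() maps each difference to -1, 0 or 1; adding 2 turns that into
# an index 1..3 into the label vector.
labels <- c("negative", "no change", "positive")
change <- labels[sign(step) + 2]

df <- data.frame(sequence = seq_along(value), value = value, change = change)
df
```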

I want to create a new column that tracks the nature of the last change that occurred in the value column. In this case, the final data frame should look like this:

df = data.frame(sequence = 1:10, value = c(1,1,1, 2, 2,2,2,1,1,1), change = c('no change', 'no change', 'no change', 'positive', 'no change', 'no change', 'no change', 'negative', 'no change','no change'), last_change = c('no change', 'no change', 'no change', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative'))

How can I go about doing this? This is sample data - my actual data contains 3 million rows. Any help would be much appreciated.

CodePudding user response：

You can do this by leveraging data.table::rleid, which assigns one id to each run of consecutive equal values:

library(data.table)

setDT(df)[, last_change:=first(change), rleid(value)]

Output:

    sequence value    change last_change
       <int> <num>    <char>      <char>
 1:        1     1 no change   no change
 2:        2     1 no change   no change
 3:        3     1 no change   no change
 4:        4     2  positive    positive
 5:        5     2 no change    positive
 6:        6     2 no change    positive
 7:        7     2 no change    positive
 8:        8     1  negative    negative
 9:        9     1 no change    negative
10:       10     1 no change    negative
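To see why grouping by rleid(value) works, it may help to look at the run ids on their own. The sketch below uses the sample vectors directly; run_id and first_per_run are illustrative names, and the last two lines are only a plain-vector imitation of what the grouped := does, not how data.table implements it.

```r
library(data.table)

value  <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1)
change <- c("no change", "no change", "no change", "positive", "no change",
            "no change", "no change", "negative", "no change", "no change")

# rleid() gives every run of consecutive equal values its own integer id.
run_id <- rleid(value)
run_id
# [1] 1 1 1 2 2 2 2 3 3 3

# The first `change` in each run is the change that started the run.
first_per_run <- change[!duplicated(run_id)]

# Repeating it across the run reproduces the last_change column.
last_change <- first_per_run[run_id]
last_change
```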

CodePudding user response：

A dplyr one-liner, without grouping (faster, see benchmark):

library(dplyr)
df |>
  mutate(last_change = change[x <- lag(value, default = 0) != value][cumsum(x)])

output

   sequence value    change last_change
1         1     1 no change   no change
2         2     1 no change   no change
3         3     1 no change   no change
4         4     2  positive    positive
5         5     2 no change    positive
6         6     2 no change    positive
7         7     2 no change    positive
8         8     1  negative    negative
9         9     1 no change    negative
10       10     1 no change    negative

Since the question implies that this will be run on a large data set, I benchmarked the two solutions on a data frame with 1,000,000 rows:

library(dplyr)
library(data.table)
library(microbenchmark)

# 100,000 copies of the 10-row sample = 1,000,000 rows
df2 <- do.call("rbind", replicate(100000, df, simplify = FALSE))

microbenchmark(
  f1 = mutate(df2,
              last_change = change[x <- lag(value, default = 0) != value][cumsum(x)]),
  f2 = setDT(df2)[, last_change := first(change), rleid(value)]
)
Unit: milliseconds
 expr      min       lq      mean   median         uq       max neval
   f1  27.2198  34.2597  47.15247  39.8145   49.72065  155.3829   100
   f2 767.8063 908.2213 996.29603 964.0844 1019.84905 1595.0273   100

The first function (the one proposed here) is much faster, about 24 times at the median (roughly 40 ms versus 964 ms). I am guessing that is because it does not perform any grouping.
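Before reading too much into timings, it is worth checking that both approaches return the same column. The sketch below does that on the sample data. One thing to keep in mind is that setDT() and := modify their input by reference, so this check uses as.data.table() to work on a copy; the names res_dplyr, res_dt and same_result are illustrative.

```r
library(dplyr)
library(data.table)

df <- data.frame(
  sequence = 1:10,
  value = c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1),
  change = c("no change", "no change", "no change", "positive", "no change",
             "no change", "no change", "negative", "no change", "no change")
)

# dplyr version. dplyr::lag is spelled out so that stats::lag is never picked up.
res_dplyr <- df |>
  mutate(last_change = change[x <- dplyr::lag(value, default = 0) != value][cumsum(x)])

# data.table version, run on a copy so that df itself is left untouched.
res_dt <- as.data.table(df)[, last_change := first(change), by = rleid(value)]

# Compare the two result columns.
same_result <- identical(res_dplyr$last_change, res_dt$last_change)
same_result
```

The same idea applies to the benchmark: benchmarking each expression on its own copy of the data keeps one solution from changing the input of the other.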

Explanation

x <- lag(value, default = 0) != value gives a logical vector (length 10) that is TRUE wherever the value differs from the previous row. Here lag is dplyr::lag, so dplyr must be attached, and default = 0 makes the first row count as a change as long as the first value is not 0. change[x] then picks the values of change at those rows (length 3). Finally, change[x][cumsum(x)] expands them back to length 10: cumsum(x) is the running count of changes, which for each row is the position, within change[x], of the most recent change.

(x <- with(df, lag(value, default = 0) != value))
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

with(df, cumsum(x))
# [1] 1 1 1 2 2 2 2 3 3 3

with(df, change[x])
#[1] "no change" "positive"  "negative" 

with(df, change[x][cumsum(x)])
#[1] "no change" "no change" "no change" "positive"  "positive"  "positive"  "positive" "negative"  "negative"  "negative"
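One caveat with default = 0: if the first value happens to be 0, the first element of x is FALSE, cumsum(x) starts at 0, and the result comes out shorter than the data. A base R sketch that avoids the sentinel is below. last_change_by_run is a hypothetical helper name, and the sketch assumes value contains no NA.

```r
# Hypothetical helper, base R only; assumes `value` has no missing values.
last_change_by_run <- function(value, change) {
  n <- length(value)
  if (n == 0L) return(change)
  # TRUE where a new run of `value` starts. The first row always starts a run,
  # so no sentinel such as default = 0 is needed.
  starts <- c(TRUE, value[-1L] != value[-n])
  # Same indexing trick as above: pick the change at each run start,
  # then repeat it using the running count of run starts.
  change[starts][cumsum(starts)]
}

# Works even when the first value is 0.
res <- last_change_by_run(
  c(0, 0, 1, 1, 0),
  c("no change", "no change", "positive", "no change", "negative")
)
res
# [1] "no change" "no change" "positive"  "positive"  "negative"
```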

CodePudding user response：

Using base R with rle

df$last_change <- with(df, ave(change, with(rle(value), 
    rep(seq_along(values), lengths)), FUN = function(x) x[1]))

-output

> df
   sequence value    change last_change
1         1     1 no change   no change
2         2     1 no change   no change
3         3     1 no change   no change
4         4     2  positive    positive
5         5     2 no change    positive
6         6     2 no change    positive
7         7     2 no change    positive
8         8     1  negative    negative
9         9     1 no change    negative
10       10     1 no change    negative
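The pieces of the rle() call can be inspected one at a time, which makes the grouping vector passed to ave() easier to follow. The last two lines are an alternative sketch that skips ave() and repeats the first change of each run directly; runs, grp, starts and last_change_rle are illustrative names.

```r
value  <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1)
change <- c("no change", "no change", "no change", "positive", "no change",
            "no change", "no change", "negative", "no change", "no change")

# rle() compresses the vector into runs: the value of each run and its length.
runs <- rle(value)
runs$values
# [1] 1 2 1
runs$lengths
# [1] 3 4 3

# Expanding the run numbers back out gives the grouping vector used by ave().
grp <- rep(seq_along(runs$values), runs$lengths)
grp
# [1] 1 1 1 2 2 2 2 3 3 3

# Alternative without ave(): row index where each run starts,
# then repeat the change found there over the whole run.
starts <- cumsum(c(1, head(runs$lengths, -1)))
last_change_rle <- rep(change[starts], runs$lengths)
last_change_rle
```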