Suppose I have a data frame that tracks changes in value across rows. If the value increases, the change is 'positive'. If the value decreases, the change is 'negative'. Otherwise, the change column says 'no change'.
df = data.frame(sequence = 1:10, value = c(1,1,1, 2, 2,2,2,1,1,1), change = c('no change', 'no change', 'no change', 'positive', 'no change', 'no change', 'no change', 'negative', 'no change','no change'))
I want to create a new column that tracks the nature of the last change that occurred in the value column. In this case, the final data frame should look like this:
df = data.frame(sequence = 1:10, value = c(1,1,1, 2, 2,2,2,1,1,1), change = c('no change', 'no change', 'no change', 'positive', 'no change', 'no change', 'no change', 'negative', 'no change','no change'), last_change = c('no change', 'no change', 'no change', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative'))
How can I go about doing this? This is sample data - my actual data contains 3 million rows. Any help would be much appreciated.
CodePudding user response:
You can do this by leveraging data.table::rleid
library(data.table)
setDT(df)[, last_change:=first(change), rleid(value)]
Output:
sequence value change last_change
<int> <num> <char> <char>
1: 1 1 no change no change
2: 2 1 no change no change
3: 3 1 no change no change
4: 4 2 positive positive
5: 5 2 no change positive
6: 6 2 no change positive
7: 7 2 no change positive
8: 8 1 negative negative
9: 9 1 no change negative
10: 10 1 no change negative
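To see why grouping by rleid(value) works, it may help to look at the run ids it produces for the sample data. This is only a small illustrative sketch; the variable name run_id is made up for the example.

```r
library(data.table)

# Sample values from the question
value <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1)

# rleid() assigns a new id each time the value differs from the previous row
run_id <- rleid(value)
run_id
# [1] 1 1 1 2 2 2 2 3 3 3
```

Each run of identical values gets its own id, so first(change) within a group picks the change recorded on the row where that run started.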
CodePudding user response:
A dplyr one-liner, without grouping (faster, see benchmark):
library(dplyr)
df |>
mutate(last_change = change[x <- lag(value, default = 0) != value][cumsum(x)])
output
sequence value change last_change
1 1 1 no change no change
2 2 1 no change no change
3 3 1 no change no change
4 4 2 positive positive
5 5 2 no change positive
6 6 2 no change positive
7 7 2 no change positive
8 8 1 negative negative
9 9 1 no change negative
10 10 1 no change negative
Since the question implies that this would be done on a large data set, I benchmarked the two solutions on a data frame with 1,000,000 rows:
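library(dplyr)
library(data.table)
library(microbenchmark)
# Stack 100,000 copies of the 10-row sample to get 1,000,000 rows
df2 <- do.call("rbind", replicate(100000, df, simplify = FALSE))
# Note: setDT() converts df2 by reference, so once f2 has run,
# f1 is timed on a data.table that already has a last_change column
microbenchmark(
  f1 = mutate(df2,
              last_change = change[x <- lag(value, default = 0) != value][cumsum(x)]),
  f2 = setDT(df2)[, last_change:=first(change), rleid(value)]
)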
Unit: milliseconds
expr min lq mean median uq max neval
f1 27.2198 34.2597 47.15247 39.8145 49.72065 155.3829 100
f2 767.8063 908.2213 996.29603 964.0844 1019.84905 1595.0273 100
The first function (the one proposed here) is much faster, roughly 24 times at the median (about 40 ms versus 964 ms), I am guessing because it does not perform any grouping.
Explanation
x <- lag(value, default = 0) != value gives a logical vector indicating whether each value differs from the previous one (length 10). change[x] then gives the values of change where x is TRUE (length 3), and change[x][cumsum(x)] repeats the values of change[x] according to the pattern of cumsum(x) (length 10), which is the index of the most recent changed value.
(x <- with(df, lag(value, default = 0) != value))
# [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
with(df, cumsum(x))
# [1] 1 1 1 2 2 2 2 3 3 3
with(df, change[x])
#[1] "no change" "positive" "negative"
with(df, change[x][cumsum(x)])
#[1] "no change" "no change" "no change" "positive" "positive" "positive" "positive" "negative" "negative" "negative"
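One caveat with default = 0: if the first value happens to be 0, x[1] is FALSE, cumsum(x) starts at 0, and R drops zero indices, so the result comes out shorter than the data. A minimal base R sketch of the same indexing idea that always treats row 1 as a run start is shown below. The helper name last_change_by_run is hypothetical, and the sketch assumes value contains no NAs and that both vectors have the same non-zero length.

```r
# Hypothetical helper (not from the answers above); assumes `value` has no NAs
# and that `value` and `change` have the same non-zero length.
last_change_by_run <- function(value, change) {
  # TRUE on the first row and wherever value differs from the previous row
  is_run_start <- c(TRUE, value[-1] != value[-length(value)])
  # Keep change at each run start, then repeat it across the whole run
  change[is_run_start][cumsum(is_run_start)]
}

# Example where the series starts at 0
value  <- c(0, 0, 0, 2, 2, 1)
change <- c("no change", "no change", "no change",
            "positive", "no change", "negative")
result <- last_change_by_run(value, change)
result
# [1] "no change" "no change" "no change" "positive"  "positive"  "negative"
```

Inside mutate() this could be called as mutate(df, last_change = last_change_by_run(value, change)), which also avoids relying on dplyr::lag.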
CodePudding user response:
Using base R with rle:
df$last_change <- with(df, ave(change, with(rle(value),
rep(seq_along(values), lengths)), FUN = function(x) x[1]))
-output
> df
sequence value change last_change
1 1 1 no change no change
2 2 1 no change no change
3 3 1 no change no change
4 4 2 positive positive
5 5 2 no change positive
6 6 2 no change positive
7 7 2 no change positive
8 8 1 negative negative
9 9 1 no change negative
10 10 1 no change negative