I have this data (original data has more rows) below :
structure(list(ID = 1:41, X1 = c(1921498, 2519663, 2519663, 107388,
2519663, 52211, 2519663, 62831, 62831, 62831, 62831, 62831, 62831,
62831, 4225203351, 4225203351, 4225203351, 4225203351, 4225203351,
4225203351, 4225203351, 171231, 171231, 171231, 183111, 171231,
190461, 190461, 190461, 190461, 190461, 190461, 183041, 190461,
191151, 210321, 210321, 210321, 210321, 211051, 211051)), class = "data.frame", row.names = c(NA,
-41L))
I'm trying to find a way to detect where a real change has happened. for example, I know in rows 4 and 6 no change happened because I went back to the value that I used often which is "2519663" then a real change happened in row 8 because I started using the value "62831" often. Then another change happened in row 15 because I started using the value "4225203351" and another one in row 22 because I started using the value "171231" but in row 25 I know no change happened because I used it the value "183111" one time and went back to "171231" again. and The change happened in row 27. another real change happened in row 36 because I don't use the value "190461" anymore. and the last change in my vector is in row 40:
My final result should look like this:
Thanks in advance !
CodePudding user response:
Here is a base R way. It marks as changed if there is a difference in values in more than 2 consecutive values.
df1 <-
structure(list(
ID = 1:41,
X1 = c(1921498, 2519663, 2519663, 107388,
2519663, 52211, 2519663, 62831, 62831, 62831, 62831, 62831, 62831,
62831, 4225203351, 4225203351, 4225203351, 4225203351, 4225203351,
4225203351, 4225203351, 171231, 171231, 171231, 183111, 171231,
190461, 190461, 190461, 190461, 190461, 190461, 183041, 190461,
191151, 210321, 210321, 210321, 210321, 211051, 211051)),
class = "data.frame", row.names = c(NA, -41L))
changes <- function(x, col, newcol, thresh = 2L) {
r <- rle(x[[col]])
i <- r$lengths > thresh
r$values[!i] <- "no change"
rr <- inverse.rle(r)
rr <- as.integer(factor(rr))
j <- c(0, diff(rr)) != 0
x[[newcol]] <- "no change"
x[[newcol]][j] <- "change"
x
}
changes(df1, "X1", "var3")
#> ID X1 var3
#> 1 1 1921498 no change
#> 2 2 2519663 no change
#> 3 3 2519663 no change
#> 4 4 107388 no change
#> 5 5 2519663 no change
#> 6 6 52211 no change
#> 7 7 2519663 no change
#> 8 8 62831 change
#> 9 9 62831 no change
#> 10 10 62831 no change
#> 11 11 62831 no change
#> 12 12 62831 no change
#> 13 13 62831 no change
#> 14 14 62831 no change
#> 15 15 4225203351 change
#> 16 16 4225203351 no change
#> 17 17 4225203351 no change
#> 18 18 4225203351 no change
#> 19 19 4225203351 no change
#> 20 20 4225203351 no change
#> 21 21 4225203351 no change
#> 22 22 171231 change
#> 23 23 171231 no change
#> 24 24 171231 no change
#> 25 25 183111 change
#> 26 26 171231 no change
#> 27 27 190461 change
#> 28 28 190461 no change
#> 29 29 190461 no change
#> 30 30 190461 no change
#> 31 31 190461 no change
#> 32 32 190461 no change
#> 33 33 183041 change
#> 34 34 190461 no change
#> 35 35 191151 no change
#> 36 36 210321 change
#> 37 37 210321 no change
#> 38 38 210321 no change
#> 39 39 210321 no change
#> 40 40 211051 change
#> 41 41 211051 no change
Created on 2022-11-26 with reprex v2.0.2
CodePudding user response:
Here’s a tidyverse solution. It creates a helper column to track the current “real” value, then tests for change in that column.
library(dplyr)
library(tidyr)
dat %>%
mutate(cur_X1 = ifelse(
X1 == lead(X1) | X1 == lag(X1),
X1,
NA
)) %>%
fill(cur_X1, .direction = "downup") %>%
mutate(
var3 = replace_na(
ifelse(cur_X1 != lag(cur_X1), "Change", "No-change"),
"No-change"
),
cur_X1 = NULL
)
ID X1 var3
1 1 1921498 No-change
2 2 2519663 No-change
3 3 2519663 No-change
4 4 107388 No-change
5 5 2519663 No-change
6 6 52211 No-change
7 7 2519663 No-change
8 8 62831 Change
9 9 62831 No-change
10 10 62831 No-change
11 11 62831 No-change
12 12 62831 No-change
13 13 62831 No-change
14 14 62831 No-change
15 15 4225203351 Change
16 16 4225203351 No-change
17 17 4225203351 No-change
18 18 4225203351 No-change
19 19 4225203351 No-change
20 20 4225203351 No-change
21 21 4225203351 No-change
22 22 171231 Change
23 23 171231 No-change
24 24 171231 No-change
25 25 183111 No-change
26 26 171231 No-change
27 27 190461 Change
28 28 190461 No-change
29 29 190461 No-change
30 30 190461 No-change
31 31 190461 No-change
32 32 190461 No-change
33 33 183041 No-change
34 34 190461 No-change
35 35 191151 No-change
36 36 210321 Change
37 37 210321 No-change
38 38 210321 No-change
39 39 210321 No-change
40 40 211051 Change
41 41 211051 No-change
CodePudding user response:
I did not understand your concept of change objectively, but here is a method to determine if a change occurs in comparison to the previous value, which may be useful to you.
library(dplyr)
df %>%
mutate(
var2 = c(0,diff(X1)),
var2 = if_else(var2 == 0, "No-change","Change")
)
ID X1 var2
1 1 1921498 No-change
2 2 2519663 Change
3 3 2519663 No-change
4 4 107388 Change
5 5 2519663 Change
6 6 52211 Change
7 7 2519663 Change
8 8 62831 Change
9 9 62831 No-change
10 10 62831 No-change
11 11 62831 No-change
12 12 62831 No-change
13 13 62831 No-change
14 14 62831 No-change
15 15 4225203351 Change
16 16 4225203351 No-change
17 17 4225203351 No-change
18 18 4225203351 No-change
19 19 4225203351 No-change
20 20 4225203351 No-change
21 21 4225203351 No-change
22 22 171231 Change
23 23 171231 No-change
24 24 171231 No-change
25 25 183111 Change
26 26 171231 Change
27 27 190461 Change
28 28 190461 No-change
29 29 190461 No-change
30 30 190461 No-change
31 31 190461 No-change
32 32 190461 No-change
33 33 183041 Change
34 34 190461 Change
35 35 191151 Change
36 36 210321 Change
37 37 210321 No-change
38 38 210321 No-change
39 39 210321 No-change
40 40 211051 Change
41 41 211051 No-change