I have a dataset that looks like this:
# A tibble: 987 × 2
time counts
<time> <chr>
1 07:33 1358
2 07:34 1072
3 07:35 112
4 07:36 316
5 07:37 0
6 07:38 16
7 07:39 32
8 07:40 0
9 07:41 0
10 07:42 92
# … with 977 more rows
I need to create a new variable that assigns a unique value to each consecutive run and restarts when counts == 0. Also, if a row with counts == 0 is followed by two consecutive rows with counts < 100, I want the counter to ignore those two rows and treat them as 0. The same goes for a single row with counts > 1000: if the row immediately after a counts == 0 row is greater than 1000, I want to treat it as 0.
So, these are the conditions for the counter:
Restarts with 0
Ignores rows (treats them as 0) if immediately after the row where counts == 0, two consecutive rows have counts <100 or one row has counts >1000.
I have to do this because I need to delete every run of 60 or more consecutive rows with counts == 0, BUT these runs may include two consecutive rows with counts < 100 or one row with counts > 1000.
I'm not sure if this makes sense. Hopefully, it does!
I have tried the following:
db %>%
  mutate(consec_id = rleid(counts == 0))
But I don't know how to tell R to ignore certain rows based on specific conditions.
I also tried using a loop, but I am very new to loops, so I didn't get what I wanted:
# Loop through the dataset and update the "wear_status" variable
for (i in 2:nrow(day1)) {
  # Check if current counts = 0
  if (db$counts[i] == 0) {
    consec_counts <- 0
    db$wear_status[i] <- 0
  } else {
    # Check if current counts are between 1 and 100
    if (db$counts[i] > 1 || db$counts[i] < 100) {
      consec_counts <- consec_counts + 1
    } else {
      consec_counts <- 0
    }
    # Check if current counts > 100 or 3 consecutive rows of counts = 1 or 100
    if (db$counts[i] > 100 || consec_counts <= 2) {
      consec_counts <- 0
      db$wear_status[i] <- 1
    }
  }
}
This is my expected output:
# A tibble: 987 × 2
time counts counter
<time> <chr>
1 07:33 1358 1
2 07:34 1072 1
3 07:35 112 1
4 07:36 316 1
5 07:37 0 2
6 07:38 16 2
7 07:39 32 2
8 07:40 0 2
9 07:41 0 2
10 07:42 92 3
11 07:43 80 3
12 07:44 78 3
13 07:45 0 4
# … with 977 more rows
Thank you!
Answer:
Here is something to try that does not use a loop.
There are 3 conditions under which a counts value should become zero:
- If the prior value is 0, the current value is < 100, the following value is < 100, and the one after that (2 values after the current value) is 0.
- If the value 2 positions back is 0, the prior value is < 100, the current value is < 100, and the following value is 0.
- If the prior value is 0, the current value is > 1000, and the following value is 0.
You can use lag and lead to look at prior and following values. If you pass a number, such as lag(counts, n = 2), the function looks 2 positions away (n is the number of positions to lag or lead by).
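To make the shifting concrete, here is a small sketch of lag and lead on a short vector. The vector x is made up for illustration only. Note that positions with no neighbour come back as NA, so comparisons at the very start or end of a column can evaluate to NA rather than TRUE or FALSE.

```r
library(dplyr)

# Illustrative vector (not from the question's data)
x <- c(0, 16, 32, 0)

lag(x)          # previous value:      NA  0 16 32
lead(x)         # next value:          16 32  0 NA
lag(x, n = 2)   # two positions back:  NA NA  0 16
lead(x, n = 2)  # two positions ahead: 32  0 NA NA
```

Both functions also take a default argument that replaces those edge NAs, which may be handy if the first or last rows matter in your data.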
library(tidyverse)
library(data.table)  # provides rleid()

df %>%
  # Set counts to 0 wherever one of the 3 conditions holds
  mutate(new_counts = ifelse(
    (lag(counts) == 0 & counts < 100 & lead(counts) < 100 & lead(counts, 2) == 0) |
      (lag(counts, 2) == 0 & lag(counts) < 100 & counts < 100 & lead(counts) == 0) |
      (lag(counts) == 0 & counts > 1000 & lead(counts) == 0),
    0,
    counts
  )) %>%
  # Start a new run ID every time new_counts switches between zero and non-zero
  mutate(counter = rleid(new_counts == 0))
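The question's tibble shows counts as <chr>, while the pipeline above compares it with numbers. The following is only a sketch of a variant that converts the column first and treats any NA produced by lag or lead at the edges as "condition not met". The toy db below is an assumption that mirrors the question's first 13 rows, and the temporary ignore column is a name introduced here for illustration.

```r
library(dplyr)

# Toy data mirroring the question, with counts stored as character (assumption)
db <- tibble(
  time   = sprintf("07:%02d", 33:45),
  counts = c("1358", "1072", "112", "316", "0", "16", "32",
             "0", "0", "92", "80", "78", "0")
)

res <- db %>%
  mutate(
    counts = as.integer(counts),  # compare numbers, not strings
    # Same 3 conditions as in the pipeline above
    ignore = (lag(counts) == 0 & counts < 100 & lead(counts) < 100 & lead(counts, 2) == 0) |
      (lag(counts, 2) == 0 & lag(counts) < 100 & counts < 100 & lead(counts) == 0) |
      (lag(counts) == 0 & counts > 1000 & lead(counts) == 0),
    # coalesce() turns an NA condition (edge rows) into FALSE
    new_counts = if_else(coalesce(ignore, FALSE), 0L, counts),
    counter = data.table::rleid(new_counts == 0)
  ) %>%
  select(-ignore)
```

On these 13 rows the variant should give the same new_counts and counter columns as the original pipeline; the difference only shows up with character input or when a condition evaluates to NA.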
Output
time counts new_counts counter
1 07:33 1358 1358 1
2 07:34 1072 1072 1
3 07:35 112 112 1
4 07:36 316 316 1
5 07:37 0 0 2
6 07:38 16 0 2
7 07:39 32 0 2
8 07:40 0 0 2
9 07:41 0 0 2
10 07:42 92 92 3
11 07:43 80 80 3
12 07:44 78 78 3
13 07:45 0 0 4
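The question's end goal is to delete runs of 60 or more consecutive zero rows. Once new_counts exists, one possible way to do that is to group by run and drop the long zero runs. The sketch below is not part of the approach above: drop_long_zero_runs and its min_run argument are hypothetical names, and the toy data uses a threshold of 5 only so the effect is visible on 13 rows.

```r
library(dplyr)

# Hypothetical helper: drop every run of zeros in `new_counts`
# that is at least `min_run` rows long (60 in the question).
drop_long_zero_runs <- function(data, min_run = 60) {
  data %>%
    mutate(run_id = data.table::rleid(new_counts == 0)) %>%
    group_by(run_id) %>%
    # Keep a run unless it is a zero run of length >= min_run
    filter(!(new_counts[1] == 0 & n() >= min_run)) %>%
    ungroup() %>%
    select(-run_id)
}

# Toy data matching the new_counts column shown in the output above
toy <- tibble(
  time       = sprintf("07:%02d", 33:45),
  new_counts = c(1358L, 1072L, 112L, 316L, 0L, 0L, 0L, 0L, 0L, 92L, 80L, 78L, 0L)
)

# With a toy threshold of 5, the zero run from 07:37 to 07:41 is removed,
# while the single zero at 07:45 is kept.
kept <- drop_long_zero_runs(toy, min_run = 5)
```

For the real data you would presumably call it with the default min_run = 60 on the result of the pipeline.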
Data
df <- structure(list(time = c("07:33", "07:34", "07:35", "07:36", "07:37",
"07:38", "07:39", "07:40", "07:41", "07:42", "07:43", "07:44",
"07:45"), counts = c(1358L, 1072L, 112L, 316L, 0L, 16L, 32L,
0L, 0L, 92L, 80L, 78L, 0L)), row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13"), class = "data.frame")