I'm trying to count groups in a column of data... but the count is based on some specific rules. In my column I have variable names (which I call values throughout this post) and NAs
. Below are the rules and an example to illustrate what Im trying to do.
Rule 1: If 2 values appear beside each other (row-wise), then increase the count by 1.
Rule 2: If a selected value within the column of data has an NA
and then a value in the 2 rows preceding it, then only increase the count by 1 if the total number of preceding values (not including NAs
) is odd.
I know that can sound confusing, so hopefully my example will explain more clearly.
For example, if I have a data frame that looks like this:
library(dplyr)
df <- tibble(name = c("A", NA, "B", "C", NA, NA, NA,
"A", NA, "B", NA, NA,
"A", "B", "C", NA, NA, NA, "D", NA, "E", NA, NA),
num = c(1,1,1,1,1,1,1,
2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3))
>df
# A tibble: 23 × 2
name num
<chr> <dbl>
1 A 1
2 NA 1
3 B 1
4 C 1
5 NA 1
6 NA 1
7 NA 1
8 A 2
9 NA 2
10 B 2
11 NA 2
12 NA 2
13 A 3
14 B 3
15 C 3
16 NA 3
17 NA 3
18 NA 3
19 D 3
20 NA 3
21 E 3
22 NA 3
23 NA 3
So, if we just look at the rows where num == 1
(i.e., df[1:7,]
) and we apply our rules, we can see that nothing happens for rows 1 & 2. But on row 3 we can apply RULE 2. Because the 2 rows preceding B
are NA
and A
. And since the number of values preceding B
is odd (i.e., the only preceding value is A
and we don't include NAs
), we increase the count by 1.
Next we can see in rows 3 and 4 that we have two values beside each other (i.e., B, C
) so we increase the count by 1.
So, when num == 1
we would get a count of 2.
Just for clarity, applying the same rules whennum == 3
. Here, we can apply RULE 1 twice because we have A, B
and B, C
, so we increase the count by 2. Then for rows 16:19, nothing would happen and the count does not increase. Additionally, when we get to rows 19:21, we can see that the value E
has an NA
and D
preceding it. So we apply RULE 2 and see if we increase the count. If we count all the values in this group that are above E
, we can see that there is an even amount (i.e., A,B,C,D
[excluding the NAs
), so we do not increase the count.
So, when num == 3
we would get a count of 2.
I hope that makes things clear... my desired output would be something like:
num count
<dbl> <dbl>
1 1 2
2 2 1
3 3 2
Any suggestions as to how I could do this?
CodePudding user response:
Here is a dplyr
solution.
library(dplyr)
should_increment <- function(x) {
vals <- !is.na(x)
l1_vals <- !is.na(lag(x))
l2_vals <- !is.na(lag(x, 2L))
vals & (l1_vals | (l2_vals & cumsum(vals) %% 2L == 0L))
}
df %>% group_by(num) %>% summarize(count = sum(should_increment(name)))
Output
# A tibble: 3 x 2
num count
<dbl> <int>
1 1 2
2 2 1
3 3 2
Simply speaking, the logic is (in pseudo-code)
if (current observation is not NA) {
if (lag observation is not NA) {
increase count by 1
} else {
if (lag 2 observation is not NA) {
if (cumulative count of all non NA entries until the current one is even) {
increase count by 1
}
}
}