Home > database >  Count groups in a column based on specific rules in R?
Count groups in a column based on specific rules in R?

Time:01-01

I'm trying to count groups in a column of data... but the count is based on some specific rules. In my column I have variable names (which I call values throughout this post) and NAs. Below are the rules and an example to illustrate what Im trying to do.

Rule 1: If 2 values appear beside each other (row-wise), then increase the count by 1.

Rule 2: If a selected value within the column of data has an NA and then a value in the 2 rows preceding it, then only increase the count by 1 if the total number of preceding values (not including NAs) is odd.

I know that can sound confusing, so hopefully my example will explain more clearly.

For example, if I have a data frame that looks like this:

library(dplyr)
df <- tibble(name = c("A", NA, "B", "C", NA, NA, NA,
                          "A", NA, "B", NA, NA,
                          "A", "B", "C", NA, NA, NA, "D", NA, "E", NA, NA),
                 num = c(1,1,1,1,1,1,1,
                         2,2,2,2,2,
                         3,3,3,3,3,3,3,3,3,3,3))
>df
# A tibble: 23 × 2
   name    num
   <chr> <dbl>
 1 A         1
 2 NA        1
 3 B         1
 4 C         1
 5 NA        1
 6 NA        1
 7 NA        1
 8 A         2
 9 NA        2
10 B         2
11 NA        2
12 NA        2
13 A         3
14 B         3
15 C         3
16 NA        3
17 NA        3
18 NA        3
19 D         3
20 NA        3
21 E         3
22 NA        3
23 NA        3

So, if we just look at the rows where num == 1 (i.e., df[1:7,]) and we apply our rules, we can see that nothing happens for rows 1 & 2. But on row 3 we can apply RULE 2. Because the 2 rows preceding B are NA and A. And since the number of values preceding B is odd (i.e., the only preceding value is A and we don't include NAs), we increase the count by 1.

Next we can see in rows 3 and 4 that we have two values beside each other (i.e., B, C) so we increase the count by 1.

So, when num == 1 we would get a count of 2.

Just for clarity, applying the same rules whennum == 3. Here, we can apply RULE 1 twice because we have A, B and B, C, so we increase the count by 2. Then for rows 16:19, nothing would happen and the count does not increase. Additionally, when we get to rows 19:21, we can see that the value E has an NA and D preceding it. So we apply RULE 2 and see if we increase the count. If we count all the values in this group that are above E, we can see that there is an even amount (i.e., A,B,C,D [excluding the NAs), so we do not increase the count.

So, when num == 3 we would get a count of 2.

I hope that makes things clear... my desired output would be something like:

    num count
  <dbl> <dbl>
1     1     2
2     2     1
3     3     2

Any suggestions as to how I could do this?

CodePudding user response:

Here is a dplyr solution.

library(dplyr)

should_increment <- function(x) {
  vals <- !is.na(x)
  l1_vals <- !is.na(lag(x))
  l2_vals <- !is.na(lag(x, 2L))
  vals & (l1_vals | (l2_vals & cumsum(vals) %% 2L == 0L))
}

df %>% group_by(num) %>% summarize(count = sum(should_increment(name)))

Output

# A tibble: 3 x 2
    num count
  <dbl> <int>
1     1     2
2     2     1
3     3     2

Simply speaking, the logic is (in pseudo-code)

if (current observation is not NA) {
  if (lag observation is not NA) {
    increase count by 1
  } else {
    if (lag 2 observation is not NA) {
      if (cumulative count of all non NA entries until the current one is even) {
        increase count by 1
      }
  }
}
  •  Tags:  
  • r
  • Related