Home > Back-end >  Find and label groups of consecutive 0's in dataframe in R
Find and label groups of consecutive 0's in dataframe in R

Time:10-07

I have a dataframe that looks somewhat like this:

set.seed(0) 
    data <- data.table(id = as.factor(c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7))),
                 period = as.factor(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
                 industry = as.factor(c(rep("Finance", 3), "Culture", rep("Nutrition", 5), rep("Finance", 2), rep("Medicine", 7))),
                 present = rbinom(18, 1, prob = 0.3))

What I want to do is add an extra column where I count groups of 0's in the dataframe and label them.

The output I want should look like this:

set.seed(0)

  data.table(id = as.factor(c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7))),
             period = as.factor(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
             industry = as.factor(c(rep("Finance", 3), "Culture", rep("Nutrition", 5), rep("Finance", 2), rep("Medicine", 7))),
             present = rbinom(18, 1, prob = 0.3),
             present_count = c("1", "0_1", "0_1", "0_1", "1", "0_2", "1", "1", "0_3","0_3","0_3","0_3","0_3","0_3","0_3","1","0_4", "1"))

I know how to do this with a while loop (e.g. looping over the column and checking while the sum of values stays 0), but want to avoid doing so as the actual df is very large. Any tips?

CodePudding user response:

Here's a tidyverse solution

library(data.table)
library(dplyr)

data %>% 
  mutate(t = lag(present, default=TRUE) != present & present == 0, 
    c_p = cumsum(t), 
    present_count = ifelse(present == 0, paste0("0_",c_p), as.character(present)), 
    t = NULL, c_p = NULL)
      id period  industry present present_count
 1: C001      1   Finance       1             1
 2: C001      2   Finance       0           0_1
 3: C001      3   Finance       0           0_1
 4: C002      2   Culture       0           0_1
 5: C003      1 Nutrition       1             1
 6: C003      4 Nutrition       0           0_2
 7: C003      5 Nutrition       1             1
 8: C003      6 Nutrition       1             1
 9: C003     10 Nutrition       0           0_3
10: C004      3   Finance       0           0_3
11: C004      4   Finance       0           0_3
12: C005      2  Medicine       0           0_3
13: C005      3  Medicine       0           0_3
14: C005      4  Medicine       0           0_3
15: C005      7  Medicine       0           0_3
16: C005      8  Medicine       1             1
17: C005      9  Medicine       0           0_4
18: C005     10  Medicine       1             1

It checks the boundary of the groups and counts them.

A data.table solution based on the same principle.

data[, present_count := ifelse(present == 0, 
  paste0("0_", 
    cumsum(shift(present, n=1, fill=1, type="lag") != present & present == 0)), 
    present)]
data
      id period  industry present present_count
 1: C001      1   Finance       1             1
 2: C001      2   Finance       0           0_1
 3: C001      3   Finance       0           0_1
 4: C002      2   Culture       0           0_1
 5: C003      1 Nutrition       1             1
 6: C003      4 Nutrition       0           0_2
 7: C003      5 Nutrition       1             1
 8: C003      6 Nutrition       1             1
 9: C003     10 Nutrition       0           0_3
10: C004      3   Finance       0           0_3
11: C004      4   Finance       0           0_3
12: C005      2  Medicine       0           0_3
13: C005      3  Medicine       0           0_3
14: C005      4  Medicine       0           0_3
15: C005      7  Medicine       0           0_3
16: C005      8  Medicine       1             1
17: C005      9  Medicine       0           0_4
18: C005     10  Medicine       1             1
  • Related