I have a dataframe that looks somewhat like this:
set.seed(0)
data <- data.table(id = as.factor(c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7))),
period = as.factor(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
industry = as.factor(c(rep("Finance", 3), "Culture", rep("Nutrition", 5), rep("Finance", 2), rep("Medicine", 7))),
present = rbinom(18, 1, prob = 0.3))
What I want to do is add an extra column where I count groups of 0's in the dataframe and label them.
The output I want should look like this:
set.seed(0)
data.table(id = as.factor(c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7))),
period = as.factor(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
industry = as.factor(c(rep("Finance", 3), "Culture", rep("Nutrition", 5), rep("Finance", 2), rep("Medicine", 7))),
present = rbinom(18, 1, prob = 0.3),
present_count = c("1", "0_1", "0_1", "0_1", "1", "0_2", "1", "1", "0_3","0_3","0_3","0_3","0_3","0_3","0_3","1","0_4", "1"))
I know how to do this with a while loop (e.g. looping over the column and checking while the sum of values stays 0), but want to avoid doing so as the actual df is very large. Any tips?
CodePudding user response:
Here's a tidyverse
solution
library(data.table)
library(dplyr)
data %>%
mutate(t = lag(present, default=TRUE) != present & present == 0,
c_p = cumsum(t),
present_count = ifelse(present == 0, paste0("0_",c_p), as.character(present)),
t = NULL, c_p = NULL)
id period industry present present_count
1: C001 1 Finance 1 1
2: C001 2 Finance 0 0_1
3: C001 3 Finance 0 0_1
4: C002 2 Culture 0 0_1
5: C003 1 Nutrition 1 1
6: C003 4 Nutrition 0 0_2
7: C003 5 Nutrition 1 1
8: C003 6 Nutrition 1 1
9: C003 10 Nutrition 0 0_3
10: C004 3 Finance 0 0_3
11: C004 4 Finance 0 0_3
12: C005 2 Medicine 0 0_3
13: C005 3 Medicine 0 0_3
14: C005 4 Medicine 0 0_3
15: C005 7 Medicine 0 0_3
16: C005 8 Medicine 1 1
17: C005 9 Medicine 0 0_4
18: C005 10 Medicine 1 1
It checks the boundary of the groups and counts them.
A data.table
solution based on the same principle.
data[, present_count := ifelse(present == 0,
paste0("0_",
cumsum(shift(present, n=1, fill=1, type="lag") != present & present == 0)),
present)]
data
id period industry present present_count
1: C001 1 Finance 1 1
2: C001 2 Finance 0 0_1
3: C001 3 Finance 0 0_1
4: C002 2 Culture 0 0_1
5: C003 1 Nutrition 1 1
6: C003 4 Nutrition 0 0_2
7: C003 5 Nutrition 1 1
8: C003 6 Nutrition 1 1
9: C003 10 Nutrition 0 0_3
10: C004 3 Finance 0 0_3
11: C004 4 Finance 0 0_3
12: C005 2 Medicine 0 0_3
13: C005 3 Medicine 0 0_3
14: C005 4 Medicine 0 0_3
15: C005 7 Medicine 0 0_3
16: C005 8 Medicine 1 1
17: C005 9 Medicine 0 0_4
18: C005 10 Medicine 1 1