I would like to count the number of successive occurrences by two groups while ignoring missing values. Suppose I have a dataset with 3 identifiers, one identifies the country, another variable that identifies the person, another refers to time, and a dummy variable that identifies the occurences. Something like this:
df <- data.frame(country = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
person = c("a", "a", "a", "b", "b", "a", "a", "b", "b", "b","b"),
time = c(1,2,3,1,2,1,2,1,2,3,4),
ocurrences = c(1,1,NA,1,0,0,NA,1,1,0,1))
Which will look like:
country person time ocurrences
1 A a 1 1
2 A a 2 1
3 A a 3 NA
4 A b 1 1
5 A b 2 0
6 B a 1 0
7 B a 2 NA
8 B b 1 1
9 B b 2 1
10 B b 3 0
11 B b 4 1
I would like to generate another column with the number of consecutive occurrences by these two groups (country, person) across time. At this point, I am not sure what is the best way to deal with missing values, whether to account them as 0 or leave it as missings. For now, let's just say I will leave them as missing values. So the column should begin counting the number of successive ocurrences from the first 0 by person in a country. If the person has a 0 in between, it should stop computing and begin again on the next occurrence. So the output should look like this:
country person time ocurrences count
1 A a 1 1 1
2 A a 2 1 2
3 A a 3 NA NA
4 A b 1 1 1
5 A b 2 0 0
6 B a 1 0 0
7 B a 2 NA NA
8 B b 1 1 1
9 B b 2 1 2
10 B b 3 0 0
11 B b 4 1 1
I edited and added another row as an example. Notice that rows 8 and 9 the column counts the number of successive occurrences, but it stops in row 10 because there was no occurrence. Then it starts computing again in row 11. What is the best way to accomplish it?
CodePudding user response:
Well, in order to group and cumsum with your conditions, i must create first an auxiliary variable rleid (note that a function with the same name that do this can be found in package data.table
)
Then group by country,person, relid, then cumsum ocurrences. NA's are propagated naturally.
library(dplyr)
df <- data.frame(country = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
person = c("a", "a", "a", "b", "b", "a", "a", "b", "b", "b","b"),
time = c(1,2,3,1,2,1,2,1,2,3,4),
ocurrences = c(1,1,NA,1,0,0,NA,1,1,0,1))
df$rleid = with(rle(ifelse(df$ocurrences==0,NA,df$ocurrences)), rep(seq_along(values), lengths))
df <- mutate(group_by(df, country, person, rleid), count=if_else(ocurrences>0,cumsum(ocurrences),0))
df <- ungroup(df)
df$rleid<-NULL
df
#> # A tibble: 11 × 5
#> country person time ocurrences count
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 A a 1 1 1
#> 2 A a 2 1 2
#> 3 A a 3 NA NA
#> 4 A b 1 1 1
#> 5 A b 2 0 0
#> 6 B a 1 0 0
#> 7 B a 2 NA NA
#> 8 B b 1 1 1
#> 9 B b 2 1 2
#> 10 B b 3 0 0
#> 11 B b 4 1 1
See this for an explanation of rleid, which basically allows to create groups when data changes with respect to previous row.