I have a dataframe with multiple rows representing repeated measures of the same individuals and then I am having a variable that represents some sequences coded from 1-5.
I want to find out how many times (number of events) across all individuals number 1, 2 and 3 appear after 4 and 5 in the P1 variable.
As a final product I want to have a dataframe where it will have as variables the category 4 and 5 and as rows the categories 1, 2 and 3 and it will show how many events category 4 had until it gets interrupted by either 1, 2 and 3 category and how many events category 5 had until it gets interrupted by either 1, 2 or 3 category.
Here is a small subset of my dataset
df <- data.frame(ID = as.factor(c(1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,5,5,5,5,6,6,6,6,6,6)),
P1 = c(1,5,2,4,1,5,2,4,2,5,1,4,2,4,1,5,3,5,1,4,2,5,1,4,1))
What I want is either a dataframe or some sort of results like the following meaning that after number 4 there were 3 events of number "1", 3 events of number "2" and 0 events of number "3".
Category_4 Category_5
1 3 2
2 3 2
3 0 1
Is there a way to do it with dplyr?
CodePudding user response:
If we are looking for the adjacent element that matches, then get the lag
and do a count
, and reshape to 'wide' with pivot_wider
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(P2 = lag(P1)) %>%
filter(P1 %in% c(4, 5)) %>%
count(P1 = str_c("Category_", P1), P2) %>%
pivot_wider(names_from = P1, values_from = n, values_fill = 0)
-output
# A tibble: 3 × 3
P2 Category_4 Category_5
<dbl> <int> <int>
1 1 3 3
2 2 3 2
3 3 0 1