For each day, I am trying to count how many names have both an A and a B classification on that day, and how many B classifications there are in total on that day.
I included an example dataset and writeup below. The question I am trying to answer is "What percentage of B had an associated A?", which will also answer "What percentage of B did NOT have an associated A?"
On 2022-01-01 John Doe had both an A and a B classification. Peter Parker also had a B classification, but no associated A. The output for that day should show 1 instance of A and B happening together and 2 instances of B happening.
Date <- c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-02", "2022-01-02")
Name <- c("John Doe","John Doe","Peter Parker","Bruce Wayne","Bruce Wayne","Lebron James", "Jane Doe")
Classification <- c("A","B","B", "B", "A", "B", "B")
df <- data.frame(Date,Name,Classification)
df
date_output <- c("2022-01-01", "2022-01-02")
b_and_a_output <- c(1,1)
daily_total_b_output <- c(2,3)
desired_output <- data.frame(date_output, b_and_a_output, daily_total_b_output)
desired_output
Answer:
There's probably a million ways to approach this, but here's the old crossprod
trick for calculating co-occurrence of values:
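library(dplyr)
df %>%
  group_by(Date) %>%
  summarise(
    # Name x Classification counts; crossprod gives the co-occurrence matrix
    tmp = list(crossprod(table(Name, Classification))),
    # off-diagonal cell: names with both A and B on this date
    a_and_b = tmp[[1]]["A","B"],
    # diagonal cell: total B on this date
    total_b = tmp[[1]]["B","B"]
  ) %>%
  select(-tmp)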
## A tibble: 2 x 3
#  Date       a_and_b total_b
#  <chr>        <dbl>   <dbl>
#1 2022-01-01       1       2
#2 2022-01-02       1       3
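To get from these counts to the percentage the question asks about, one alternative is to flag each Date/Name pair first and then summarise per day. This is only a sketch, not the crossprod approach above: the column names `has_a`, `has_b` and `pct_b_with_a` are made up for illustration, and it counts each name at most once per day, so it would differ from the crossprod result if the same name had several B rows on one date. It also does not error on a day with no A at all, whereas indexing the crossprod matrix by "A" would.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  Date = c("2022-01-01", "2022-01-01", "2022-01-01",
           "2022-01-02", "2022-01-02", "2022-01-02", "2022-01-02"),
  Name = c("John Doe", "John Doe", "Peter Parker",
           "Bruce Wayne", "Bruce Wayne", "Lebron James", "Jane Doe"),
  Classification = c("A", "B", "B", "B", "A", "B", "B")
)

result <- df %>%
  # one row per Date/Name, flagging whether that name had an A and/or a B
  group_by(Date, Name) %>%
  summarise(
    has_a = any(Classification == "A"),
    has_b = any(Classification == "B"),
    .groups = "drop"
  ) %>%
  # collapse to one row per Date
  group_by(Date) %>%
  summarise(
    a_and_b = sum(has_a & has_b),            # names with both A and B
    total_b = sum(has_b),                    # names with a B
    pct_b_with_a = 100 * a_and_b / total_b   # share of B with an associated A
  )

result
```

The share of B without an associated A is then `100 - pct_b_with_a`.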