I've imported data from a CSV into RStudio using read.table. The data is of type "list" and looks like this:
Client | Goal1 | Goal2 | Time |
---|---|---|---|
123 | 0 | 1 | 9:00 |
123 | 1 | 0 | 9:15 |
234 | 1 | 0 | 9:12 |
234 | 0 | 1 | 9:30 |
I need to calculate the number of Clients that reached both Goal1 and Goal2, but Goal2 has to be reached after the Client reached Goal1. So, in this example, Client 123 reached Goal2 before Goal1 and doesn't count. Client 234 reached Goal2 after Goal1 and does count.
I made a summary like this:
scores %>%
  summarise(count_all = n_distinct(Client),
            count_goal_1 = uniqueN(Client[Goal1 > 0]),
            count_goal_2 = uniqueN(Client[Goal2 > 0]),
            count_overlap = uniqueN(Client[Goal1 > 0 & Goal2 > 0]),
            percentage_overlap = (count_overlap / count_goal_1) * 100
  )
but I don't know how to make this conditional.
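On the "list" type mentioned above: a data frame returned by read.table is stored internally as a list of columns, so typeof() reports "list" even though the object is a regular data frame. A minimal sketch to check this, assuming the imported object is called scores as in the summary (the inline text is only a stand-in for the CSV file):

```r
# Stand-in for the imported CSV; the real data would come from a file
scores <- read.table(header = TRUE, text = "Client Goal1 Goal2 Time
123 0 1 9:00
123 1 0 9:15")

typeof(scores)         # storage type: "list"
class(scores)          # "data.frame"
is.data.frame(scores)  # TRUE
```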
CodePudding user response:
We can filter with the desired logic on data grouped by Client, then summarise with n_distinct(). It is important to convert the Time column to an hour:minute time format first; we can do that with lubridate::hm().
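library(dplyr)
# d is the imported data frame with columns Client, Goal1, Goal2 and Time
d %>%
  mutate(Time = lubridate::hm(Time)) %>%                  # parse "9:15" as hours and minutes
  group_by(Client) %>%
  filter(any(Goal2 == 1 & Time > Time[Goal1 == 1])) %>%   # keep clients with a Goal2 later than Goal1
  ungroup() %>%
  summarise(n = n_distinct(Client))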
# A tibble: 1 × 1
      n
  <int>
1     1
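To illustrate why the conversion matters, here is a small sketch. The times vector is made up for illustration and is not taken from the question's data. Compared as text, "10:00" sorts before "9:15"; parsed with hm(), the comparison follows clock order.

```r
library(lubridate)

# Hypothetical times, chosen so that text order and clock order disagree
times <- c("9:15", "10:00")

# Character comparison is lexicographic: "1" sorts before "9"
text_says_earlier <- times[2] < times[1]

# hm() parses hours and minutes into a Period; compare on seconds
secs <- period_to_seconds(hm(times))
clock_says_earlier <- secs[2] < secs[1]

text_says_earlier   # TRUE, which is wrong as a clock comparison
clock_says_earlier  # FALSE
```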
CodePudding user response:
There are a few key things here:
- pivot_longer to get the different Goals into a single column.
- convert Time into an actual time format so you can calculate which was earlier.
library(tidyverse)
library(lubridate) # for hm(); tidyverse versions before 2.0.0 do not attach lubridate
d <-
  read.table(header = TRUE,
             text = "Client Goal1 Goal2 Time
123 0 1 9:00
123 1 0 9:15
234 1 0 9:12
234 0 1 9:30")
d %>%
  pivot_longer(
    starts_with("Goal"),
    names_to = "Goal",
    values_to = "is_goal",
    names_prefix = "Goal"
  ) %>%
  mutate(n_clients = length(unique(Client))) %>% # to keep for later as denominator of percentage
  mutate(Goal = as.integer(Goal)) %>% # turn to numeric so you can assess who got both
  filter(is_goal > 0) %>% # remove empty entries
  mutate(Time = hm(Time)) %>% # convert to time to calculate what was first
  group_by(Client) %>% # operate per-client
  filter(sum(Goal) == 3) %>% # remove clients who didn't achieve both goals (assumes at most one row per goal per client)
  mutate(in_order = Time[Goal == 1] < Time[Goal == 2]) %>% # score whether goal 2 was after 1
  ungroup() %>%
  filter(in_order) %>% # remove clients who were not in order
  distinct(Client, n_clients) %>%
  summarise(percentage = 100 * n() / first(n_clients)) # summarize as a single percentage
#> # A tibble: 1 x 1
#>   percentage
#>        <dbl>
#> 1         50
Created on 2021-12-28 by the reprex package (v0.3.0)
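The per-client comparison of Goal1 and Goal2 times relies on each client having at most one row per goal. If a client can have several rows per goal, one possible variant (a sketch under that assumption, not part of the answer above) is to compare each client's earliest Goal1 time with their latest Goal2 time. The names secs, first_goal1, last_goal2 and in_order are introduced here for illustration only; percentage_overlap follows the question's definition, with clients who reached Goal1 as the denominator.

```r
library(dplyr)
library(lubridate)

d <- read.table(header = TRUE, stringsAsFactors = FALSE,
                text = "Client Goal1 Goal2 Time
123 0 1 9:00
123 1 0 9:15
234 1 0 9:12
234 0 1 9:30")

per_client <- d %>%
  mutate(secs = period_to_seconds(hm(Time))) %>%   # "9:15" becomes 33300 seconds
  group_by(Client) %>%
  summarise(
    # NA when the client never reached the goal
    first_goal1 = if (any(Goal1 > 0)) min(secs[Goal1 > 0]) else NA_real_,
    last_goal2  = if (any(Goal2 > 0)) max(secs[Goal2 > 0]) else NA_real_
  ) %>%
  # TRUE only when some Goal2 row is later than the earliest Goal1 row
  mutate(in_order = !is.na(first_goal1) & !is.na(last_goal2) &
           last_goal2 > first_goal1)

result <- per_client %>%
  summarise(
    count_all          = n(),
    count_goal_1       = sum(!is.na(first_goal1)),
    count_overlap      = sum(in_order),
    percentage_overlap = 100 * count_overlap / count_goal_1
  )

result
```

For the four sample rows this counts one client (234) as in order, which gives a percentage_overlap of 50.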