How do I count the number of rows that contain a value earlier than another row in R?

I've imported data from a CSV into RStudio using read.table. The data is of type "list" and looks like this:

Client	Goal1	Goal2	Time
123	0	1	9:00
123	1	0	9:15
234	1	0	9:12
234	0	1	9:30

I need to calculate the number of Clients that reached both Goal1 and Goal2, but Goal2 has to be reached after the Client reached Goal1. So, in this example, Client 123 reached Goal2 before Goal1 and doesn't count. Client 234 reached Goal2 after Goal1 and does count.

I made a summary like this:

scores %>% 
  summarise(count_all = n_distinct(Client), 
            count_goal_1 = uniqueN(Client[Goal1 > 0]), 
            count_goal_2 = uniqueN(Client[Goal2 > 0]), 
            count_overlap = uniqueN(Client[Goal1 > 0 & Goal2 > 0]),
            percentage_overlap = (count_overlap / count_goal_1)*100
  )

but I don't know how to make this conditional.

CodePudding user response：

We can filter with the desired logic on data grouped by Client, then summarise with n_distinct(). It is important to convert the Time column to an hour:minute time format first so that the times compare correctly; lubridate::hm() does that.

library(dplyr)

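d %>%
  mutate(Time = lubridate::hm(Time)) %>%  # parse "9:15" as hours and minutes
  group_by(Client) %>%
  # keep clients with a Goal2 row later than their Goal1 row
  # (assumes at most one Goal1 row per client)
  filter(any(Goal2 == 1 & Time > Time[Goal1 == 1])) %>%
  ungroup() %>%
  summarise(n = n_distinct(Client))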

# A tibble: 1 × 1
      n
  <int>
1     1
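If a client can have several Goal1 or Goal2 rows, comparing against Time[Goal1==1] no longer lines up one to one. Below is a minimal base R sketch of the same idea, assuming a client counts when any Goal2 row is later than their earliest Goal1 row (that reading of the rule is an assumption, and to_minutes is a hypothetical helper, not part of any package):

```r
# Example data from the question
d <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Client Goal1 Goal2 Time
123 0 1 9:00
123 1 0 9:15
234 1 0 9:12
234 0 1 9:30")

# Hypothetical helper: "H:MM" -> minutes since midnight
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

d$minutes <- to_minutes(d$Time)

# Per client: TRUE if some Goal2 time is later than the earliest Goal1 time
in_order <- vapply(split(d, d$Client), function(g) {
  t1 <- g$minutes[g$Goal1 > 0]
  t2 <- g$minutes[g$Goal2 > 0]
  length(t1) > 0 && length(t2) > 0 && any(t2 > min(t1))
}, logical(1))

n_in_order <- sum(in_order)
n_in_order
```

With the four example rows this counts one client (234), matching the dplyr result above.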

CodePudding user response：

There are a few key things here:

Use pivot_longer to get the different Goals into a single column.
Convert Time into an actual time format so you can calculate which was earlier.
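The second point matters because Time is read in as text, and text is compared character by character. A small sketch of the pitfall, using lubridate::hm() as in the answers (the example times are made up for illustration):

```r
library(lubridate)

# As plain strings, "9:30" sorts after "10:00" because "9" > "1"
"9:30" > "10:00"
#> [1] TRUE

# Parsed as periods, the comparison follows clock order
hm("9:30") > hm("10:00")
#> [1] FALSE

# period_to_seconds() gives plain numbers if a numeric column is easier to work with
period_to_seconds(hm("9:30"))
#> [1] 34200
```

The example data happens to contain only single-digit hours, so the string comparison would give the right answer there by accident.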

library(tidyverse)

d <-
  read.table(header = TRUE,
             text = "Client Goal1   Goal2   Time
                      123   0   1   9:00
                      123   1   0   9:15
                      234   1   0   9:12
                      234   0   1   9:30")

d %>%
  pivot_longer(
    starts_with("Goal"),
    names_to = "Goal",
    values_to = "is_goal",
    names_prefix = "Goal"
  ) %>%
  mutate(n_clients = length(unique(Client))) %>% # to keep for later as denominator of percentage
  mutate(Goal = as.integer(Goal)) %>% # turn to numeric so you can assess who got both
  filter(is_goal > 0) %>% # remove empty entries
  mutate(Time = lubridate::hm(Time)) %>% # convert to time to calculate what was first
  group_by(Client) %>% # operate per-client
  filter(sum(Goal) == 3) %>%  # remove clients who didn't achieve both goals
  mutate(in_order = Time[Goal == 1] < Time[Goal == 2]) %>% # score whether goal 2 was after 1
  ungroup() %>%
  filter(in_order) %>% # remove clients who were not in order
  distinct(Client, n_clients) %>%
  summarise(percentage = 100 * n() / first(n_clients)) # in-order clients as a percentage of all clients
#> # A tibble: 1 x 1
#>   percentage
#>        <dbl>
#> 1         50

Created on 2021-12-28 by the reprex package (v0.3.0)
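To keep the shape of the summary from the question, one option is to reduce the data to one row per client first and then count. This is only a sketch: the column names got_goal_1, got_goal_2, in_order and count_in_order are made up here, "in order" is assumed to mean that some Goal2 row is later than the client's earliest Goal1 row, and the percentage uses count_goal_1 as the denominator, as in the question.

```r
library(dplyr)

# Example data from the question
d <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Client Goal1 Goal2 Time
123 0 1 9:00
123 1 0 9:15
234 1 0 9:12
234 0 1 9:30")

result <- d %>%
  # seconds since midnight, so times compare as plain numbers
  mutate(secs = lubridate::period_to_seconds(lubridate::hm(Time))) %>%
  group_by(Client) %>%
  # one row per client: which goals were reached, and was Goal2 after Goal1?
  summarise(
    got_goal_1 = any(Goal1 > 0),
    got_goal_2 = any(Goal2 > 0),
    in_order   = got_goal_1 && got_goal_2 &&
      any(secs[Goal2 > 0] > min(secs[Goal1 > 0]))
  ) %>%
  # collapse the per-client flags into counts
  summarise(
    count_all          = n(),
    count_goal_1       = sum(got_goal_1),
    count_goal_2       = sum(got_goal_2),
    count_in_order     = sum(in_order),
    percentage_overlap = 100 * count_in_order / count_goal_1
  )

result
```

For the example data this gives count_in_order = 1 and percentage_overlap = 50.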