Running total over time period by groups in R-CodePudding

I have a customer base dataset that consists of customer ids, their subscription, and unsubscription dates. The period is from 2016-2017 and one unique customer can subscribe again to the database after an unsubscription.

I would like to count the customer growth, taking into account how many unique customers have left and come, in a monthly period. However, I'm having trouble in terms of calculating the running total for unique ids over a monthly period.

Here's the dataset that I have:

structure(list(country = c("US", "US", "US", "US", "US", "US", 
"CH", "CH"), account_id = c(111L, 111L, 111L, 111L, 111L, 111L, 
123L, 123L), subscribed_at = c("2017-04-21 09:01:25", "2017-04-17 08:32:24", "2017-02-19 08:09:57", "2017-01-24 16:55:56", "2016-06-27 14:00:46", 
"2016-06-27 13:55:28", "2016-08-30 10:35:29", "2016-06-28 12:35:39"
), unsubscribed_at = c("2017-04-21 09:41:00", "2017-04-21 09:41:00", 
"2017-04-21 09:41:00", "2017-04-21 09:41:00", "2016-07-29 11:36:12", 
"2016-07-29 11:36:12", "2017-08-30 06:37:30", "2017-08-30 06:37:30"
)), class = "data.frame", row.names = c(NA, -8L))

country   account_id     subscribed_at     unsubscribed_at
1      US        111 2017-04-21 09:01:25 2017-04-21 09:41:00
2      US        111 2017-04-17 08:32:24 2017-04-21 09:41:00
3      US        111 2017-02-19 08:09:57 2017-04-21 09:41:00
4      US        111 2017-01-24 16:55:56 2017-04-21 09:41:00
5      US        111 2016-06-27 14:00:46 2016-07-29 11:36:12
6      US        111 2016-06-27 13:55:28 2016-07-29 11:36:12
7      CH        123 2016-08-30 10:35:29 2017-08-30 06:37:30
8      CH        123 2016-06-28 12:35:39 2017-08-30 06:37:30

And what I would like to achieve is a data frame consisting of the time period and how many users were there in the base before the given time period. So one unique id subscription is counted only once until the customer unsubscribes.

What I have tried is calculating aggregated counts for unique ids per month but that caused many duplicates since if the customer is in the base in 10/2016, then the count for 11/2016 would still be 1 even though there's another subscription from the same user in 11/2016 because it's still the same customer. Any suggestions on how should I calculate the problem?

CodePudding user response：

Let me know if this solution works. I pivot both subscription columns into the same column, numerically code them, make the rows distinct based on subscription, unsubscription, and resubscription, then finish with a cumulative sum.

library(tidyverse)
library(lubridate)

get_cumsum <- function(x) {
  country_name <- x[[1,2]]
  mutate(x, across(c(subscribed_at, unsubscribed_at), ymd_hm)) %>%
  pivot_longer(cols = c(subscribed_at, unsubscribed_at), names_to = "subscription", values_to = "date") %>%
  filter(!is.na(date)) %>%
  arrange(date) %>%
  mutate(subscription = as.integer(str_replace_all(subscription, c("unsubscribed_at" = "-1", "subscribed_at" = "1")))) %>%
  group_by(account_id) %>%
  mutate(already_unsubscribed_once = cumsum(subscription == -1)) %>%
  ungroup() %>%
  distinct(account_id, subscription, already_unsubscribed_once, .keep_all = TRUE) %>%
  mutate(running_total = cumsum(subscription)) %>%
  group_by(month = floor_date(date, 'month')) %>%
  summarise(running_total = running_total) %>%
  ungroup() %>%
  mutate(month = date(month)) %>%
  complete(month = seq.Date(from = min(month), to = max(month), by = "month")) %>%
  fill(running_total, .direction = "down") %>%
  group_by(month) %>%
  summarise(running_total = last(running_total)) %>%
  add_column(country = country_name)}

structure(list(X = 1:9, country_code = c("US", "US", "US", "US",  "US", "US", "CH", "CH", "CH"),
               account_id = c(111L, 111L, 111L,  111L, 111L, 111L, 123L, 123L, 123L),
               subscribed_at = c("2017-12-30 15:51",  "2017-06-08 15:55", "2017-04-15 11:03", "2016-12-24 15:53", "2016-09-06 20:10",  "2016-09-05 19:50", "2017-11-25 13:50", "2016-10-23 13:51", "2016-06-22 20:11" ),
               unsubscribed_at = c(NA, "2017-12-28 13:50", NA, NA, NA, NA,  NA, "2017-10-30 11:48", NA)), class = "data.frame", row.names = c(NA,  -9L)) %>%
  as_tibble() %>%
  group_by(country_code) %>%
  group_map(~get_cumsum(.), .keep = TRUE) %>%
  bind_rows()