R to calculate the unique count for Overlap Categories-CodePudding

I have been so frustrated to count the number of clients who made conversion or not in the 3 single categories (app, desktop, web) scenarios and the rest of the overlapping categories (app & Web, app & desktop, web & desktop, app & web & desktop) scenarios. Here is the sample dataset I am working on.

I could figure out the single category count with the aggregate and group_by function in r, however, I can't really figure out how to work on the overlap categories.

Really really thanks so much if someone could help me on this!!! Thanks!!!

df <- data.frame(list(ClientID = c("1", "1", "1", "2", "2", "3", "3" , "3" , "3" , "4" ),
                     device = c("App", "Web", "App", "Web", "Web", "App", "Desktop", "App", "App", "Web"),
                     conversion = c("0", "0", "0", "0", "1", "1", "0", "1", "0", "1")) )

Below is the desired outcome:

Scenario                With Conversion      Without Conversion 
App                          
Web                          
Desktop
App & Web
App & Desktop
Web & Desktop
App & Desktop & Web

CodePudding user response：

You can prepare a list of device combinations (i.e. all 7 possibilities), and then use setequal() and unique() in a helper function, like this

list of device combinations, using combn()

device_combinations = unlist(
  sapply(1:3, \(i) combn(c("App", "Web" ,"Desktop"), i, simplify = F)),
  recursive = F
)

Helper function, f (takes a list of devices, and indicates whether the unique list of devices meets each of the device combinations (it can meet at max one only, by using unique() and setequal()

f <- function(d) sapply(device_combinations, \(ds) setequal(ds, unique(d)))

Now apply the function by conversion and ClientID, unnest, count by conversion and Scenario and pivot to your desired wide format

df %>% 
  group_by(conversion,ClientID) %>%
  summarize(v = f(device), .groups="drop") %>%
  unnest(v) %>% 
  mutate(
    Scenario=rep(sapply(device_combinations,paste,collapse=","),length.out=n()),
    conversion =if_else(conversion=="0", "Without Conversion", "With Conversion")
  ) %>% 
  group_by(Scenario, conversion) %>% 
  summarize(ct = sum(v), .groups="drop") %>% 
  pivot_wider(id_cols = Scenario, names_from = conversion,values_from = ct)

Output:

  Scenario        `With Conversion` `Without Conversion`
  <chr>                       <int>                <int>
1 App                             1                    0
2 App,Desktop                     0                    1
3 App,Web                         0                    1
4 App,Web,Desktop                 0                    0
5 Desktop                         0                    0
6 Web                             2                    1
7 Web,Desktop                     0                    0

CodePudding user response：

I thought I would add my answer in addition to the one that's already here.

Updated based on your comment

This doesn't look all that neat, but it does do what you're expecting. Although the other answer looks far better.

I started by collecting the count of unique customers by conversion by device.

library(tidyverse)

dfn <- map(c("0", "1"),
           function(k) {
             with(df[df$conversion == k,], table(device, ClientID)) %>% 
               as.data.frame() %>% filter(Freq > 0) %>% select(-Freq) %>% 
               distinct() %>% group_by(device) %>% summarise(cnt = n())
             }
           )

dfa <- data.frame(device = unique(df$device)) %>% 
  left_join(., dfn[[1]]) %>% setNames(., c("device", "0")) %>% 
  left_join(., dfn[[2]])
names(dfa)[3] <- "1"
dfa[is.na(dfa)] <- 0
dfa
#    device 0 1
# 1     App 2 1
# 2     Web 2 2
# 3 Desktop 1 0

Then I wanted the combinations. There are only three here, so I could probably write the combinations faster than coding them. However, I've provided a more dynamic approach.

dbls = RcppAlgos::comboGeneral(dfa$device, 2)
#      [,1]    [,2]   
# [1,] App     Desktop
# [2,] App     Web    
# [3,] Desktop Web    
# Levels: App Desktop Web

Now I'll use data frame df to calculate the counts for the combinations determined in dbls.

dfb <- map2_dfr(rep(1:nrow(dbls), 2), rep(c("0", "1"), 3), 
    function(x, y){
      gimme = with(df[df$conversion == y, ], table(device, ClientID)) %>% 
        as.data.frame() %>% 
        filter(Freq > 0, device %in% c(dbls[x, 1], dbls[x, 2])) %>% 
        select(-Freq) %>% distinct() %>% group_by(ClientID) %>% 
        mutate(cnt = n()) %>% filter(cnt == 2)
      if(nrow(gimme) == 0){
        c(device = paste0(dbls[x, 1], " & ", dbls[x, 2]),
          wh = y, cnt = 0)
      } else {
        c(device = paste0(dbls[x, 1], " & ", dbls[x, 2]),
          wh = y, cnt = length(unique(gimme$ClientID)))
      }
    }) %>% pivot_wider(names_from = "wh", values_from = "cnt")
# # A tibble: 3 × 3
#   device        `0`   `1`  
#   <chr>         <chr> <chr>
# 1 App & Desktop 1     0    
# 2 App & Web     1     0    
# 3 Desktop & Web 0     0

Last, but not least, I combined the two frames.

I don't know which conversion (0 or 1) is with or without, so I just left the markers.

rbind(dfa, dfb)
#          device 0 1
# 1           App 2 1
# 2           Web 2 2
# 3       Desktop 1 0
# 4 App & Desktop 1 0
# 5     App & Web 1 0
# 6 Desktop & Web 0 0