What products are most often sold together? - Analysis exercise-CodePudding

I have been trying do an data analyst exercise in R with date about sales. The dataframe is like:

   Order_ID Product                 
      <dbl> <chr>                   
 1   319631 34in Ultrawide Monitor  
 2   319631 Lightning Charging Cable
 3   319596 iPhone                  
 4   319596 Lightning Charging Cable
 5   319584 iPhone                  
 6   319584 Wired Headphones        
 7   319556 Google Phone            
 8   319556 Wired Headphones

And I have to find which products are bought together most often, the Order_ID has duplicates, i.e. they were purchased by the same person.

I did this excercise in Python but I can't do it on R. My Python code was:

pares_compras[['Order ID', 'Product']]

>  Order ID Product
2   176560  Google Phone
3   176560  Wired Headphones
17  176574  Google Phone
18  176574  USB-C Charging Cable
29  176585  Bose SoundSport Headphones

pares_compras.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))

>
2                             Google Phone,Wired Headphones
3                             Google Phone,Wired Headphones
17                        Google Phone,USB-C Charging Cable
18                        Google Phone,USB-C Charging Cable

pares_compras = pares_compras[['Order ID', 'Grouped Products']].drop_duplicates()
pares_compras

>   Order ID    Grouped Products
2   176560  Google Phone,Wired Headphones
17  176574  Google Phone,USB-C Charging Cable
29  176585  Bose SoundSport Headphones,Bose SoundSport Hea...
31  176586  AAA Batteries (4-pack),Google Phone
118 176672  Lightning Charging Cable,USB-C Charging Cable

count = Counter()

for row in pares_compras['Grouped Products']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))
count

> Counter({('Google Phone', 'Wired Headphones'): 414,
         ('Google Phone', 'USB-C Charging Cable'): 987,
         ('Bose SoundSport Headphones', 'Bose SoundSport Headphones'): 27, ... )}

for key, num in count.most_common(5):
    print(key, num)
>
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361

So, with that I could solve the exercise but, like I said before, I can't do the same in R, I don't find the way, I've just started in R. If anyone can help me I would really appreciate it, thnaks.

CodePudding user response：

Hmm I think that a co-occurence matrix would be a good solution actually.

another approach is to think about how different or similar the product profiles are.

orders <- read.csv(header = TRUE, text ='
"row", "order", "product"
1,   319631, "34in Ultrawide Monitor"
2,   319631, "Lightning Charging Cable"
3,   319596, "iPhone"
4,   319596, "Lightning Charging Cable"
5,   319584, "iPhone"
6,   319584, "Wired Headphones"
7,   319556, "Google Phone"
8,   319556, "Wired Headphones"')  |>
  dplyr::mutate(product = trimws(product))

df <- tidyr::pivot_wider(orders,
                   values_from = product, 
                   names_from = product, 
                   id_cols = order) |>
  dplyr::mutate(across( `34in Ultrawide Monitor`:`Google Phone` ,
                        ~!is.na(.x))) |>
  select(-order) 
cor(df)

dist(t(df))
dist(t(df), method = "binary")

CodePudding user response：

I initially closed your question as a duplicate of Creating co-occurrence matrix, but actually, solving your problem needs much more code. You don't just want a co-occurrence matrix; you want a data frame with co-occurrence frequency in decreasing order.

Here is your sample data frame:

dat <- structure(list(Order_ID = c(319631L, 319631L, 319596L, 319596L, 
319584L, 319584L, 319556L, 319556L), Product = c("34in Ultrawide Monitor", 
"Lightning Charging Cable", "iPhone", "Lightning Charging Cable", 
"iPhone", "Wired Headphones", "Google Phone", "Wired Headphones"
)), class = "data.frame", row.names = c(NA, -8L))

First, we coerce Product to a factor:

dat$Product <- as.factor(dat$Product)

Then, we create a sparse co-occurrence matrix:

library(Matrix)
mat <- crossprod(xtabs(~ Order_ID   Product, dat, sparse = TRUE))

The novel part of this Q & A starts here.

## convert mat to (i, j, x) data frame
out <- summary(mat)
## filter out i == j
out <- subset(out, i < j)
## sort frequency in decreasing order
out <- out[order(out$x, decreasing = TRUE), ]
## now replace (i, j) by actual product names
out$i <- levels(dat$Product)[out$i]
out$j <- levels(dat$Product)[out$j]
## construct the resulting data frame
data.frame(pair = sprintf("(%s, %s)", out$i, out$j), frequency = out$x)
#                                                pair frequency
#1 (34in Ultrawide Monitor, Lightning Charging Cable)         1
#2                 (iPhone, Lightning Charging Cable)         1
#3                   (Google Phone, Wired Headphones)         1
#4                         (iPhone, Wired Headphones)         1

All frequencies are 1, because you did not provide enough data.

Yes! This is what I was looking for. But I don't know why my Python code gives different values. My code could be wrong. I will check it out later. My Python looks at all the combinations that can be make with 2 products and then I count the occurrences of each possible combination.

I understand. I also implemented the same logic here: Build a sparse matrix with items coexistence frequency (to analyze cross-selling of products)(see function Contingency). I was later referred to the more elegant and efficient solution using xtabs. But both methods give identical results.

Did you assume that each order only contains two products, in your Python code? This may be too restrictive. Of course, this may be true in your case; at least it is something I could spot from your sample data frame of 8 rows.

Yes! Only two. I know it's not realistic but for now it is enough for me.

My answer does not assume only 2 products per order. It is more general.

CodePudding user response：

I'll just leave this here for you as an alternative for you.

Here I list the unique combinations in two data.frames and check for sameness using a nested apply functions, counting the results with rowsums after cbinding

a <- expand.grid(a = df$Product,b = df$Product) |>
  rowwise() |> 
  mutate(c = list(sort(c(a, b))), a = c[[1]], b = c[[2]]) |> 
  distinct() |> 
  filter(a != b)
  
  b <- df |> 
  group_by(Order_ID) |> 
  summarise(Product = list(c(Product)))

     
a$count <- rowSums(do.call(cbind, 
lapply(b$Product, \(one) sapply(a$c, \(two)  (all(two %in% one))))))


   a                      b                      count
   <chr>                  <chr>                  <dbl>
 1 34inUltrawideMonitor   LightningChargingCable     1
 2 34inUltrawideMonitor   iPhone                     0
 3 34inUltrawideMonitor   WiredHeadphones            0
 4 34inUltrawideMonitor   GooglePhone                0
 5 iPhone                 LightningChargingCable     1
 6 LightningChargingCable WiredHeadphones            0
 7 GooglePhone            LightningChargingCable     0
 8 iPhone                 WiredHeadphones            1
 9 GooglePhone            iPhone                     0
10 GooglePhone            WiredHeadphones            1

CodePudding user response：

One way to quickly find the counts of each combination follows. I have altered the example dataframe slightly to show that the output gives the count.

require(data.table)

dat <- data.frame(Order_ID = c(319631L, 319631L, 319596L, 319596L, 
319584L, 319584L, 319556L, 319556L, 319567L, 319567L), Product = c("34in Ultrawide Monitor", 
"Lightning Charging Cable", "iPhone", "Lightning Charging Cable", 
"iPhone", "Wired Headphones", "Google Phone", "Wired Headphones","iPhone", "Wired Headphones"
))

dat <- data.table(dat)
ans = dcast(dat, Order_ID~Product)

ans
   Order_ID 34in Ultrawide Monitor Google Phone Lightning Charging Cable Wired Headphones iPhone
1:   319556                   <NA> Google Phone                     <NA> Wired Headphones   <NA>
2:   319567                   <NA>         <NA>                     <NA> Wired Headphones iPhone
3:   319584                   <NA>         <NA>                     <NA> Wired Headphones iPhone
4:   319596                   <NA>         <NA> Lightning Charging Cable             <NA> iPhone
5:   319631 34in Ultrawide Monitor         <NA> Lightning Charging Cable             <NA>   <NA>
 
aa = transpose(ans, make.names = T)
aa
                 319556           319567           319584                   319596                   319631
1:             <NA>             <NA>             <NA>                     <NA>   34in Ultrawide Monitor
2:     Google Phone             <NA>             <NA>                     <NA>                     <NA>
3:             <NA>             <NA>             <NA> Lightning Charging Cable Lightning Charging Cable
4: Wired Headphones Wired Headphones Wired Headphones                     <NA>                     <NA>
5:             <NA>           iPhone           iPhone                   iPhone                     <NA>
 
suppressWarnings(summary(factor(aa)))  ##Gives the count of each combination
                        c(NA, NA, NA, "Wired Headphones", "iPhone") 
                                                                  2 
                c(NA, NA, "Lightning Charging Cable", NA, "iPhone") 
                                                                  1 
                  c(NA, "Google Phone", NA, "Wired Headphones", NA) 
                                                                  1 
c("34in Ultrawide Monitor", NA, "Lightning Charging Cable", NA, NA) 
                                                                  1

By placing the last line of code into an object called "new", you can make a new data frame for the output:

ansframe = data.frame(products = names(new), count = as.numeric(new))
ansframe
                                                             products count
1                         c(NA, NA, NA, "Wired Headphones", "iPhone")     2
2                 c(NA, NA, "Lightning Charging Cable", NA, "iPhone")     1
3                   c(NA, "Google Phone", NA, "Wired Headphones", NA)     1
4 c("34in Ultrawide Monitor", NA, "Lightning Charging Cable", NA, NA)     1