I have been trying do an data analyst exercise in R with date about sales. The dataframe is like:
Order_ID Product
<dbl> <chr>
1 319631 34in Ultrawide Monitor
2 319631 Lightning Charging Cable
3 319596 iPhone
4 319596 Lightning Charging Cable
5 319584 iPhone
6 319584 Wired Headphones
7 319556 Google Phone
8 319556 Wired Headphones
And I have to find which products are bought together most often, the Order_ID has duplicates, i.e. they were purchased by the same person.
I did this excercise in Python but I can't do it on R. My Python code was:
pares_compras[['Order ID', 'Product']]
> Order ID Product
2 176560 Google Phone
3 176560 Wired Headphones
17 176574 Google Phone
18 176574 USB-C Charging Cable
29 176585 Bose SoundSport Headphones
pares_compras.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
>
2 Google Phone,Wired Headphones
3 Google Phone,Wired Headphones
17 Google Phone,USB-C Charging Cable
18 Google Phone,USB-C Charging Cable
pares_compras = pares_compras[['Order ID', 'Grouped Products']].drop_duplicates()
pares_compras
> Order ID Grouped Products
2 176560 Google Phone,Wired Headphones
17 176574 Google Phone,USB-C Charging Cable
29 176585 Bose SoundSport Headphones,Bose SoundSport Hea...
31 176586 AAA Batteries (4-pack),Google Phone
118 176672 Lightning Charging Cable,USB-C Charging Cable
count = Counter()
for row in pares_compras['Grouped Products']:
row_list = row.split(',')
count.update(Counter(combinations(row_list, 2)))
count
> Counter({('Google Phone', 'Wired Headphones'): 414,
('Google Phone', 'USB-C Charging Cable'): 987,
('Bose SoundSport Headphones', 'Bose SoundSport Headphones'): 27, ... )}
for key, num in count.most_common(5):
print(key, num)
>
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
So, with that I could solve the exercise but, like I said before, I can't do the same in R, I don't find the way, I've just started in R. If anyone can help me I would really appreciate it, thnaks.
CodePudding user response:
Hmm I think that a co-occurence matrix would be a good solution actually.
another approach is to think about how different or similar the product profiles are.
orders <- read.csv(header = TRUE, text ='
"row", "order", "product"
1, 319631, "34in Ultrawide Monitor"
2, 319631, "Lightning Charging Cable"
3, 319596, "iPhone"
4, 319596, "Lightning Charging Cable"
5, 319584, "iPhone"
6, 319584, "Wired Headphones"
7, 319556, "Google Phone"
8, 319556, "Wired Headphones"') |>
dplyr::mutate(product = trimws(product))
df <- tidyr::pivot_wider(orders,
values_from = product,
names_from = product,
id_cols = order) |>
dplyr::mutate(across( `34in Ultrawide Monitor`:`Google Phone` ,
~!is.na(.x))) |>
select(-order)
cor(df)
dist(t(df))
dist(t(df), method = "binary")
CodePudding user response:
I initially closed your question as a duplicate of Creating co-occurrence matrix, but actually, solving your problem needs much more code. You don't just want a co-occurrence matrix; you want a data frame with co-occurrence frequency in decreasing order.
Here is your sample data frame:
dat <- structure(list(Order_ID = c(319631L, 319631L, 319596L, 319596L,
319584L, 319584L, 319556L, 319556L), Product = c("34in Ultrawide Monitor",
"Lightning Charging Cable", "iPhone", "Lightning Charging Cable",
"iPhone", "Wired Headphones", "Google Phone", "Wired Headphones"
)), class = "data.frame", row.names = c(NA, -8L))
First, we coerce Product
to a factor:
dat$Product <- as.factor(dat$Product)
Then, we create a sparse co-occurrence matrix:
library(Matrix)
mat <- crossprod(xtabs(~ Order_ID Product, dat, sparse = TRUE))
The novel part of this Q & A starts here.
## convert mat to (i, j, x) data frame
out <- summary(mat)
## filter out i == j
out <- subset(out, i < j)
## sort frequency in decreasing order
out <- out[order(out$x, decreasing = TRUE), ]
## now replace (i, j) by actual product names
out$i <- levels(dat$Product)[out$i]
out$j <- levels(dat$Product)[out$j]
## construct the resulting data frame
data.frame(pair = sprintf("(%s, %s)", out$i, out$j), frequency = out$x)
# pair frequency
#1 (34in Ultrawide Monitor, Lightning Charging Cable) 1
#2 (iPhone, Lightning Charging Cable) 1
#3 (Google Phone, Wired Headphones) 1
#4 (iPhone, Wired Headphones) 1
All frequencies are 1, because you did not provide enough data.
Yes! This is what I was looking for. But I don't know why my Python code gives different values. My code could be wrong. I will check it out later. My Python looks at all the combinations that can be make with 2 products and then I count the occurrences of each possible combination.
I understand. I also implemented the same logic here: Build a sparse matrix with items coexistence frequency (to analyze cross-selling of products)(see function Contingency
). I was later referred to the more elegant and efficient solution using xtabs
. But both methods give identical results.
Did you assume that each order only contains two products, in your Python code? This may be too restrictive. Of course, this may be true in your case; at least it is something I could spot from your sample data frame of 8 rows.
Yes! Only two. I know it's not realistic but for now it is enough for me.
My answer does not assume only 2 products per order. It is more general.
CodePudding user response:
I'll just leave this here for you as an alternative for you.
Here I list the unique combinations in two data.frames and check for sameness using a nested apply
functions, counting the results with rowsums
after cbind
ing
a <- expand.grid(a = df$Product,b = df$Product) |>
rowwise() |>
mutate(c = list(sort(c(a, b))), a = c[[1]], b = c[[2]]) |>
distinct() |>
filter(a != b)
b <- df |>
group_by(Order_ID) |>
summarise(Product = list(c(Product)))
a$count <- rowSums(do.call(cbind,
lapply(b$Product, \(one) sapply(a$c, \(two) (all(two %in% one))))))
a b count
<chr> <chr> <dbl>
1 34inUltrawideMonitor LightningChargingCable 1
2 34inUltrawideMonitor iPhone 0
3 34inUltrawideMonitor WiredHeadphones 0
4 34inUltrawideMonitor GooglePhone 0
5 iPhone LightningChargingCable 1
6 LightningChargingCable WiredHeadphones 0
7 GooglePhone LightningChargingCable 0
8 iPhone WiredHeadphones 1
9 GooglePhone iPhone 0
10 GooglePhone WiredHeadphones 1
CodePudding user response:
One way to quickly find the counts of each combination follows. I have altered the example dataframe slightly to show that the output gives the count.
require(data.table)
dat <- data.frame(Order_ID = c(319631L, 319631L, 319596L, 319596L,
319584L, 319584L, 319556L, 319556L, 319567L, 319567L), Product = c("34in Ultrawide Monitor",
"Lightning Charging Cable", "iPhone", "Lightning Charging Cable",
"iPhone", "Wired Headphones", "Google Phone", "Wired Headphones","iPhone", "Wired Headphones"
))
dat <- data.table(dat)
ans = dcast(dat, Order_ID~Product)
ans
Order_ID 34in Ultrawide Monitor Google Phone Lightning Charging Cable Wired Headphones iPhone
1: 319556 <NA> Google Phone <NA> Wired Headphones <NA>
2: 319567 <NA> <NA> <NA> Wired Headphones iPhone
3: 319584 <NA> <NA> <NA> Wired Headphones iPhone
4: 319596 <NA> <NA> Lightning Charging Cable <NA> iPhone
5: 319631 34in Ultrawide Monitor <NA> Lightning Charging Cable <NA> <NA>
aa = transpose(ans, make.names = T)
aa
319556 319567 319584 319596 319631
1: <NA> <NA> <NA> <NA> 34in Ultrawide Monitor
2: Google Phone <NA> <NA> <NA> <NA>
3: <NA> <NA> <NA> Lightning Charging Cable Lightning Charging Cable
4: Wired Headphones Wired Headphones Wired Headphones <NA> <NA>
5: <NA> iPhone iPhone iPhone <NA>
suppressWarnings(summary(factor(aa))) ##Gives the count of each combination
c(NA, NA, NA, "Wired Headphones", "iPhone")
2
c(NA, NA, "Lightning Charging Cable", NA, "iPhone")
1
c(NA, "Google Phone", NA, "Wired Headphones", NA)
1
c("34in Ultrawide Monitor", NA, "Lightning Charging Cable", NA, NA)
1
By placing the last line of code into an object called "new", you can make a new data frame for the output:
ansframe = data.frame(products = names(new), count = as.numeric(new))
ansframe
products count
1 c(NA, NA, NA, "Wired Headphones", "iPhone") 2
2 c(NA, NA, "Lightning Charging Cable", NA, "iPhone") 1
3 c(NA, "Google Phone", NA, "Wired Headphones", NA) 1
4 c("34in Ultrawide Monitor", NA, "Lightning Charging Cable", NA, NA) 1