I'm trying to calculate favorite brand per productcategory per customer.
1. Start and what I want to achieve
I start with a dataframe (mydata) with purchase orders, build up like this:
customer | ordernumber | productcategory | brand | amount |
---|---|---|---|---|
ABC | 123456 | H11_plumbing | Nitrofill | 6 |
ABC | 123457 | H11_plumbing | Antileak | 2 |
DEF | 123458 | H11_plumbing | Nitrofill | 1 |
DEF | 123459 | H11_plumbing | Antileak | 5 |
ABC | 123460 | H12_electric | Shock | 10 |
ABC | 123461 | H12_electric | Lightning | 5 |
DEF | 123462 | H12_electric | Shock | 4 |
DEF | 123463 | H12_electric | Lightning | 8 |
What I want to achieve is list per customer the favorite brand per productcategory.
customer | H11_plumbing_favorite_brand | H12_electric_favorite_brand |
---|---|---|
ABC | Nitrofill | Shock |
DEF | Antileak | Lightning |
For customer ABC Nitrofill (amount = 6) and Shock (amount = 10) are favorites
2. What I am doing now
What I now did was create a list of vectors for each productcategory and change the shape from long to wide using lapply to use data.table::dcast
df_list <- split(mydata, as.factor(mydata$productcategory)) # create list of vectors
library(data.table)
df_list_2 <- lapply(df_list,function(x) x <- data.table::dcast(setDT(x), customer ~ brand, sum, value.var = c("amount"))) # change shape from long to wide
3. Where I get stuck is finding and returning the column with the favorite brand
This is where I get stuck. I have been able to do this for a data.frame (vector) rather than a list of vectors by using this code:
mydata_t <- mydata[mydata$productcategory=="H12_electric",]
mydata_overview<- data.table::dcast(setDT(mydata_t), customer ~ brand, sum, value.var = c("amount"))
rm(mydata_t)
mydata_overview$favorite_brand <- apply(mydata_overview[,-c(1)],1,function(x) which(x==max(x)))
However, if I try to use this code on the list of vectors (df_list
) then it doesn't work.
df_list_3 <- lapply(df_list,function(x) x$favorite_brand<- apply(x[,-c(1)],1,function(y) which(y==max(y))))
rm(df_list_t)
Any suggestions?
CodePudding user response:
Here is a solution using data.table
:
> library(data.table)
> setDT(dat)
> dcast(dat[ , .(favourite=brand[which.max(amount)]), by=.(customer, productcategory) ], customer ~ productcategory)
customer H11_plumbing H12_electric
1: ABC Nitrofill Shock
2: DEF Antileak Lightning
To break it down:
We can use data.table
to find the brand corresponding to the maximum amount for each combination of customer and category as follows:
> dat[ , .(favourite=brand[which.max(amount)]), by=.(customer, productcategory) ]
customer productcategory favourite
1: ABC H11_plumbing Nitrofill
2: DEF H11_plumbing Antileak
3: ABC H12_electric Shock
4: DEF H12_electric Lightning
Then use dcast
to reshape this table with customer
as the column and productcategory
along the rows.
CodePudding user response:
Here is one approach. Create a favorite
column that includes the brand
where you have the max
imum amount
(for each customer
and productcategory
combination). Then, use dcast
to put into wide form, using the new favorite
category as your value. The unique
is added to prevent duplicates, which would force the value to be the number of entries when aggregating.
library(data.table)
setDT(df)
df[, favorite := brand[which.max(amount)], by = .(customer, productcategory)][
, dcast(unique(.SD, by = c("customer", "productcategory")),
customer ~ productcategory,
value.var = "favorite")
]
Output
customer H11_plumbing H12_electric
1: ABC Nitrofill Shock
2: DEF Antileak Lightning
Data
df <- structure(list(customer = c("ABC", "ABC", "DEF", "DEF", "ABC",
"ABC", "DEF", "DEF"), ordernumber = 123456:123463, productcategory = c("H11_plumbing",
"H11_plumbing", "H11_plumbing", "H11_plumbing", "H12_electric",
"H12_electric", "H12_electric", "H12_electric"), brand = c("Nitrofill",
"Antileak", "Nitrofill", "Antileak", "Shock", "Lightning", "Shock",
"Lightning"), amount = c(6L, 2L, 1L, 5L, 10L, 5L, 4L, 8L)), class = "data.frame", row.names = c(NA,
-8L))