Home > front end >  Find and return column with max value from a list of vectors
Find and return column with max value from a list of vectors

Time:03-21

I'm trying to calculate favorite brand per productcategory per customer.

1. Start and what I want to achieve

I start with a dataframe (mydata) with purchase orders, build up like this:

customer ordernumber productcategory brand amount
ABC 123456 H11_plumbing Nitrofill 6
ABC 123457 H11_plumbing Antileak 2
DEF 123458 H11_plumbing Nitrofill 1
DEF 123459 H11_plumbing Antileak 5
ABC 123460 H12_electric Shock 10
ABC 123461 H12_electric Lightning 5
DEF 123462 H12_electric Shock 4
DEF 123463 H12_electric Lightning 8

What I want to achieve is list per customer the favorite brand per productcategory.

customer H11_plumbing_favorite_brand H12_electric_favorite_brand
ABC Nitrofill Shock
DEF Antileak Lightning

For customer ABC Nitrofill (amount = 6) and Shock (amount = 10) are favorites

2. What I am doing now

What I now did was create a list of vectors for each productcategory and change the shape from long to wide using lapply to use data.table::dcast

df_list <- split(mydata, as.factor(mydata$productcategory)) # create list of vectors

library(data.table)
df_list_2 <- lapply(df_list,function(x) x <- data.table::dcast(setDT(x), customer ~ brand, sum, value.var = c("amount"))) # change shape from long to wide

3. Where I get stuck is finding and returning the column with the favorite brand

This is where I get stuck. I have been able to do this for a data.frame (vector) rather than a list of vectors by using this code:

mydata_t <- mydata[mydata$productcategory=="H12_electric",]
mydata_overview<- data.table::dcast(setDT(mydata_t), customer ~ brand, sum, value.var = c("amount"))
rm(mydata_t)
mydata_overview$favorite_brand <- apply(mydata_overview[,-c(1)],1,function(x) which(x==max(x)))

However, if I try to use this code on the list of vectors (df_list) then it doesn't work.

df_list_3 <- lapply(df_list,function(x) x$favorite_brand<- apply(x[,-c(1)],1,function(y) which(y==max(y))))
rm(df_list_t)

Any suggestions?

CodePudding user response:

Here is a solution using data.table:

> library(data.table)
> setDT(dat)
> dcast(dat[ , .(favourite=brand[which.max(amount)]), by=.(customer, productcategory) ], customer ~ productcategory)

   customer H11_plumbing H12_electric
1:      ABC    Nitrofill        Shock
2:      DEF     Antileak    Lightning

To break it down:

We can use data.table to find the brand corresponding to the maximum amount for each combination of customer and category as follows:

> dat[ , .(favourite=brand[which.max(amount)]), by=.(customer, productcategory) ]
   customer productcategory favourite
1:      ABC    H11_plumbing Nitrofill
2:      DEF    H11_plumbing  Antileak
3:      ABC    H12_electric     Shock
4:      DEF    H12_electric Lightning

Then use dcast to reshape this table with customer as the column and productcategory along the rows.

CodePudding user response:

Here is one approach. Create a favorite column that includes the brand where you have the maximum amount (for each customer and productcategory combination). Then, use dcast to put into wide form, using the new favorite category as your value. The unique is added to prevent duplicates, which would force the value to be the number of entries when aggregating.

library(data.table)

setDT(df)

df[, favorite := brand[which.max(amount)], by = .(customer, productcategory)][
  , dcast(unique(.SD, by = c("customer", "productcategory")), 
          customer ~ productcategory, 
          value.var = "favorite")
]

Output

   customer H11_plumbing H12_electric
1:      ABC    Nitrofill        Shock
2:      DEF     Antileak    Lightning

Data

df <- structure(list(customer = c("ABC", "ABC", "DEF", "DEF", "ABC", 
"ABC", "DEF", "DEF"), ordernumber = 123456:123463, productcategory = c("H11_plumbing", 
"H11_plumbing", "H11_plumbing", "H11_plumbing", "H12_electric", 
"H12_electric", "H12_electric", "H12_electric"), brand = c("Nitrofill", 
"Antileak", "Nitrofill", "Antileak", "Shock", "Lightning", "Shock", 
"Lightning"), amount = c(6L, 2L, 1L, 5L, 10L, 5L, 4L, 8L)), class = "data.frame", row.names = c(NA, 
-8L))
  • Related