Rank ids according to variable that changes repetitively with value from another variable

Time:10-08

Let's say I have a dataset of thousands of products. For every product I know how it is rated on a main_rating_platform and how it is rated on an alternative_rating_platform. On average, the products are rated worse on the alternative_rating_platform. This is a small example of the dataset:

df <- data.frame(product_id=c("a2a","zyz","xyz","9io","rop"), 
                 main_rating_platform = c(4.07,3.99,4.81,3.71,3.99),
                 alternative_rating_platform = c(3.67,3.59,4.21,3.71,3.67))

My end goal is a ranking of the product_ids on the main_rating_platform in which only the product in question is rated with its rating from the alternative_rating_platform, while every other product keeps its rating from the main_rating_platform.

What I have tried, and by now know how to do, is this (and yes, I do want products with the same rating to carry the same rank :) ):

library(dplyr)
df <- df %>% mutate(ranking_mainplatform = dense_rank(desc(main_rating_platform)))
df <- df %>% mutate(ranking_alternativeplatform = dense_rank(desc(alternative_rating_platform)))

But this is not what I need. I want to know which rank product_id a2a would have if it were rated with its rating from the alternative_rating_platform while, ceteris paribus, all other products kept their rating from the main_rating_platform. For example, product a2a would suddenly be rated with 3.67 stars instead of 4.07. Instead of being the second-best product, it would then be the worst, thus rank 5.

This should be the variable I hope to get eventually:

df$newranking_for_this_product_on_main_platform_but_with_rating_from_alternative_platform_ceterisparibus <- c(5,5,1,4,5)

I struggle to get my head around loops. If there is a solution that works without a loop and is computationally friendly with big data, it would be great. But if a loop is a must here, then so be it =)

CodePudding user response:

This might not scale well, because you ultimately iterate through every row twice (there are no explicit for loops, but map and imap loop under the hood). The code below first goes through each product_id and builds a copy of the main ratings in which only that row's rating is replaced by its alternative rating. It then goes back through these lists and picks out the rank of the substituted product in each one.

library(dplyr)
library(purrr)

df <- data.frame(product_id=c("a2a","zyz","xyz","9io","rop"), 
                 main_rating_platform = c(4.07,3.99,4.81,3.71,3.99),
                 alternative_rating_platform = c(3.67,3.59,4.21,3.71,3.67))

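
df <- df |> 
  # For row i: copy the main ratings and swap in row i's alternative rating
  mutate(sub1Ratings = map(
    seq_along(product_id), 
    function(i, main, alt) {
      main[i] <- alt[i]
      main
    }, 
    main = main_rating_platform, 
    alt = alternative_rating_platform
  ))|> 
  # Rank each substituted vector (.x) and keep the rank at the row's own position (.y)
  mutate(
    sub1Rank = imap_int(sub1Ratings, ~dense_rank(desc(.x))[.y])
  ) 
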
as.integer(df$sub1Rank)
#> [1] 4 5 1 4 5

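The output rankings don't exactly match what you had in the question because of how dense_rank handles ties: tied products share a rank, and the next rank follows without a gap. With a2a at 3.67, the two products at 3.99 share rank 2, 3.71 gets rank 3, and a2a therefore gets rank 4 rather than 5.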
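
If the row-by-row approach turns out to be too slow on thousands of products, the same ranks can probably be computed without building one list per product. What follows is only a sketch in base R, not something taken from the code above: rank_if_substituted and its ties argument are made-up names, and it assumes there are no missing ratings. The idea is that a product's rank is one plus the number of main ratings above its alternative rating (distinct ratings for dense ranking), not counting its own main rating.

```r
# Hypothetical helper (illustrative name, not from the original answer):
# rank of each product if only its own main rating were replaced by its
# alternative rating. Assumes no NA values in either vector.
rank_if_substituted <- function(main, alt, ties = c("dense", "min")) {
  ties <- match.arg(ties)
  # Dense ranking only cares about distinct ratings; "min" counts every product.
  ref <- if (ties == "dense") sort(unique(main)) else sort(main)
  # findInterval() returns how many sorted values are <= alt,
  # so the remainder is the number of values strictly above alt.
  n_greater <- length(ref) - findInterval(alt, ref)
  # The product's own main rating was included in that count if it exceeds alt.
  own_counted <- main > alt
  if (ties == "dense") {
    # For dense ranking the value only disappears if no other product shares it.
    idx <- match(main, ref)
    own_counted <- own_counted & tabulate(idx)[idx] == 1L
  }
  as.integer(1L + n_greater - own_counted)
}

main <- c(4.07, 3.99, 4.81, 3.71, 3.99)
alt  <- c(3.67, 3.59, 4.21, 3.71, 3.67)

rank_if_substituted(main, alt)
#> [1] 4 5 1 4 5
rank_if_substituted(main, alt, ties = "min")
#> [1] 5 5 1 5 5
```

The default reproduces the dense_rank result above. With ties = "min" (the convention of dplyr's min_rank, where tied products leave a gap), a2a does get rank 5, but 9io then also gets 5 instead of 4, so neither tie rule reproduces the vector from the question exactly.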
