R dplyr: How do I apply a less than / greater than mapping table across a large dataset efficiently?

I have a large dataset (~1M rows) with, among other columns, a score for each customer record. The score is between 0 and 100.

What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.

# Generate Example Customer Data
library(dplyr)  # provides tibble() and the pipe used below

set.seed(1)
n_customers <- 10

customer_df <-
  tibble(id = 1:n_customers,
         score = sample(50:80, n_customers, replace = TRUE))

# Rating Map
rating_map <- tibble(
  max = c(
    47.0,
    53.0,
    57.0,
    60.5,
    63.0,
    65.5,
    67.3,
    69.7,
    71.7,
    74.0,
    76.3,
    79.0,
    82.5,
    85.5,
    100.00
  ),
  rating = 15:1
)

The best code that I've come up with to map the rating table onto the customer score data is as follows.

library(purrr)  # map()
library(tidyr)  # unnest()

customer_df <-
  customer_df %>%
  # for each score, keep the rating rows whose max exceeds it,
  # then take the highest rating among them
  mutate(rating = map(.x = score,
                      .f = ~ max(select(filter(rating_map, .x < max), rating)))) %>%
  unnest(rating)

The problem I'm having is that while this works, it is extremely inefficient. If you set n_customers to 100k in the above code, you can get a sense of how long it takes to run.

customer_df

# A tibble: 10 x 3
      id score rating
   <int> <int>  <int>
 1     1    74      5
 2     2    53     13
 3     3    56     13
 4     4    50     14
 5     5    51     14
 6     6    78      4
 7     7    72      6
 8     8    60     12
 9     9    63     10
10    10    67      9
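
For reference, a rough timing harness along these lines (using the packages loaded above; n and big_df are just illustrative names for a scaled-up copy of the data) makes the slowdown easy to measure:

n <- 100000
big_df <- tibble(id = 1:n,
                 score = sample(50:80, n, replace = TRUE))

system.time(
  big_df %>%
    mutate(rating = map(score, ~ max(select(filter(rating_map, .x < max), rating)))) %>%
    unnest(rating)
)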

I need to speed up the code because it's currently taking over an hour to run. I've identified the bottleneck as my use of purrr::map(). So my question is: how can I replicate the above results without using map()?

Thanks!

CodePudding user response:

We could do a non-equi join

library(data.table)

# non-equi join: for each customer row, take the first rating_map row
# whose max exceeds the customer's score
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]

-output

      max rating    id
    <int>  <int> <int>
 1:    74      5     1
 2:    53     13     2
 3:    56     13     3
 4:    50     14     4
 5:    51     14     5
 6:    78      4     6
 7:    72      6     7
 8:    60     12     8
 9:    63     10     9
10:    67      9    10
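
Note that in the join output the max column now holds the matched customer scores, and the original column order is gone. If you just want to append a rating column to customer_df, one variant of the same join (a sketch, not part of the original answer) is:

library(data.table)
setDT(rating_map)  # rating_map becomes a data.table; customer_df can stay a tibble

# extract only the matched rating and attach it to the original data
customer_df$rating <-
  rating_map[customer_df, on = .(max > score), mult = "first"]$rating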

Another option in base R is findInterval

# findInterval() counts how many break points are <= each score
customer_df$rating <- nrow(rating_map) -
  findInterval(customer_df$score, rating_map$max)

-output

> customer_df
   id score rating
1   1    74      5
2   2    53     13
3   3    56     13
4   4    50     14
5   5    51     14
6   6    78      4
7   7    72      6
8   8    60     12
9   9    63     10
10 10    67      9
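
To see why this works: findInterval(x, vec) returns, for each value of x, the number of break points in vec that are less than or equal to it, and subtracting that count from the 15 rows flips it onto the 15-to-1 rating scale. A worked check on a single score:

# score 74: ten break points (47, 53, ..., 74) are <= 74
findInterval(74, rating_map$max)                     # 10
nrow(rating_map) - findInterval(74, rating_map$max)  # 15 - 10 = 5, matching id 1 above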

CodePudding user response:

This produces the same output and is much faster: it takes about 1/20th of a second on 1M rows, which works out to a speedup of more than 72,000x.

It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.

cut divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.

In this case you want the lowest rating for the highest score, hence the subtraction of the cut term.

customer_df %>%
  # cut() returns the interval index (1 = lowest scores), so subtract
  # from 15 to flip it onto the 15-to-1 rating scale
  mutate(rating = length(rating_map$max) -
           cut(score, breaks = rating_map$max, labels = FALSE, right = FALSE))

EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
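
If you want to verify the timings on your own machine, a minimal benchmark sketch (assuming the bench package is installed; n and big_df are illustrative names) is:

library(bench)
library(dplyr)

n <- 1e6
big_df <- tibble(id = 1:n, score = sample(50:80, n, replace = TRUE))

bench::mark(
  find_interval = nrow(rating_map) - findInterval(big_df$score, rating_map$max),
  cut_based     = length(rating_map$max) -
    cut(big_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
)  # bench::mark checks by default that both expressions return identical results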
