Passing Large data.frames to Apply-like Functions Efficiently in R-CodePudding

This is a question about resources and efficiency when passing large data sets to an apply-like function.

Example

[EDIT: changed example and discription to illustrate use of multiple tables and a calculation step per @UWE's comment]

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5 ...[snip]...

set.seed(10)

# Objective: Add period-cost per portion of person's selected fruit 

df.A <- data.frame(
  period = rep(1:3, times = 1, each = 4),
  Name = rep(c("John", "Paul","Ringo", "George"), times = 3),
  Fruit = sample(c("Apple", "Pear", "Banana", "Apple"), size = 12, replace = TRUE)
) # extend to many people, many periods, many fruit

df.B <- data.frame(
  Fruit = c("Pear", "Apple", "Banana"), 
  id = c(1, 2, 3),
  pound.per.portion = c(0.396832,0.440925,0.299829)
) # one entry per fruit

df.C <- data.frame(
  id = rep(1:3, times=3),
  period = rep(1:3, times = 1, each = 3),
  price.pound = c(2.33, 0.99, 2.15, 2.38, 1.01, 2.20, 2.42, 1.04, 2.25)
) # one entry per fruit per period

df.A
#>    period   Name  Fruit
#> 1       1   John Banana
#> 2       1   Paul  Apple
#> 3       1  Ringo   Pear
#> 4       1 George  Apple
#> 5       2   John  Apple
#> 6       2   Paul Banana
#> 7       2  Ringo  Apple
#> 8       2 George   Pear
#> 9       3   John Banana
#> 10      3   Paul Banana
#> 11      3  Ringo Banana
#> 12      3 George  Apple
df.B
#>    Fruit id pound.per.portion
#> 1   Pear  1          0.396832
#> 2  Apple  2          0.440925
#> 3 Banana  3          0.299829
df.C
#>   id period price.pound
#> 1  1      1        2.33
#> 2  2      1        0.99
#> 3  3      1        2.15
#> 4  1      2        2.38
#> 5  2      2        1.01
#> 6  3      2        2.20
#> 7  1      3        2.42
#> 8  2      3        1.04
#> 9  3      3        2.25

df.A$portion.price <- apply(df.A, MARGIN = 1, 
                           FUN = function(x, legend, prices){
                             # please ignore efficiency of this function
                             # the internal function is not the focus of the question
                             fruit.info <- df.B[df.B$Fruit == x[["Fruit"]],]

                             cost <- df.C %>% 
                               filter(period == x[["period"]],
                                      id == fruit.info[["id"]]) %>%
                               select(price.pound) %>%
                               `*`(fruit.info$pound.per.portion)
                             cost[[1]]
                           }, 
                           legend = df.B, prices = df.C) 
                            # Question relates to passing of legend and prices
                            # if `apply` passes df.B and df.C many times
                            # and df.B, df.C are large - is this inefficient, is there a better way

head(df.A, 5)  
#>   period   Name  Fruit portion.price
#> 1      1   John Banana     0.6446323
#> 2      1   Paul  Apple     0.4365157
#> 3      1  Ringo   Pear     0.9246186
#> 4      1 George  Apple     0.4365157
#> 5      2   John  Apple     0.4453343

^{Created on 2022-05-20 by the reprex package (v2.0.1)}

The objective in this example is to add a column to df.A that shows the portion cost for the fruit selected by a particular person in a particular period.

Three sets of data exist, though none of them alone have all the necessary information for the calculation.

df.A contains the people, the period and the name of the fruit they selected. An entry exists for each person in each period.

df.C has price information for the fruit by period, but the price is expressed as a price per pound, not portion, and the data set does not recognize names of fruit (only an id number). An entry exists for each fruit in each period.

df.B provides the missing information. First it defines df.C$id for df.A$Name, and it provides a factor to convert the cost per pound to a cost per portion. Only a single entry is required for each fruit.

For each row of df.A, apply passes the person's fruit name and period, along with the two reference sets (df.B and df.C) to the function. The function looks up the necessary infromation from df.B which it uses to reference data from df.C, which is then used to calculate a cost per portion (returned).

The function itself is not important to this question other than to illustrate use of multiple data sets to find a value for each row.

This example is simple (four people, three periods, three fruit) and very manageble with apply; however, each of these data sets could, in theory, comprise of thousands of rows.

Discussion Topic Starts Here

If I understand correctly, r passes values not references. I beleive that means that the apply function in the above example creates a new copy of df.B and df.C for each row of df.A. Assuming that is correct, this does not feel efficient, especally if the data sets are large.

Is there a better solution than apply for this sort of look up/processing when using large data sets?

I know rcpp functions can use references instead of values. Would one build a custom rccp function that acts like apply only using references, or is there a standard, off-the-shelf method?

CodePudding user response：

Using the apply() function with data.frames has a major drawback because apply() coerces the data.frame to a matrix before proceeding (see section 8.2.38 of Patrick Burns' The R Inferno).

As all elements of a matrix need to be of the same type all columns of the data.frame are coerced to one common data type.

This can be verified by

apply(df.A, MARGIN = 2, str)

 chr [1:12] "1" "1" "1" "1" "2" "2" "2" "2" "3" "3" "3" "3"
 chr [1:12] "John" "Paul" "Ringo" "George" "John" "Paul" "Ringo" "George" "John" "Paul" "Ringo" "George"
 chr [1:12] "Banana" "Apple" "Pear" "Apple" "Apple" "Banana" "Apple" "Pear" "Banana" "Banana" "Banana" ...

Here, also the integer column period is coerced to type character. This is costly and probably creates a copy of all data.

So what can we do instead to achieve OP's goal:

The objective in this example is to add a column to df.A that shows the portion cost for the fruit selected by a particular person in a particular period.

IMHO, the best way to achieve the objective is by joining two times.

First, a lookup table lut is created which contains the portion.price for each Fruit and period. Then, an update join is used to append the column to df.A:

library(data.table)
lut <- setDT(df.B)[df.C, on = .(id)][, portion.price := pound.per.portion * price.pound][]
setDT(df.A)[lut, on = .(Fruit, period), portion.price := i.portion.price][]

    period   Name  Fruit portion.price
 1:      1   John Banana     0.6446323
 2:      1   Paul  Apple     0.4365157
 3:      1  Ringo   Pear     0.9246186
 4:      1 George  Apple     0.4365157
 5:      2   John  Apple     0.4453343
 6:      2   Paul Banana     0.6596238
 7:      2  Ringo  Apple     0.4453343
 8:      2 George   Pear     0.9444602
 9:      3   John Banana     0.6746152
10:      3   Paul Banana     0.6746152
11:      3  Ringo Banana     0.6746152
12:      3 George  Apple     0.4585620

data.table was created to deal efficiently with large datasets.

Alternatively, SQL can be used:

sqldf::sqldf("
select period, Name, Fruit, `portion.price` from `df.A` 
  left join (
    select Fruit, period, 
      `pound.per.portion` * `price.pound` as `portion.price` from `df.B`  
      join `df.C` using(id) 
       ) using(period, Fruit)
")

   period   Name  Fruit portion.price
1       1   John Banana     0.6446323
2       1   Paul  Apple     0.4365157
3       1  Ringo   Pear     0.9246186
4       1 George  Apple     0.4365157
5       2   John  Apple     0.4453343
6       2   Paul Banana     0.6596238
7       2  Ringo  Apple     0.4453343
8       2 George   Pear     0.9444602
9       3   John Banana     0.6746152
10      3   Paul Banana     0.6746152
11      3  Ringo Banana     0.6746152
12      3 George  Apple     0.4585620

Note that some table and column names are enclosed in backticks because periods have a special meaning in SQL