This is a question about resources and efficiency when passing large data sets to an apply-like function.
Example
[EDIT: changed example and discription to illustrate use of multiple tables and a calculation step per @UWE's comment]
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5 ...[snip]...
set.seed(10)
# Objective: Add period-cost per portion of person's selected fruit
df.A <- data.frame(
period = rep(1:3, times = 1, each = 4),
Name = rep(c("John", "Paul","Ringo", "George"), times = 3),
Fruit = sample(c("Apple", "Pear", "Banana", "Apple"), size = 12, replace = TRUE)
) # extend to many people, many periods, many fruit
df.B <- data.frame(
Fruit = c("Pear", "Apple", "Banana"),
id = c(1, 2, 3),
pound.per.portion = c(0.396832,0.440925,0.299829)
) # one entry per fruit
df.C <- data.frame(
id = rep(1:3, times=3),
period = rep(1:3, times = 1, each = 3),
price.pound = c(2.33, 0.99, 2.15, 2.38, 1.01, 2.20, 2.42, 1.04, 2.25)
) # one entry per fruit per period
df.A
#> period Name Fruit
#> 1 1 John Banana
#> 2 1 Paul Apple
#> 3 1 Ringo Pear
#> 4 1 George Apple
#> 5 2 John Apple
#> 6 2 Paul Banana
#> 7 2 Ringo Apple
#> 8 2 George Pear
#> 9 3 John Banana
#> 10 3 Paul Banana
#> 11 3 Ringo Banana
#> 12 3 George Apple
df.B
#> Fruit id pound.per.portion
#> 1 Pear 1 0.396832
#> 2 Apple 2 0.440925
#> 3 Banana 3 0.299829
df.C
#> id period price.pound
#> 1 1 1 2.33
#> 2 2 1 0.99
#> 3 3 1 2.15
#> 4 1 2 2.38
#> 5 2 2 1.01
#> 6 3 2 2.20
#> 7 1 3 2.42
#> 8 2 3 1.04
#> 9 3 3 2.25
df.A$portion.price <- apply(df.A, MARGIN = 1,
FUN = function(x, legend, prices){
# please ignore efficiency of this function
# the internal function is not the focus of the question
fruit.info <- df.B[df.B$Fruit == x[["Fruit"]],]
cost <- df.C %>%
filter(period == x[["period"]],
id == fruit.info[["id"]]) %>%
select(price.pound) %>%
`*`(fruit.info$pound.per.portion)
cost[[1]]
},
legend = df.B, prices = df.C)
# Question relates to passing of legend and prices
# if `apply` passes df.B and df.C many times
# and df.B, df.C are large - is this inefficient, is there a better way
head(df.A, 5)
#> period Name Fruit portion.price
#> 1 1 John Banana 0.6446323
#> 2 1 Paul Apple 0.4365157
#> 3 1 Ringo Pear 0.9246186
#> 4 1 George Apple 0.4365157
#> 5 2 John Apple 0.4453343
Created on 2022-05-20 by the reprex package (v2.0.1)
The objective in this example is to add a column to df.A that shows the portion cost for the fruit selected by a particular person in a particular period.
Three sets of data exist, though none of them alone have all the necessary information for the calculation.
df.A
contains the people, the period and the name of the fruit they selected. An entry exists for each person in each period.
df.C
has price information for the fruit by period, but the price is expressed as a price per pound, not portion, and the data set does not recognize names of fruit (only an id number). An entry exists for each fruit in each period.
df.B
provides the missing information. First it defines df.C$id
for df.A$Name
, and it provides a factor to convert the cost per pound to a cost per portion. Only a single entry is required for each fruit.
For each row of df.A
, apply
passes the person's fruit name and period, along with the two reference sets (df.B
and df.C
) to the function. The function looks up the necessary infromation from df.B
which it uses to reference data from df.C
, which is then used to calculate a cost per portion (returned).
The function itself is not important to this question other than to illustrate use of multiple data sets to find a value for each row.
This example is simple (four people, three periods, three fruit) and very manageble with apply
; however, each of these data sets could, in theory, comprise of thousands of rows.
Discussion Topic Starts Here
If I understand correctly, r
passes values not references. I beleive that means that the apply
function in the above example creates a new copy of df.B
and df.C
for each row of df.A
. Assuming that is correct, this does not feel efficient, especally if the data sets are large.
Is there a better solution than apply
for this sort of look up/processing when using large data sets?
I know rcpp
functions can use references instead of values. Would one build a custom rccp
function that acts like apply
only using references, or is there a standard, off-the-shelf method?
CodePudding user response:
Using the apply()
function with data.frames has a major drawback because apply()
coerces the data.frame to a matrix before proceeding (see section 8.2.38 of Patrick Burns' The R Inferno).
As all elements of a matrix need to be of the same type all columns of the data.frame are coerced to one common data type.
This can be verified by
apply(df.A, MARGIN = 2, str)
chr [1:12] "1" "1" "1" "1" "2" "2" "2" "2" "3" "3" "3" "3" chr [1:12] "John" "Paul" "Ringo" "George" "John" "Paul" "Ringo" "George" "John" "Paul" "Ringo" "George" chr [1:12] "Banana" "Apple" "Pear" "Apple" "Apple" "Banana" "Apple" "Pear" "Banana" "Banana" "Banana" ...
Here, also the integer column period
is coerced to type character. This is costly and probably creates a copy of all data.
So what can we do instead to achieve OP's goal:
The objective in this example is to add a column to df.A that shows the portion cost for the fruit selected by a particular person in a particular period.
IMHO, the best way to achieve the objective is by joining two times.
First, a lookup table lut
is created which contains the portion.price
for each Fruit
and period
. Then, an update join is used to append the column to df.A
:
library(data.table)
lut <- setDT(df.B)[df.C, on = .(id)][, portion.price := pound.per.portion * price.pound][]
setDT(df.A)[lut, on = .(Fruit, period), portion.price := i.portion.price][]
period Name Fruit portion.price 1: 1 John Banana 0.6446323 2: 1 Paul Apple 0.4365157 3: 1 Ringo Pear 0.9246186 4: 1 George Apple 0.4365157 5: 2 John Apple 0.4453343 6: 2 Paul Banana 0.6596238 7: 2 Ringo Apple 0.4453343 8: 2 George Pear 0.9444602 9: 3 John Banana 0.6746152 10: 3 Paul Banana 0.6746152 11: 3 Ringo Banana 0.6746152 12: 3 George Apple 0.4585620
data.table
was created to deal efficiently with large datasets.
Alternatively, SQL can be used:
sqldf::sqldf("
select period, Name, Fruit, `portion.price` from `df.A`
left join (
select Fruit, period,
`pound.per.portion` * `price.pound` as `portion.price` from `df.B`
join `df.C` using(id)
) using(period, Fruit)
")
period Name Fruit portion.price 1 1 John Banana 0.6446323 2 1 Paul Apple 0.4365157 3 1 Ringo Pear 0.9246186 4 1 George Apple 0.4365157 5 2 John Apple 0.4453343 6 2 Paul Banana 0.6596238 7 2 Ringo Apple 0.4453343 8 2 George Pear 0.9444602 9 3 John Banana 0.6746152 10 3 Paul Banana 0.6746152 11 3 Ringo Banana 0.6746152 12 3 George Apple 0.4585620
Note that some table and column names are enclosed in backticks because periods have a special meaning in SQL