I have an overview table listing the item count, the actual cost and the predicted cost:
library(data.table)
myData <- data.table("itemCount" = c(3000, 20, 50, 9),
"cost" = c(120, 118, 165, 93),
"prediction" = c(120, 100, 150, 120))
Then I calculate the individual and overall profit:
myData[, "profit" := cost/prediction]
total <- myData[, .(itemsTotal = sum(itemCount),
costTotal = sum(cost),
predictionTotal = sum(prediction))][
, "profit" := costTotal/predictionTotal
]
Now, for every row, I want to calculate what the overall profit would have been if that particular row were excluded from the analysis. For example, if row two were missing:
myData$diffinProfit <- NA
myDataEx <- myData[- 2, ]
totalEx <- myDataEx[, .(itemsTotal = sum(itemCount),
costTotal = sum(cost),
predictionTotal = sum(prediction))][
, "profit" := costTotal/predictionTotal
]
So I wrote a for loop to do this:
myData$diffinProfit <- NA
for(observation in seq_along(length(myData)-1)){
myDataEx <- myData[- observation, ]
totalEx <- myDataEx[, .(itemsTotal = sum(itemCount),
costTotal = sum(cost),
predictionTotal = sum(prediction))][
, "profit" := costTotal/predictionTotal
]
myData$diffinProfit[[observation]] <- totalEx$profit
}
However, I only get a result for the first observation. How can I fix the for loop? Is there any way I could use an apply function? I was considering mapply, or maybe a purrr function.
CodePudding user response:
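The first problem you have is that length(myData) is reporting the number of columns, not the number of rows. On top of that, seq_along() is applied to that single number, so it always returns just 1 and the loop runs exactly once, which is why only the first observation is filled in; seq_len(nrow(myData)) would iterate over every row. But I think we can do without the for loop (though sapply is similar to it under the hood):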
myData[, otherProfit := sapply(seq_len(.N), function(z) sum(cost[-z])/sum(prediction[-z]))]
myData
# itemCount cost prediction profit otherProfit
# <num> <num> <num> <num> <num>
# 1: 3000 120 120 1.000 1.0162162
# 2: 20 118 100 1.180 0.9692308
# 3: 50 165 150 1.100 0.9735294
# 4: 9 93 120 0.775 1.0891892
Although mathematically, it's possible to do it without a loop at all:
sumcost <- sum(myData$cost)
sumpred <- sum(myData$prediction)
myData[, profit2 := (sumcost-cost)/(sumpred-prediction)]
myData
# itemCount cost prediction profit otherProfit profit2
# <num> <num> <num> <num> <num> <num>
# 1: 3000 120 120 1.000 1.0162162 1.0162162
# 2: 20 118 100 1.180 0.9692308 0.9692308
# 3: 50 165 150 1.100 0.9735294 0.9735294
# 4: 9 93 120 0.775 1.0891892 1.0891892
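To see why subtracting works, here is a small sketch in plain vectors (no data.table needed) that checks the idea for row 2: removing a row from a sum is the same as subtracting that row's value from the total. The vector names simply mirror the columns above.

```r
cost       <- c(120, 118, 165, 93)
prediction <- c(120, 100, 150, 120)

sum(cost)        # 496
sum(prediction)  # 490

# Leaving out row 2 by subtracting its values from the totals: 378 / 390
bySubtraction <- (sum(cost) - cost[2]) / (sum(prediction) - prediction[2])

# The same quantity computed by actually dropping row 2
byDropping <- sum(cost[-2]) / sum(prediction[-2])

all.equal(bySubtraction, byDropping)
```

Both expressions give the row 2 value shown in the table above.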
I'm not going to benchmark 4 rows, but I'd be surprised if this second "vectorized" approach weren't more efficient than the sapply above or a for-loop alternative.
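If you do want to keep the for loop, one possible repair is sketched below. It assumes the only real change needed is the iteration range, seq_len(nrow(myData)) instead of seq_along(length(myData)-1); the use of set() for the assignment is my own choice here, not something the original code requires.

```r
library(data.table)

myData <- data.table(itemCount  = c(3000, 20, 50, 9),
                     cost       = c(120, 118, 165, 93),
                     prediction = c(120, 100, 150, 120))

# Pre-allocate the result column as numeric
myData[, diffinProfit := NA_real_]

# seq_len(nrow(myData)) yields 1, 2, 3, 4: one iteration per row
for (observation in seq_len(nrow(myData))) {
  # All rows except the current one
  myDataEx <- myData[-observation]
  # set() assigns a single cell by reference
  set(myData, i = observation, j = "diffinProfit",
      value = sum(myDataEx$cost) / sum(myDataEx$prediction))
}
myData
```

The diffinProfit column should then match otherProfit and profit2 from the approaches above.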
CodePudding user response:
You can make use of row ids and the data.table native .GRP group counter:
library(data.table)
myData <- data.table("itemCount" = c(3000, 20, 50, 9),
"cost" = c(120, 118, 165, 93),
"prediction" = c(120, 100, 150, 120))
myData[, "profit" := cost/prediction]
# assign row ids
myData[, ID := .I]
# loop over each row and take all values that are not in the current row
# .GRP is a group identifier and since you loop over all rows, there are as many groups as rows
myData[, total_profit_excl := myData[ID != .GRP, sum(cost) / sum(prediction)],
by = ID]
myData
#> itemCount cost prediction profit ID total_profit_excl
#> 1: 3000 120 120 1.000 1 1.0162162
#> 2: 20 118 100 1.180 2 0.9692308
#> 3: 50 165 150 1.100 3 0.9735294
#> 4: 9 93 120 0.775 4 1.0891892