calculate the influence of one observation on overall result in R

Time:08-24

I have an overview table with the item count, the actual cost, and the predicted cost:

library(data.table)

myData <- data.table("itemCount" = c(3000, 20, 50, 9),
                     "cost" = c(120, 118, 165, 93),
                     "prediction" = c(120, 100, 150, 120))

Then I calculate the individual and overall profit:

myData[, "profit" := cost/prediction]

total <- myData[, .(itemsTotal = sum(itemCount),
                costTotal  = sum(cost), 
                predictionTotal = sum(prediction))][
                  , "profit" := costTotal/predictionTotal 
                ]

Now, for every row, I want to calculate what the overall profit would have been if that particular row were excluded from the analysis. For example, if row two were missing:

myData$diffinProfit <- NA
myDataEx <- myData[-2, ]
totalEx <- myDataEx[, .(itemsTotal = sum(itemCount),
                        costTotal  = sum(cost),
                        predictionTotal = sum(prediction))][
                          , "profit" := costTotal/predictionTotal
                        ]

So I wrote a for loop to do this for every row:

myData$diffinProfit <- NA
for(observation in seq_along(length(myData)-1)){
  
  myDataEx <- myData[- observation, ]
  totalEx <- myDataEx[, .(itemsTotal = sum(itemCount),
                          costTotal  = sum(cost), 
                          predictionTotal = sum(prediction))][
                            , "profit" := costTotal/predictionTotal 
                            ]
  
  myData$diffinProfit[[observation]] <- totalEx$profit
  
}

However, I only get a result for the first observation. How can I fix the for loop? Is there any way I could use an apply function? I was considering mapply, or maybe a purrr function?

CodePudding user response:

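The first problem you have is that length(myData) reports the number of columns, not the number of rows. On top of that, seq_along() applied to a single number always returns just 1, which is why only the first observation gets a result; seq_len(nrow(myData)) would iterate over every row. But I think we can do without the for loop (though sapply is essentially a loop under the hood).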

myData[, otherProfit := sapply(seq_len(.N), function(z) sum(cost[-z])/sum(prediction[-z]))]
myData
#    itemCount  cost prediction profit otherProfit
#        <num> <num>      <num>  <num>       <num>
# 1:      3000   120        120  1.000   1.0162162
# 2:        20   118        100  1.180   0.9692308
# 3:        50   165        150  1.100   0.9735294
# 4:         9    93        120  0.775   1.0891892
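
For comparison, if you do want to keep the for loop, a minimal sketch of a corrected version could look like the following. It assumes the same myData as in the question; using set() for the assignment is my choice here, not something the original code did.

```r
library(data.table)

myData <- data.table(itemCount = c(3000, 20, 50, 9),
                     cost = c(120, 118, 165, 93),
                     prediction = c(120, 100, 150, 120))

# Pre-allocate the result column as numeric
myData[, diffinProfit := NA_real_]

# seq_len(nrow(myData)) yields 1, 2, ..., number of rows
for (observation in seq_len(nrow(myData))) {
  # Drop the current row and recompute the totals on the remainder
  myDataEx <- myData[-observation]
  totalEx <- myDataEx[, .(costTotal = sum(cost),
                          predictionTotal = sum(prediction))]
  # set() assigns by reference to a single cell
  set(myData, i = observation, j = "diffinProfit",
      value = totalEx$costTotal / totalEx$predictionTotal)
}
myData
```

The diffinProfit column should then hold the same values as otherProfit above.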

Although mathematically, it's possible to do it without a loop at all:

sumcost <- sum(myData$cost)
sumpred <- sum(myData$prediction)
myData[, profit2 := (sumcost-cost)/(sumpred-prediction)]
myData
#    itemCount  cost prediction profit otherProfit   profit2
#        <num> <num>      <num>  <num>       <num>     <num>
# 1:      3000   120        120  1.000   1.0162162 1.0162162
# 2:        20   118        100  1.180   0.9692308 0.9692308
# 3:        50   165        150  1.100   0.9735294 0.9735294
# 4:         9    93        120  0.775   1.0891892 1.0891892

I'm not going to benchmark 4 rows, but I'd be surprised if this second "vectorized" approach weren't more efficient than the sapply above or a for-loop alternative.
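
If you want to check that claim yourself, here is a rough sketch in base R. The data are synthetic (the size n and the uniform ranges are arbitrary assumptions), and the helper names loo_sapply and loo_vectorized are made up for this illustration.

```r
# Synthetic data; the size and distributions are arbitrary assumptions
set.seed(1)
n <- 2000
cost <- runif(n, 50, 200)
prediction <- runif(n, 50, 200)

# Loop-style leave-one-out ratio via sapply
loo_sapply <- function(cost, prediction) {
  sapply(seq_along(cost), function(z) sum(cost[-z]) / sum(prediction[-z]))
}

# Vectorized version: subtract each row from the precomputed totals
loo_vectorized <- function(cost, prediction) {
  (sum(cost) - cost) / (sum(prediction) - prediction)
}

# Both approaches should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(loo_sapply(cost, prediction),
                           loo_vectorized(cost, prediction))))

# Timings vary by machine, so no expected numbers are shown here
system.time(loo_sapply(cost, prediction))
system.time(loo_vectorized(cost, prediction))
```

The sapply version recomputes two sums of n - 1 elements for every row, so its work grows roughly quadratically with n, while the vectorized version only needs two sums in total.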

CodePudding user response:

You can make use of row ids and data.table's built-in .GRP group counter:

library(data.table)

myData <- data.table("itemCount" = c(3000, 20, 50, 9),
                     "cost" = c(120, 118, 165, 93), 
                     "prediction" = c(120, 100, 150, 120))

myData[, "profit" := cost/prediction]

# assign row ids
myData[, ID := .I]

# loop over each row and take all values that are not in the current row
# .GRP is a group identifier and since you loop over all rows, there are as many groups as rows
myData[, total_profit_excl := myData[ID != .GRP, sum(cost) / sum(prediction)],
       by = ID]
myData
#>    itemCount cost prediction profit ID total_profit_excl
#> 1:      3000  120        120  1.000  1         1.0162162
#> 2:        20  118        100  1.180  2         0.9692308
#> 3:        50  165        150  1.100  3         0.9735294
#> 4:         9   93        120  0.775  4         1.0891892