I would like to generate a column with the average of differences between a subset of columns. Suppose I have the following dataset:
set.seed(123)
df <- data.frame( y = runif(1:10),
x1 = runif(1:10),
x2 = runif(1:10),
x3 = runif(1:10))
df[, "x1"][ df[, "x1"] < 0.5 ] <- NA
df[, "x3"][ df[, "x3"] > 0.7 ] <- NA
df
y x1 x2 x3
1 0.2875775 0.9568333 0.8895393 NA
2 0.7883051 NA 0.6928034 NA
3 0.4089769 0.6775706 0.6405068 0.69070528
4 0.8830174 0.5726334 0.9942698 NA
5 0.9404673 NA 0.6557058 0.02461368
6 0.0455565 0.8998250 0.7085305 0.47779597
7 0.5281055 NA 0.5440660 NA
8 0.8924190 NA 0.5941420 0.21640794
9 0.5514350 NA 0.2891597 0.31818101
10 0.4566147 0.9545036 0.1471136 0.23162579
I would like to get a column with the average for the difference between Xs and Y. So, in the example, I would like to get ((x1 - y) (x2 - y) (x3 - y))/(Number of X's). This gets a little complicated because of the missing values, not all rows will be calculated the same. For example, row 1 the value would compute ((x1 - y) (x2-y))/2, and row 7 it should just compute (x2 - y)/1, because there is only one value. How can I get this column? Let me know if you need more clarification.
CodePudding user response:
You can do this with rowMeans
in the wide format, to take care of the NA
issues:
rowMeans(df[c("x1","x2","x3")] - df[["y"]], na.rm=TRUE)
Ultimately though, I think this sort of x1/2/3
data is better off in a long-format with no NA
s, so that you can do any calculations you want using group_by
like logic without having to worry about excluding rows. Something like:
na.omit(cbind(rowid = seq_len(nrow(df)), stack(df[-1])))
# rowid values ind
#1 1 0.95683335 x1
#3 3 0.67757064 x1
#4 4 0.57263340 x1
#6 6 0.89982497 x1
#10 10 0.95450365 x1
#11 1 0.88953932 x2
#12 2 0.69280341 x2
# ...
CodePudding user response:
You could do:
library(dplyr)
df %>%
rowwise() %>%
mutate(result = mean(c_across(x1:x3) - y, na.rm = TRUE))
# A tibble: 10 × 5
# Rowwise:
y x1 x2 x3 result
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.288 0.957 0.890 NA 0.636
2 0.788 NA 0.693 NA -0.0955
3 0.409 0.678 0.641 0.691 0.261
4 0.883 0.573 0.994 NA -0.0996
5 0.940 NA 0.656 0.0246 -0.600
6 0.0456 0.900 0.709 0.478 0.650
7 0.528 NA 0.544 NA 0.0160
8 0.892 NA 0.594 0.216 -0.487
9 0.551 NA 0.289 0.318 -0.248
10 0.457 0.955 0.147 0.232 -0.0122
CodePudding user response:
Another solution using pmean
from kit
package:
df["result"] = kit::pmean(df[c("x1", "x2", "x3")]-df[["y"]], na.rm=TRUE)
y x1 x2 x3 result
1 0.2875775 0.9568333 0.8895393 NA 0.63560881
2 0.7883051 NA 0.6928034 NA -0.09550173
3 0.4089769 0.6775706 0.6405068 0.69070528 0.26061732
4 0.8830174 0.5726334 0.9942698 NA -0.09956581
5 0.9404673 NA 0.6557058 0.02461368 -0.60030754
6 0.0455565 0.8998250 0.7085305 0.47779597 0.64982730
7 0.5281055 NA 0.5440660 NA 0.01596054
8 0.8924190 NA 0.5941420 0.21640794 -0.48714407
9 0.5514350 NA 0.2891597 0.31818101 -0.24776464
10 0.4566147 0.9545036 0.1471136 0.23162579 -0.01220037
CodePudding user response:
data.table solution
setDT(df)
df[, result := rowMeans(.SD, na.rm = T) - y, .SDcols = 2:4]
results
y x1 x2 x3 result
1: 0.287578 0.95683 0.88954 NA 0.635609
2: 0.788305 NA 0.69280 NA -0.095502
3: 0.408977 0.67757 0.64051 0.690705 0.260617
4: 0.883017 0.57263 0.99427 NA -0.099566
5: 0.940467 NA 0.65571 0.024614 -0.600308
6: 0.045556 0.89982 0.70853 0.477796 0.649827
7: 0.528105 NA 0.54407 NA 0.015961
8: 0.892419 NA 0.59414 0.216408 -0.487144
9: 0.551435 NA 0.28916 0.318181 -0.247765
10: 0.456615 0.95450 0.14711 0.231626 -0.012200