How to create a column with the average difference for a subset of columns?-CodePudding

I would like to generate a column with the average of differences between a subset of columns. Suppose I have the following dataset:

set.seed(123)
df <- data.frame( y = runif(1:10),
                  x1 = runif(1:10),
                  x2 = runif(1:10),
                  x3 = runif(1:10))
df[, "x1"][ df[, "x1"] < 0.5 ] <- NA
df[, "x3"][ df[, "x3"] > 0.7 ] <- NA

df
           y        x1        x2         x3
1  0.2875775 0.9568333 0.8895393         NA
2  0.7883051        NA 0.6928034         NA
3  0.4089769 0.6775706 0.6405068 0.69070528
4  0.8830174 0.5726334 0.9942698         NA
5  0.9404673        NA 0.6557058 0.02461368
6  0.0455565 0.8998250 0.7085305 0.47779597
7  0.5281055        NA 0.5440660         NA
8  0.8924190        NA 0.5941420 0.21640794
9  0.5514350        NA 0.2891597 0.31818101
10 0.4566147 0.9545036 0.1471136 0.23162579

I would like to get a column with the average for the difference between Xs and Y. So, in the example, I would like to get ((x1 - y) (x2 - y) (x3 - y))/(Number of X's). This gets a little complicated because of the missing values, not all rows will be calculated the same. For example, row 1 the value would compute ((x1 - y) (x2-y))/2, and row 7 it should just compute (x2 - y)/1, because there is only one value. How can I get this column? Let me know if you need more clarification.

CodePudding user response：

You can do this with rowMeans in the wide format, to take care of the NA issues:

rowMeans(df[c("x1","x2","x3")] - df[["y"]], na.rm=TRUE)

Ultimately though, I think this sort of x1/2/3 data is better off in a long-format with no NAs, so that you can do any calculations you want using group_by like logic without having to worry about excluding rows. Something like:

na.omit(cbind(rowid = seq_len(nrow(df)), stack(df[-1])))
#   rowid     values ind
#1      1 0.95683335  x1
#3      3 0.67757064  x1
#4      4 0.57263340  x1
#6      6 0.89982497  x1
#10    10 0.95450365  x1
#11     1 0.88953932  x2
#12     2 0.69280341  x2
# ...

CodePudding user response：

You could do:

library(dplyr)

df %>%
  rowwise() %>%
  mutate(result = mean(c_across(x1:x3) - y, na.rm = TRUE))

# A tibble: 10 × 5
# Rowwise: 
        y     x1    x2      x3  result
    <dbl>  <dbl> <dbl>   <dbl>   <dbl>
 1 0.288   0.957 0.890 NA       0.636 
 2 0.788  NA     0.693 NA      -0.0955
 3 0.409   0.678 0.641  0.691   0.261 
 4 0.883   0.573 0.994 NA      -0.0996
 5 0.940  NA     0.656  0.0246 -0.600 
 6 0.0456  0.900 0.709  0.478   0.650 
 7 0.528  NA     0.544 NA       0.0160
 8 0.892  NA     0.594  0.216  -0.487 
 9 0.551  NA     0.289  0.318  -0.248 
10 0.457   0.955 0.147  0.232  -0.0122

CodePudding user response：

Another solution using pmean from kit package:

df["result"] = kit::pmean(df[c("x1", "x2", "x3")]-df[["y"]], na.rm=TRUE)

           y        x1        x2         x3      result
1  0.2875775 0.9568333 0.8895393         NA  0.63560881
2  0.7883051        NA 0.6928034         NA -0.09550173
3  0.4089769 0.6775706 0.6405068 0.69070528  0.26061732
4  0.8830174 0.5726334 0.9942698         NA -0.09956581
5  0.9404673        NA 0.6557058 0.02461368 -0.60030754
6  0.0455565 0.8998250 0.7085305 0.47779597  0.64982730
7  0.5281055        NA 0.5440660         NA  0.01596054
8  0.8924190        NA 0.5941420 0.21640794 -0.48714407
9  0.5514350        NA 0.2891597 0.31818101 -0.24776464
10 0.4566147 0.9545036 0.1471136 0.23162579 -0.01220037

CodePudding user response：

data.table solution

setDT(df)

df[, result := rowMeans(.SD, na.rm = T) - y, .SDcols = 2:4]

results

           y      x1      x2       x3    result
 1: 0.287578 0.95683 0.88954       NA  0.635609
 2: 0.788305      NA 0.69280       NA -0.095502
 3: 0.408977 0.67757 0.64051 0.690705  0.260617
 4: 0.883017 0.57263 0.99427       NA -0.099566
 5: 0.940467      NA 0.65571 0.024614 -0.600308
 6: 0.045556 0.89982 0.70853 0.477796  0.649827
 7: 0.528105      NA 0.54407       NA  0.015961
 8: 0.892419      NA 0.59414 0.216408 -0.487144
 9: 0.551435      NA 0.28916 0.318181 -0.247765
10: 0.456615 0.95450 0.14711 0.231626 -0.012200