Home > database >  How to create a column with the average difference for a subset of columns?
How to create a column with the average difference for a subset of columns?

Time:10-19

I would like to generate a column with the average of differences between a subset of columns. Suppose I have the following dataset:

set.seed(123)
df <- data.frame( y = runif(1:10),
                  x1 = runif(1:10),
                  x2 = runif(1:10),
                  x3 = runif(1:10))
df[, "x1"][ df[, "x1"] < 0.5 ] <- NA
df[, "x3"][ df[, "x3"] > 0.7 ] <- NA

df
           y        x1        x2         x3
1  0.2875775 0.9568333 0.8895393         NA
2  0.7883051        NA 0.6928034         NA
3  0.4089769 0.6775706 0.6405068 0.69070528
4  0.8830174 0.5726334 0.9942698         NA
5  0.9404673        NA 0.6557058 0.02461368
6  0.0455565 0.8998250 0.7085305 0.47779597
7  0.5281055        NA 0.5440660         NA
8  0.8924190        NA 0.5941420 0.21640794
9  0.5514350        NA 0.2891597 0.31818101
10 0.4566147 0.9545036 0.1471136 0.23162579

I would like to get a column with the average for the difference between Xs and Y. So, in the example, I would like to get ((x1 - y) (x2 - y) (x3 - y))/(Number of X's). This gets a little complicated because of the missing values, not all rows will be calculated the same. For example, row 1 the value would compute ((x1 - y) (x2-y))/2, and row 7 it should just compute (x2 - y)/1, because there is only one value. How can I get this column? Let me know if you need more clarification.

CodePudding user response:

You can do this with rowMeans in the wide format, to take care of the NA issues:

rowMeans(df[c("x1","x2","x3")] - df[["y"]], na.rm=TRUE)

Ultimately though, I think this sort of x1/2/3 data is better off in a long-format with no NAs, so that you can do any calculations you want using group_by like logic without having to worry about excluding rows. Something like:

na.omit(cbind(rowid = seq_len(nrow(df)), stack(df[-1])))
#   rowid     values ind
#1      1 0.95683335  x1
#3      3 0.67757064  x1
#4      4 0.57263340  x1
#6      6 0.89982497  x1
#10    10 0.95450365  x1
#11     1 0.88953932  x2
#12     2 0.69280341  x2
# ...

CodePudding user response:

You could do:

library(dplyr)

df %>%
  rowwise() %>%
  mutate(result = mean(c_across(x1:x3) - y, na.rm = TRUE))

# A tibble: 10 × 5
# Rowwise: 
        y     x1    x2      x3  result
    <dbl>  <dbl> <dbl>   <dbl>   <dbl>
 1 0.288   0.957 0.890 NA       0.636 
 2 0.788  NA     0.693 NA      -0.0955
 3 0.409   0.678 0.641  0.691   0.261 
 4 0.883   0.573 0.994 NA      -0.0996
 5 0.940  NA     0.656  0.0246 -0.600 
 6 0.0456  0.900 0.709  0.478   0.650 
 7 0.528  NA     0.544 NA       0.0160
 8 0.892  NA     0.594  0.216  -0.487 
 9 0.551  NA     0.289  0.318  -0.248 
10 0.457   0.955 0.147  0.232  -0.0122

CodePudding user response:

Another solution using pmean from kit package:

df["result"] = kit::pmean(df[c("x1", "x2", "x3")]-df[["y"]], na.rm=TRUE)

           y        x1        x2         x3      result
1  0.2875775 0.9568333 0.8895393         NA  0.63560881
2  0.7883051        NA 0.6928034         NA -0.09550173
3  0.4089769 0.6775706 0.6405068 0.69070528  0.26061732
4  0.8830174 0.5726334 0.9942698         NA -0.09956581
5  0.9404673        NA 0.6557058 0.02461368 -0.60030754
6  0.0455565 0.8998250 0.7085305 0.47779597  0.64982730
7  0.5281055        NA 0.5440660         NA  0.01596054
8  0.8924190        NA 0.5941420 0.21640794 -0.48714407
9  0.5514350        NA 0.2891597 0.31818101 -0.24776464
10 0.4566147 0.9545036 0.1471136 0.23162579 -0.01220037

CodePudding user response:

data.table solution

setDT(df)

df[, result := rowMeans(.SD, na.rm = T) - y, .SDcols = 2:4]

results

           y      x1      x2       x3    result
 1: 0.287578 0.95683 0.88954       NA  0.635609
 2: 0.788305      NA 0.69280       NA -0.095502
 3: 0.408977 0.67757 0.64051 0.690705  0.260617
 4: 0.883017 0.57263 0.99427       NA -0.099566
 5: 0.940467      NA 0.65571 0.024614 -0.600308
 6: 0.045556 0.89982 0.70853 0.477796  0.649827
 7: 0.528105      NA 0.54407       NA  0.015961
 8: 0.892419      NA 0.59414 0.216408 -0.487144
 9: 0.551435      NA 0.28916 0.318181 -0.247765
10: 0.456615 0.95450 0.14711 0.231626 -0.012200
  • Related