Summarize many variables with a function of two variables

Time:10-03

The summarise_if function is very helpful for summarising several variables at once. Assume that I need the mean of every numeric variable in my dataset. I can use

library(dplyr)
df <- as_tibble(iris)
df %>% summarise_if(is.numeric, .fun = mean)

This works perfectly. But assume now that the function in .fun takes two arguments from the dataset (an example is weighted.mean, where the weight variable is Sepal.Length). I tried

df %>% summarise_if(is.numeric, .fun = function(x, w) weighted.mean(x, w), w = Sepal.Length)

The error was

Error in list2(...) : object 'Sepal.Length' not found

I suspect that R did not look for Sepal.Length in df but in the global environment. So I have to use

df %>% summarise_if(is.numeric, .fun = function(x, w) weighted.mean(x, w), w = df$Sepal.Length)
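One possible reading of the error, offered as a sketch rather than a statement about dplyr internals: the list2(...) in the message suggests that the extra arguments are collected as ordinary values before any column is looked at, so Sepal.Length is resolved like any other R variable. The helper collect_args() below is hypothetical and only mimics that behaviour with base R.

```r
# Hypothetical helper (not part of dplyr): it forces its extra arguments
# the way an ordinary R function does, knowing nothing about data frame columns.
collect_args <- function(...) list(...)

# With no object named Sepal.Length in the calling environment, this fails
msg <- tryCatch(
  collect_args(w = Sepal.Length),
  error = function(e) conditionMessage(e)
)
msg

# Supplying the vector explicitly works, which is what df$Sepal.Length does
args <- collect_args(w = iris$Sepal.Length)
length(args$w)
```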

This works, but writing df$Sepal.Length is not a good solution. For example, it makes it impossible for me to compute the weighted mean by group.

df %>% group_by(Species) %>% summarise_if(is.numeric, .fun = function(x, w) weighted.mean(x, w), w = df$Sepal.Length)

Error: Problem with summarise() column Sepal.Length.
ℹ Sepal.Length = (function (x, w) ....
x 'x' and 'w' must have the same length
ℹ The error occurred in group 1: Species = setosa.
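The length mismatch reported above can be checked directly. As a small sketch (the names n_weights and group_sizes are introduced here only for illustration): df$Sepal.Length always carries the weights for the whole data frame, while each group passes only its own rows as x.

```r
library(dplyr)

df <- as_tibble(iris)

# The weight vector taken from the whole data frame
n_weights <- length(df$Sepal.Length)

# Number of rows that each group contributes as x
group_sizes <- df %>% count(Species)

n_weights     # 150
group_sizes   # 50 rows for each of the three species
```

Since weighted.mean requires x and w to have the same length, 50 values of x cannot be paired with 150 weights.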

So, how can I use summarise_if or summarise_at with functions that involve two variables from the dataset?

CodePudding user response:

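If we need to use Sepal.Length as w, combine (with c) the selection from where(is.numeric) with -Sepal.Length so that the weight column itself is excluded from across, then apply weighted.mean to the remaining numeric columns with w = Sepal.Length. Because the lambda is evaluated inside summarise, Sepal.Length refers to the column in the data (restricted to the current group when the data is grouped).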

library(dplyr)
df %>% 
   summarise(across(c(where(is.numeric), -Sepal.Length), 
        ~ weighted.mean(., w = Sepal.Length)))
# A tibble: 1 × 3
  Sepal.Width Petal.Length Petal.Width
        <dbl>        <dbl>       <dbl>
1        3.05         3.97        1.29

The grouped version would be

df %>%
   group_by(Species) %>% 
   summarise(across(c(where(is.numeric), -Sepal.Length), 
        ~ weighted.mean(., w = Sepal.Length)))

Output:

# A tibble: 3 × 4
  Species    Sepal.Width Petal.Length Petal.Width
  <fct>            <dbl>        <dbl>       <dbl>
1 setosa            3.45         1.47       0.248
2 versicolor        2.78         4.29       1.34 
3 virginica         2.99         5.60       2.03 
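As a sanity check, one cell of the grouped result can be recomputed with base R. This is only a sketch; the names grouped, setosa and by_hand are introduced here for illustration.

```r
library(dplyr)

df <- as_tibble(iris)

# Grouped weighted means, as in the answer above
grouped <- df %>%
  group_by(Species) %>%
  summarise(across(c(where(is.numeric), -Sepal.Length),
                   ~ weighted.mean(.x, w = Sepal.Length)))

# Base R cross-check for one group and one column
setosa <- subset(iris, Species == "setosa")
by_hand <- weighted.mean(setosa$Sepal.Width, w = setosa$Sepal.Length)

# Compare the dplyr value with the hand-computed one
all.equal(grouped$Sepal.Width[grouped$Species == "setosa"], by_hand)
```

If the two values agree, the weights were taken from the rows of each group rather than from the whole data frame.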

NOTE: the _if, _at and _all suffix functions are superseded in favor of across
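To illustrate the note, the plain mean example from the question can be written with across as well. A minimal sketch, with old_style and new_style as illustrative names:

```r
library(dplyr)

df <- as_tibble(iris)

# Scoped verb, as used in the question
old_style <- df %>% summarise_if(is.numeric, mean)

# The same summary written with across()
new_style <- df %>% summarise(across(where(is.numeric), mean))

# Compare the two results column by column
all.equal(unlist(old_style), unlist(new_style))
```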
