Is it possible to count by using the count function within across()?

Hello R and tidyverse wizards,

I am trying to count the rows of the starwars data set to find out how many observations we get for the variables "height" and "mass". I managed to do it with this code:

library(tidyverse)

starwars %>%
  select(height, mass) %>%
  drop_na() %>%
  summarise(across(.cols = c(height, mass),
                   list(obs = ~ n(),
                        mean = mean,
                        sd = sd))) %>%
  View()

I would like to replace obs = ~ n() with the count function, and tried this version:

library(tidyverse)

starwars %>%
  select(height, mass) %>%
  drop_na() %>%
  summarise(across(.cols = c(height, mass),
                   list(obs = count,
                        mean = mean,
                        sd = sd))) %>%
  View()

but it was too simple to work, classic :p

I got this error message:

Error in View : Problem while computing ..1 = across(...)

And when I got rid of the View() function, I got another error message:

Error in summarise():
! Problem while computing ..1 = across(...).
Caused by error in across():
! Problem while computing column height_obs.
Caused by error in UseMethod():
! no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"

So, I have two questions:

Could someone please explain why the code worked with ~ n() but not with count?
Is it possible to use the count function instead of ~ n() in that case?

Sorry if it is a dumb question, but I am just trying to understand the across and count functions by playing with them.

CodePudding user response：

The function documentation says that "df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())", so I assume that using count() within across() amounts to something like a summarise nested inside another summarise, which is why n() is used instead.

Edit: The solution is in the comment by G. Grothendieck on the question "What is the difference between n() and count() in R? When should one favour the use of either or both?":

n() returns a number
count() returns a dataframe
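
To make the difference concrete, here is a minimal sketch that checks what each function returns on the starwars data. The variable names n_result and count_result are only illustrative.

```r
library(dplyr)

# n() is only valid inside a dplyr verb; here it yields a single integer.
n_result <- starwars %>%
  summarise(obs = n()) %>%
  pull(obs)

# count() is itself a verb: it takes a data frame and returns a data frame
# with a column named n.
count_result <- starwars %>% count()

class(n_result)
class(count_result)
```

Both hold the same number of rows, but only n_result is a plain value that summarise() can place in a single output column.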

CodePudding user response：

count() takes a dataframe as its first argument. It then returns the number of rows for each unique combination of values in the columns of that dataframe passed as additional arguments. For example:

library(dplyr)

count(starwars, mass, height)

When you put count() inside across(), it calls count() once per selected column, passing only that column and not the dataframe as the first argument. For the first column, this is equivalent to running:

count(starwars$height)

Because count() expects a dataframe as the first argument, it throws an error.
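
The failure can be reproduced outside across() with a short sketch like the one below. It assumes only dplyr is loaded; err and by_height are illustrative names.

```r
library(dplyr)

# Calling count() on a bare vector: S3 dispatch finds no method for it,
# so the call fails. tryCatch() captures the message instead of stopping.
err <- tryCatch(
  count(starwars$height),
  error = function(e) conditionMessage(e)
)
print(err)

# With the data frame as the first argument, the same column can be counted.
by_height <- count(starwars, height)
head(by_height)
```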

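n(), on the other hand, doesn’t take any arguments, and simply returns the number of rows in the current group (or in the whole dataframe if it is not grouped). You have to include the ~, which wraps n() in an anonymous function that receives each column but ignores it. Without the ~, across() would try passing each column to n(), which causes an error since n() doesn’t expect arguments.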
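
As a rough illustration of the role of the ~, the sketch below runs both variants on the same data as in the question. The names dat, ok and bad are made up for this example, and tidyr is loaded only for drop_na().

```r
library(dplyr)
library(tidyr)

dat <- starwars %>%
  select(height, mass) %>%
  drop_na()

# The lambda receives each column as .x, ignores it, and calls n()
# with no arguments.
ok <- dat %>%
  summarise(across(c(height, mass), list(obs = ~ n())))

# Passing n itself makes across() call n(<column>), which fails because
# n() accepts no arguments. tryCatch() turns the failure into a marker value.
bad <- tryCatch(
  dat %>% summarise(across(c(height, mass), list(obs = n))),
  error = function(e) "error"
)

ok
bad
```

Since the rows with missing values were dropped beforehand, height_obs and mass_obs are both equal to nrow(dat).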