Why is my summarize function not working for certain years? R studio-CodePudding

This is my first time using R so I was given a starter code for the project. The summary function that was provided for the year 2000 is:

summaries <- data_south_west %>% group_by(Region) %>% 
  summarize(mean00 = mean(`2000`), sd00 = sd(`2000`), median00 = median(`2000`), iqr00 = IQR(`2000`))

I copied the function as instructed and edited the function:

summarize(mean05 = mean(`2005`), sd05 = sd(`2005`), median05 = median(`2005`), iqr05 = IQR(`2005`))

Why is the year 2000 findable, but 2005 isn't?

CodePudding user response：

The issue is that you code is syntactically wrong. First, to replicate your issue I use some random example data:

library(dplyr)

# Random example data
set.seed(123)

data_south_west <- data.frame(
  Region = sample(c("A", "B"), 100, replace = TRUE),
  "2000" = runif(100),
  "2005" = runif(100),
  check.names = FALSE
)

Next, running your code I get the same error you mentioned.

data_south_west %>% 
  group_by(Region) %>% 
  summarize(mean00 = mean(`2000`), sd00 = sd(`2000`), median00 = median(`2000`), iqr00 = IQR(`2000`))
#> # A tibble: 2 × 5
#>   Region mean00  sd00 median00 iqr00
#>   <chr>   <dbl> <dbl>    <dbl> <dbl>
#> 1 A       0.492 0.261    0.467 0.427
#> 2 B       0.543 0.268    0.572 0.391
  summarize(mean05 = mean(`2005`), sd05 = sd(`2005`), median05 = median(`2005`), iqr05 = IQR(`2005`))
#> Error in mean(`2005`): object '2005' not found

The reason is that to connect multiple summarise or other dplyr verbs you have to make use of the pipe operator %>% to pipe the result from one statement into the next. However, in your case simply piping will not fix the issue because after summarizing the result is a dataframe or tibble which no longer has a column named 2005.

Instead, if you want to summarise multiple columns you have to do so in one summarise call, i.e. do:

data_south_west %>% 
  group_by(Region) %>% 
  summarize(mean00 = mean(`2000`), sd00 = sd(`2000`), median00 = median(`2000`), iqr00 = IQR(`2000`),
            mean05 = mean(`2005`), sd05 = sd(`2005`), median05 = median(`2005`), iqr05 = IQR(`2005`))
#> # A tibble: 2 × 9
#>   Region mean00  sd00 median00 iqr00 mean05  sd05 median05 iqr05
#>   <chr>   <dbl> <dbl>    <dbl> <dbl>  <dbl> <dbl>    <dbl> <dbl>
#> 1 A       0.492 0.261    0.467 0.427  0.492 0.287    0.503 0.486
#> 2 B       0.543 0.268    0.572 0.391  0.478 0.307    0.472 0.536

However, a less verbose approach to achieve the same result would be to use dplyr::across which allows to apply the same set of functions to multiple data columns. Here I make use of the .name argument to name the columns after summarizing according to your example code.

data_south_west %>% 
  group_by(Region) %>% 
  summarize(across(c(`2000`, `2005`), 
                   .fns = list(mean = mean, sd = sd, median = median, iqr = IQR),
                   .names = "{.fn}{substr(.col, 3, 4)}"
  ))
#> # A tibble: 2 × 9
#>   Region mean00  sd00 median00 iqr00 mean05  sd05 median05 iqr05
#>   <chr>   <dbl> <dbl>    <dbl> <dbl>  <dbl> <dbl>    <dbl> <dbl>
#> 1 A       0.492 0.261    0.467 0.427  0.492 0.287    0.503 0.486
#> 2 B       0.543 0.268    0.572 0.391  0.478 0.307    0.472 0.536

CodePudding user response：

Your first command works while the second doesn't, because you didn't feed any data into summarize() in the second command.

The first example may span two lines, but it is one command:

summaries <- data_south_west %>% group_by(Region) %>% 
  summarize(mean00 = mean(`2000`), sd00 = sd(`2000`), median00 = median(`2000`), iqr00 = IQR(`2000`))

The different parts of the command are connected through the assignment operator <- and through pipes %>%.

Pipes in R

What a pipe does, is take the output from the code on one of its sides, and provide it as the first argument for the code on the other side. In the case of the %>% pipe, it takes the output from the left, and provides it as the first argument to the code on the right.

Try running the following example:

"hello" %>% paste("world") %>% print()

It is essentially the same code as print(paste("hello", "world").

The great benefit of using pipes is that you can use them to chain commands one after another, instead of inside of each other. This makes long commands easier to read, more intuitive to interpret (the computer executes them left to right instead of inside to outside), and easier to edit.

The %>% pipe comes from the magrittr package, and is what you'll see in use the most right now. Base R also introduced a pipe recently: since 4.1 you can use |>.

Back to Question

Explaining the working code

In this code:

summaries <- data_south_west %>% group_by(Region) %>% 
  summarize(mean00 = mean(`2000`), sd00 = sd(`2000`), median00 = median(`2000`), iqr00 = IQR(`2000`))

the first pipe takes the contents of data_south_west, feeds them into group_by() as its first argument, which then becomes group_by(data_south_west, Region).

The second pipe takes the result from group_by() and feeds it into summarize(), which becomes:

summarize(<the_result_of_group_by>, mean00 = mean(`2000`), sd00 = sd(`2000`), median00 = median(`2000`), iqr00 = IQR(`2000`))

Finally after summarize() is done, the result of that is put into summaries by <-.

Error

The following code:

summarize(mean05 = mean(`2005`), sd05 = sd(`2005`), median05 = median(`2005`), iqr05 = IQR(`2005`))

is different from the working code, as the first argument is lacking. You can either provide it in place, or through a pipe:

summarize(data_south_west, mean05 = mean(`2005`), sd05 = sd(`2005`), median05 = median(`2005`), iqr05 = IQR(`2005`))

data_south_west %>% summarize( mean05 = mean(`2005`), sd05 = sd(`2005`), median05 = median(`2005`), iqr05 = IQR(`2005`))