Standard deviation of average events per ID in R

Background

I've got this dataset d:

d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors=FALSE)

It's got 2 people (IDs) in it, and they each have some events.

The problem

I'm trying to get an average number (count) of events per person, along with a standard deviation for that average, all in one result (it can be a dataframe or not, doesn't matter).

In other words I'm looking for something like this:

| Mean |  SD  |
|------|------|
| 4.00 | 2.83 |

What I've tried

I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:

library(dplyr)

d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%
  summarise(ratio = mean(event))

# A tibble: 1 x 1
  ratio
  <dbl>
1     4

And here's the SD:

d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%  
  summarise(sd = sd(event))

# A tibble: 1 x 1
     sd
  <dbl>
1  2.83

But when I try to pipe them together like so...

d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%
  summarise(ratio = mean(event)) %>%
  summarise(sd = sd(event))

... I get an error:

Error in `h()`:
! Problem with `summarise()` column `sd`.
i `sd = sd(event)`.
x object 'event' not found

Any insight?

Answer

You have to combine the last two calls to summarise() into a single call. The only columns remaining after summarise() are the ones you named plus the grouping columns, so after your second summarise() (the one that computes ratio) the event column no longer exists, and sd(event) has nothing to work on.

library(dplyr)

d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors=FALSE)

d %>%
  group_by(ID) %>%
  # the next summarise will be within ID
  summarise(event = length(event)) %>% 
  # this summarise is overall
  summarise(sd = sd(event),
            ratio = mean(event))

#> # A tibble: 1 × 2
#>      sd ratio
#>   <dbl> <dbl>
#> 1  2.83     4
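
To see why the chained version fails, it can help to look at the column names after each step. This is only a minimal sketch using the same data; the names per_id and after_mean are introduced here for illustration and are not part of the original code.

```r
library(dplyr)

d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors = FALSE)

# Step 1: one row per ID; `event` now holds the per-ID count
per_id <- d %>%
  group_by(ID) %>%
  summarise(event = length(event))
names(per_id)      # columns: ID, event

# Step 2: summarising again keeps only the column named in the call
after_mean <- per_id %>%
  summarise(ratio = mean(event))
names(after_mean)  # column: ratio (no `event` left for sd() to use)
```

Since after_mean has only the ratio column, any further summarise() that refers to event has to fail, which is why both statistics need to be computed in the same call.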

Your original pipeline is a bit confusing because it reuses the name event for the per-ID counts, and it runs the first summarise() within groups and the second without grouping. The following is a little easier to read and gets the same result: count(ID) returns one row per ID with the count in a column named n.

d %>%
  count(ID) %>% 
  summarise(sd = sd(n),
            ratio = mean(n))

Created on 2022-05-25 by the reprex package (v2.0.1)
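
As a sanity check on the numbers, here is a rough base R sketch (no dplyr) that builds the Mean/SD result asked for in the question and recomputes the SD by hand. It assumes, as the output above suggests, that the wanted SD is the sample standard deviation of the per-ID counts, which is what sd() computes. The names counts, result and manual_sd are illustrative only.

```r
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors = FALSE)

# Number of events per ID: 6 for "a", 2 for "b"
counts <- as.vector(table(d$ID))

# Mean and sample SD (n - 1 denominator) of those counts, in one data frame
result <- data.frame(Mean = mean(counts), SD = sd(counts))
result

# The same SD by hand: sqrt(((6 - 4)^2 + (2 - 4)^2) / (2 - 1)) = sqrt(8)
manual_sd <- sqrt(sum((counts - mean(counts))^2) / (length(counts) - 1))
all.equal(manual_sd, result$SD)
```

With only two IDs the SD rests on a single degree of freedom, so it is worth treating the 2.83 as a check on the code rather than a stable estimate.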