Count number of observations per distinct group inside summarise with dplyr (n_distinct equivalent?)


Is there a function that counts the number of observations within unique groups and not the number of distinct groups as n_distinct() does?

I'm summarising data with dplyr and group_by(), and I'm trying to calculate the mean number of observations per group of a different grouping variable.

df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'),
                 v = sample(1:10, 8))

df %>%
  group_by(id.2) %>%
  summarise(n.mean = mean(n_distinct(id)),
            v.mean = mean(v))

# A tibble: 2 × 3
  id.2  n.mean v.mean
  <chr>  <dbl>  <dbl>
1 1          3    5  
2 2          2    4.5

What I instead need:

id.2  n.mean v.mean
1          1    5  
2          2    4.5

because for id.2 == 1, n.mean is the mean of 1 observation for A, 2 for B, and 1 for C:

> mean(1,2,1)
[1] 1

and for id.2 == 2, n.mean is the mean of 2 observations for A, 0 for B, and 2 for C:

> mean(2,0,2)
[1] 2

I tried grouping with group_by(id, id.2) first to count the observations, then passing those counts on in a subsequent step grouped by only id.2, but that didn't work (though I probably just don't know how to implement this with dplyr, as I'm not very experienced with tidyverse solutions).
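For reference, a minimal sketch of that two-step idea using dplyr's count() (shorthand for group_by() + summarise(n = n())); note this handles only the counts, since count() drops v:

```r
library(dplyr)

df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'))

# Count observations per (id, id.2) pair, then average those counts
# within each id.2 group. Pairs with zero observations (e.g. id == 'B'
# within id.2 == 2) never appear in the counts, so they are ignored.
df %>%
  count(id, id.2) %>%          # columns: id, id.2, n
  group_by(id.2) %>%
  summarise(n.mean = mean(n))  # id.2 == 1 -> 1.33, id.2 == 2 -> 2
```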

CodePudding user response:

You are not using mean correctly. mean(1, 2, 1) ignores all but the first argument and therefore will return 1 no matter what other numbers are in the second and third positions. For id.2 == 1, you'd want mean(c(1, 2, 1)), which returns 1.333.
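To see the difference in base R:

```r
# mean() averages a single numeric vector; extra positional arguments
# are matched to its trim and na.rm parameters, not averaged.
mean(1, 2, 1)       # -> 1 (only the first argument is averaged)
mean(c(1, 2, 1))    # -> 1.333333
```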

We can use table to quickly calculate the frequencies of id within each grouping of id.2, and then take the mean of those. We can compute v.mean in the same step.

library(tidyverse)

df %>% 
  group_by(id.2) %>% 
  summarize(
    n.mean = mean(table(id)),
    v.mean = mean(v)
  ) 

  id.2  n.mean v.mean
  <chr>  <dbl>  <dbl>
1 1       1.33   4.25
2 2       2      6  

Your example notes that id.2 == 2 has no values for id == B. It is not clear whether your desired solution counts this as a zero-length category or simply ignores it. The solution above ignores it. The following includes it as a zero-length category by first complete()-ing the input data (note the new row #7, which has an NA value for v):

df_complete <- complete(df, id.2, id)

  id.2  id        v
  <chr> <chr> <int>
1 1     A         9
2 1     B         1
3 1     B         2
4 1     C         5
5 2     A         4
6 2     A         7
7 2     B        NA
8 2     C         3
9 2     C        10

We can convert id to factor data, which will force table to preserve its unique levels even in groupings of zero length:

df_complete %>% 
  group_by(id.2) %>% 
  mutate(id = factor(id)) %>% 
  filter(!is.na(v)) %>% 
  summarize(
    n.mean = mean(table(id)),
    v.mean = mean(v, na.rm = TRUE)
  )

  id.2  n.mean v.mean
  <chr>  <dbl>  <dbl>
1 1       1.33   4.25
2 2       1.33   6  

Or an alternate recipe that does not rely on table:

df_complete %>% 
  group_by(id.2, id) %>% 
  summarize(
    n_rows = sum(!is.na(v)), 
    id_mean = mean(v)
  ) %>% 
  group_by(id.2) %>% 
  summarize(
    n.mean = mean(n_rows),
    v.mean = weighted.mean(id_mean, n_rows, na.rm = TRUE)
  )

  id.2  n.mean v.mean
  <chr>  <dbl>  <dbl>
1 1       1.33   4.25
2 2       1.33   6   

Note that when providing randomized example data, you should use set.seed to control the randomization and ensure reproducibility. Here is what I used:

set.seed(0)
df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'),
                 v = sample(1:10, 8))