Is there a function that counts the number of observations within unique groups and not the number of distinct groups as n_distinct()
does?
I'm summarising data with dplyr
and group_by()
, and I'm trying to calculate means of numbers of observations per a different grouping variable.
df<-data.frame(id=c('A', 'A', 'A', 'B', 'B', 'C','C','C'),
id.2=c('1', '2', '2', '1','1','1','2','2'),
v=c(sample(1:10, 8)))
df%>%
group_by(id.2)%>%
summarise(n.mean=mean(n_distinct(id)),
v.mean=mean(v))
# A tibble: 2 × 3
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 3 5
2 2 2 4.5
What I instead need:
id.2 n.mean v.mean
1 1 5
2 2 4.5
because for
id.2==1 n.mean is the mean of 1 observation for A, 2 for B, 1 observation for C,
> mean(1,2,1)
[1] 1
id.2==2 n.mean is the mean of 2 observations for A, 0 for B, 2 for C,
mean(2,0,2)
[1] 2
I tried grouping by group_by(id, id.2)
first to count the observations and then pass those counts on when grouping by only id.2 in a subsequent step, but that didn't work (though I probably just don't know how to implement this with dplyr as I'm not very experienced with tidyverse solutions)
CodePudding user response:
You are not using mean
correctly. mean(1, 2, 1)
ignores all but the first argument and therefore will return 1 no matter what other numbers are in the second and third positions. For id.2 == 1
, you'd want mean(c(1, 2, 1))
, which returns 1.333.
We can use table
to quickly calculate the frequencies of id
within each grouping of id.2
, and then take the mean of those. We can compute v.mean
in the same step.
library(tidyverse)
df %>%
group_by(id.2) %>%
summarize(
n.mean = mean(table(id)),
v.mean = mean(v)
)
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 2 6
Your example notes that id.2 == 2
does not have any values for id == B
. It is not clear whether your desired solution counts this as a zero-length category, or simply ignores it. The solution above ignores it. The following includes it as a zero-length category by first complete
-ing the input data (note new row #7, which has NA data):
df_complete <- complete(df, id.2, id)
id.2 id v
<chr> <chr> <int>
1 1 A 9
2 1 B 1
3 1 B 2
4 1 C 5
5 2 A 4
6 2 A 7
7 2 B NA
8 2 C 3
9 2 C 10
We can convert id
to factor data, which will force table
to preserve its unique levels even in groupings of zero length:
df_complete %>%
group_by(id.2) %>%
mutate(id = factor(id)) %>%
filter(!is.na(v)) %>%
summarize(
n.mean = mean(table(id)),
v.mean = mean(v, na.rm = T)
)
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Or an alternate recipe that does not rely on table
:
df_complete %>%
group_by(id.2, id) %>%
summarize(
n_rows = sum(!is.na(v)),
id_mean = mean(v)
) %>%
group_by(id.2) %>%
summarize(
n.mean = mean(n_rows),
v.mean = weighted.mean(id_mean, n_rows, na.rm = T)
)
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Note that when providing randomized example data, you should use set.seed
to control the randomization and ensure reproducibility. Here is what I used:
set.seed(0)
df<-data.frame(id=c('A', 'A', 'A', 'B', 'B', 'C','C','C'),
id.2=c('1', '2', '2', '1','1','1','2','2'),
v=c(sample(1:10, 8)))