I have a longitudal dataset, where the same subjects are measured at different occasions in time. For instance:
dd=data.frame(subject_id=c(1,1,1,2,2,2,3,3,4,5,6,7,8,8,9,9),income=c(rnorm(16,50000,250)))
I should write something able to tell me how many subjects have been counted only once, twice, three times,... In the example above, the number of subjects measured at only one occasion in time is 4, the number of subjects measured twice is 3,... That's my attempt for counting, for instance, how many subjects have been measured only twice:
library(dplyr)
s.two=dd %>% group_by(subject_id) %>% filter(n() == 2) %>% ungroup()
length(s.two$subject_id)/2
But since I have very heterogenous clusters (ranging from 1 to 24 observations per subject), this implies that I should write planty of rows. Is there something more efficient I can do?
The objective is to have a summary of the size of the clusters (and the cluster is the subject_id). For instance, let say I have 1,000 clusters. I wanna know, how many of them are made up of subjects observed just once, twice... And so, 50 out of 1000 clusters are made up of subjects observed just one occasion in time ; 300 out of 1000 clusters are made up of subjects observed just two occasions in time... With this info, I shall construct a table to add in my report
CodePudding user response:
You should use summarize
. After this you can still filter with filter(n == 2)
.
library(dplyr)
dd <- data.frame(
subject_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9),
income = c(rnorm(16, 50000, 250))
)
dd |>
group_by(subject_id) |>
summarise(n = n())
#> # A tibble: 9 × 2
#> subject_id n
#> <dbl> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 2
#> 4 4 1
#> 5 5 1
#> 6 6 1
#> 7 7 1
#> 8 8 2
#> 9 9 2
If you use mutate
instead of summarize
and filter then, you will get
dd |>
group_by(subject_id) |>
mutate(n = n()) |>
filter(n ==2)
subject_id income n
<dbl> <dbl> <int>
1 3 49675. 2
2 3 50306. 2
3 8 49879. 2
4 8 50202. 2
5 9 49783. 2
6 9 49834. 2
NEW EDIT Maybe you mean this:
dd |>
group_by(subject_id) |>
summarise(n = n()) |>
mutate(info = glue::glue(
'There are {n} times {subject_id} out of {max(subject_id)} groups')) |>
select(info)
# A tibble: 9 × 1
info
<glue>
1 There are 3 times 1 out of 9 groups
2 There are 3 times 2 out of 9 groups
3 There are 2 times 3 out of 9 groups
4 There are 1 times 4 out of 9 groups
5 There are 1 times 5 out of 9 groups
6 There are 1 times 6 out of 9 groups
7 There are 1 times 7 out of 9 groups
8 There are 2 times 8 out of 9 groups
9 There are 2 times 9 out of 9 groups
Next which would be @Ritchie Sacramento
's solution
dd |>
group_by(subject_id) |>
summarise(no_of_occurences = n()) |>
count(no_of_occurences)
# A tibble: 3 × 2
no_of_occurences n
<int> <int>
1 1 4
2 2 3
3 3 2