Home > Back-end >  Count occasion measurement times in longitudinal data
Count occasion measurement times in longitudinal data

Time:01-11

I have a longitudal dataset, where the same subjects are measured at different occasions in time. For instance:

dd=data.frame(subject_id=c(1,1,1,2,2,2,3,3,4,5,6,7,8,8,9,9),income=c(rnorm(16,50000,250)))

I should write something able to tell me how many subjects have been counted only once, twice, three times,... In the example above, the number of subjects measured at only one occasion in time is 4, the number of subjects measured twice is 3,... That's my attempt for counting, for instance, how many subjects have been measured only twice:

library(dplyr)
s.two=dd %>%   group_by(subject_id) %>%   filter(n() == 2) %>%   ungroup()
length(s.two$subject_id)/2

But since I have very heterogenous clusters (ranging from 1 to 24 observations per subject), this implies that I should write planty of rows. Is there something more efficient I can do?

The objective is to have a summary of the size of the clusters (and the cluster is the subject_id). For instance, let say I have 1,000 clusters. I wanna know, how many of them are made up of subjects observed just once, twice... And so, 50 out of 1000 clusters are made up of subjects observed just one occasion in time ; 300 out of 1000 clusters are made up of subjects observed just two occasions in time... With this info, I shall construct a table to add in my report

CodePudding user response:

You should use summarize. After this you can still filter with filter(n == 2).

library(dplyr)

dd <- data.frame(
  subject_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9),
  income = c(rnorm(16, 50000, 250))
)


dd |>
  group_by(subject_id) |>
  summarise(n = n())
#> # A tibble: 9 × 2
#>   subject_id     n
#>        <dbl> <int>
#> 1          1     3
#> 2          2     3
#> 3          3     2
#> 4          4     1
#> 5          5     1
#> 6          6     1
#> 7          7     1
#> 8          8     2
#> 9          9     2

If you use mutate instead of summarize and filter then, you will get

dd |>
  group_by(subject_id) |>
  mutate(n = n()) |> 
  filter(n ==2)

subject_id income     n
       <dbl>  <dbl> <int>
1          3 49675.     2
2          3 50306.     2
3          8 49879.     2
4          8 50202.     2
5          9 49783.     2
6          9 49834.     2

NEW EDIT Maybe you mean this:

dd |>
  group_by(subject_id) |> 
  summarise(n = n()) |> 
  mutate(info = glue::glue(
    'There are {n} times {subject_id} out of {max(subject_id)} groups')) |> 
  select(info)

# A tibble: 9 × 1
  info                               
  <glue>                             
1 There are 3 times 1 out of 9 groups
2 There are 3 times 2 out of 9 groups
3 There are 2 times 3 out of 9 groups
4 There are 1 times 4 out of 9 groups
5 There are 1 times 5 out of 9 groups
6 There are 1 times 6 out of 9 groups
7 There are 1 times 7 out of 9 groups
8 There are 2 times 8 out of 9 groups
9 There are 2 times 9 out of 9 groups

Next which would be @Ritchie Sacramento 's solution

dd |>
  group_by(subject_id) |> 
  summarise(no_of_occurences = n()) |> 
  count(no_of_occurences) 

# A tibble: 3 × 2
  no_of_occurences     n
             <int> <int>
1                1     4
2                2     3
3                3     2
  • Related