How to filter only subjects observed more than once in panel data with R?-CodePudding

all. Hope this isn't a duplicate question, I couldn't find the same question when I searched. I am analyzing panel data with R now, and the data format is as follows.

pid wave edu marri rela age  apt sido dongy urban stat1 stat2 exer dep3 bmi  mmse
1 3122    1   2  <NA>   NA  NA <NA>   NA  <NA>  <NA>  <NA>  <NA> <NA> <NA>  NA <NA>
2 3122    1  NA     1   NA  NA <NA>   NA  <NA>  <NA>  <NA>  <NA> <NA> <NA>  NA <NA>
3 3122    1  NA  <NA>    3  NA <NA>   NA  <NA>  <NA>  <NA>  <NA> <NA> <NA>  NA <NA>
4 3122    1  NA  <NA>   NA  71 <NA>   NA  <NA>  <NA>  <NA>  <NA> <NA> <NA>  NA <NA>
5 3122    1  NA  <NA>   NA  NA    1   NA  <NA>  <NA>  <NA>  <NA> <NA> <NA>  NA <NA>
6 3122    1  NA  <NA>   NA  NA <NA>   11  <NA>  <NA>  <NA>  <NA> <NA> <NA>  NA <NA>

The data are repeated measurements, and there are many missing values. If only the observed values are left at every year, the loss of the number is large, so I want to select and analyze only subjects who have been measured more than once among the 'mmse' variables.

I tried to check the change of the variable of interest through the following code, but it didn't work.

df %>%
  arrange(pid, wave) %>%
  group_by(pid) %>%
  mutate(
    mmse_change = mmse - lag(mmse),
    mmse_increase = mmse_change > 0,
    mmse_decrease = mmse_change < 0
  )

I need the above object to analyze the baseline characteristic. Please help me on how to extract subjects with this condition.

CodePudding user response：

We could do something like this:

df %>%
  filter(!is.na(mmse)) %>%   # just keep rows with non-NA in mmse
  count(pid) %>%             # count how many observations per pid
  filter(n > 1) %>%          # keep those pid's appearing more than once
  select(pid) %>%            # just keep the pid column
  left_join(df)              # get `df` for just those pid's

CodePudding user response：

Another approach without join is to group_by(pid) and then filter all groups where max(row_number()) > 1.

Below I changed your initial data so that it can be used for this problem (your original data has only NAs in mmse and please put your data in reproducible code next).

library(tidyverse)

# initial data slightly changed:
df <- tribble(~pid, ~wave, ~edu, ~marri, ~rela, ~age, ~apt, ~sido, ~dongy, ~urban, ~stat1, ~stat2, ~exer, ~dep3, ~bmi,  ~mmse,
3122 ,   1,   2,  NA,   NA,  NA, NA,   NA,  NA,  NA,  NA,  NA, NA, NA,  NA, 1,
3122 ,   1,  NA,   1,   NA,  NA, NA,   NA,  NA,  NA,  NA,  NA, NA, NA,  NA, NA,
3122 ,   1,  NA,  NA,    3,  NA, NA,   NA,  NA,  NA,  NA,  NA, NA, NA,  NA, 2,
3122 ,   1,  NA,  NA,   NA,  71, NA,   NA,  NA,  NA,  NA,  NA, NA, NA,  NA, NA,
3122 ,   1,  NA,  NA,   NA,  NA,  1,   NA,  NA,  NA,  NA,  NA, NA, NA,  NA, 3,
3124 ,   1,  NA,  NA,   NA,  NA, NA,   11,  NA,  NA,  NA,  NA, NA, NA,  NA, 5)

df %>% 
  filter(!is.na(mmse)) %>%
  group_by(pid) %>% 
  filter(max(row_number()) > 1) %>% 
  ungroup()

#> # A tibble: 3 x 16
#>     pid  wave   edu marri  rela   age   apt  sido dongy urban stat1 stat2 exer 
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1  3122     1     2    NA    NA    NA    NA    NA NA    NA    NA    NA    NA   
#> 2  3122     1    NA    NA     3    NA    NA    NA NA    NA    NA    NA    NA   
#> 3  3122     1    NA    NA    NA    NA     1    NA NA    NA    NA    NA    NA   
#> # ... with 3 more variables: dep3 <lgl>, bmi <lgl>, mmse <dbl>

^{Created on 2022-09-21 by the reprex package (v2.0.1)}