all. Hope this isn't a duplicate question, I couldn't find the same question when I searched. I am analyzing panel data with R now, and the data format is as follows.
pid wave edu marri rela age apt sido dongy urban stat1 stat2 exer dep3 bmi mmse
1 3122 1 2 <NA> NA NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
2 3122 1 NA 1 NA NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
3 3122 1 NA <NA> 3 NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
4 3122 1 NA <NA> NA 71 <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
5 3122 1 NA <NA> NA NA 1 NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
6 3122 1 NA <NA> NA NA <NA> 11 <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
The data are repeated measurements, and there are many missing values. If only the observed values are left at every year, the loss of the number is large, so I want to select and analyze only subjects who have been measured more than once among the 'mmse' variables.
I tried to check the change of the variable of interest through the following code, but it didn't work.
df %>%
arrange(pid, wave) %>%
group_by(pid) %>%
mutate(
mmse_change = mmse - lag(mmse),
mmse_increase = mmse_change > 0,
mmse_decrease = mmse_change < 0
)
I need the above object to analyze the baseline characteristic. Please help me on how to extract subjects with this condition.
CodePudding user response:
We could do something like this:
df %>%
filter(!is.na(mmse)) %>% # just keep rows with non-NA in mmse
count(pid) %>% # count how many observations per pid
filter(n > 1) %>% # keep those pid's appearing more than once
select(pid) %>% # just keep the pid column
left_join(df) # get `df` for just those pid's
CodePudding user response:
Another approach without join is to group_by(pid)
and then filter
all groups where max(row_number()) > 1
.
Below I changed your initial data so that it can be used for this problem (your original data has only NA
s in mmse
and please put your data in reproducible code next).
library(tidyverse)
# initial data slightly changed:
df <- tribble(~pid, ~wave, ~edu, ~marri, ~rela, ~age, ~apt, ~sido, ~dongy, ~urban, ~stat1, ~stat2, ~exer, ~dep3, ~bmi, ~mmse,
3122 , 1, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1,
3122 , 1, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
3122 , 1, NA, NA, 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2,
3122 , 1, NA, NA, NA, 71, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
3122 , 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 3,
3124 , 1, NA, NA, NA, NA, NA, 11, NA, NA, NA, NA, NA, NA, NA, 5)
df %>%
filter(!is.na(mmse)) %>%
group_by(pid) %>%
filter(max(row_number()) > 1) %>%
ungroup()
#> # A tibble: 3 x 16
#> pid wave edu marri rela age apt sido dongy urban stat1 stat2 exer
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 3122 1 2 NA NA NA NA NA NA NA NA NA NA
#> 2 3122 1 NA NA 3 NA NA NA NA NA NA NA NA
#> 3 3122 1 NA NA NA NA 1 NA NA NA NA NA NA
#> # ... with 3 more variables: dep3 <lgl>, bmi <lgl>, mmse <dbl>
Created on 2022-09-21 by the reprex package (v2.0.1)