Extract all rows of each id whose first row has disease equal to 1, using R

Here is a small example dataset to demonstrate what I want to do.

df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
               date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
               disease=c(0,1,0,1,1,1,0,1,1))

id is a personal id. disease=1 means that the person has the disease, and disease=0 means that the person does not. There are 3 people in df. For id 1, the value of disease in the first row is 0. For ids 2 and 3, the value of disease in the first row is 1. I want to keep all rows of every id whose first row has disease equal to 1, so here I should keep ids 2 and 3. My expected output is

df<-data.frame(id=c(2,2,2,2,3,3),
               date=c(20220514,20220517,20220518,20220519,20220613,20220618),
               disease=c(1,1,1,0,1,1))

CodePudding user response：

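You can group by id and use filter() with any(): the condition row_number() == 1 & disease == 1 is TRUE only on a group's first row when disease is 1 there, and any() applies that result to the whole group, so every row of a matching id is kept: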

df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
               date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
               disease=c(0,1,0,1,1,1,0,1,1))

library(dplyr)
df %>%
  group_by(id) %>% 
  filter(any(row_number() == 1 & disease == 1))
#> # A tibble: 6 × 3
#> # Groups:   id [2]
#>      id     date disease
#>   <dbl>    <dbl>   <dbl>
#> 1     2 20220514       1
#> 2     2 20220517       1
#> 3     2 20220518       1
#> 4     2 20220519       0
#> 5     3 20220613       1
#> 6     3 20220618       1

Created on 2022-07-25 by the reprex package (v2.0.1)

If you only want the first row of each matching id, rather than all of its rows, drop any():

df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
               date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
               disease=c(0,1,0,1,1,1,0,1,1))

library(dplyr)
df %>%
  group_by(id) %>% 
  filter(row_number() == 1 & disease == 1)
#> # A tibble: 2 × 3
#> # Groups:   id [2]
#>      id     date disease
#>   <dbl>    <dbl>   <dbl>
#> 1     2 20220514       1
#> 2     3 20220613       1

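The difference between the two filters above (whole groups versus first rows only) can also be seen without dplyr. The following is only a sketch in base R, not part of the original answer; the names first_by_id, hit_ids, whole_groups and first_rows are made up for illustration.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 date = c(20220311, 20220315, 20220317, 20220514, 20220517,
                          20220518, 20220519, 20220613, 20220618),
                 disease = c(0, 1, 0, 1, 1, 1, 0, 1, 1))

# First disease value per id, as a named vector (the names are the ids)
first_by_id <- tapply(df$disease, df$id, function(v) v[1])

# ids whose first row has disease == 1
hit_ids <- as.numeric(names(first_by_id)[first_by_id == 1])

# Group-level result: every row of the matching ids
whole_groups <- df[df$id %in% hit_ids, ]

# Row-level result: only the first row of each matching id
first_rows <- whole_groups[!duplicated(whole_groups$id), ]
```

Here whole_groups corresponds to the filter with any() and first_rows to the filter without it.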

CodePudding user response：

Alternatively, first() tests the first disease value of each group directly:

library(dplyr)

df %>% 
  group_by(id) %>% 
  filter(first(disease)==1)

#>      id     date disease
#>   <dbl>    <dbl>   <dbl>
#> 1     2 20220514       1
#> 2     2 20220517       1
#> 3     2 20220518       1
#> 4     2 20220519       0
#> 5     3 20220613       1
#> 6     3 20220618       1
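One caveat: the approaches in this thread treat "first row" as the first row in the data's current order. If "first" is supposed to mean the earliest date and the rows might not be sorted, it may be safer to sort first. A rough base R sketch of that idea, assuming date is the ordering column; shuffled, sorted_df and first_disease are hypothetical names:

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 date = c(20220311, 20220315, 20220317, 20220514, 20220517,
                          20220518, 20220519, 20220613, 20220618),
                 disease = c(0, 1, 0, 1, 1, 1, 0, 1, 1))

# Deliberately put the rows out of date order to mimic unsorted input
shuffled <- df[c(2, 1, 3, 7, 5, 4, 6, 9, 8), ]

# Sort by id, then by date, so the first row per id is the earliest date
sorted_df <- shuffled[order(shuffled$id, shuffled$date), ]

# Broadcast each id's first disease value to all of its rows
first_disease <- ave(sorted_df$disease, sorted_df$id, FUN = function(v) v[1])

# Keep the ids whose earliest record has disease == 1
result <- sorted_df[first_disease == 1, ]
```

Without the sorting step, the shuffled data would keep id 1 and drop id 2, because their first rows in that order are not their earliest records.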

CodePudding user response：

In base R you can do:

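# ids whose first occurrence (first row) has disease == 1
ids_disease <- df$id[!duplicated(df$id) & df$disease == 1]
# keep every row that belongs to one of those ids
df[df$id %in% ids_disease, ]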
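To see why this works, it can help to look at the intermediate values and compare the result with the expected output from the question. This is just an illustrative walk-through; is_first, result and expected are names introduced here.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 date = c(20220311, 20220315, 20220317, 20220514, 20220517,
                          20220518, 20220519, 20220613, 20220618),
                 disease = c(0, 1, 0, 1, 1, 1, 0, 1, 1))

# duplicated() flags repeats of an id, so its negation marks each id's first row:
# TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
is_first <- !duplicated(df$id)

# ids that have disease == 1 on that first row: 2 and 3
ids_disease <- df$id[is_first & df$disease == 1]

# All rows belonging to those ids
result <- df[df$id %in% ids_disease, ]
rownames(result) <- NULL  # reset row names so they match a fresh data frame

# Expected output as given in the question
expected <- data.frame(id = c(2, 2, 2, 2, 3, 3),
                       date = c(20220514, 20220517, 20220518, 20220519,
                                20220613, 20220618),
                       disease = c(1, 1, 1, 0, 1, 1))

all.equal(result, expected)
```

Like the dplyr versions, this relies on the rows of each id already being in the intended order.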