Here is a small example data set to demonstrate what I want to do.
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
id is a personal id. disease = 1 means that the person has the disease; disease = 0 means that the person does not. There are 3 people in df. For id 1, the value of disease in the first row is 0. For id 2 and 3, on the other hand, the value of disease in the first two rows is 1. I want to keep all rows of every id whose first row has disease equal to 1, so I should extract the data for id 2 and 3. My expected output is
df<-data.frame(id=c(2,2,2,2,3,3),
date=c(20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(1,1,1,0,1,1))
CodePudding user response:
You can use filter after group_by: within each group, test whether the first row (row_number() == 1) meets your condition (disease == 1), and wrap that test in any so that every row of a matching group is kept, like this:
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
library(dplyr)
df %>%
group_by(id) %>%
filter(any(row_number() == 1 & disease == 1))
#> # A tibble: 6 × 3
#> # Groups: id [2]
#> id date disease
#> <dbl> <dbl> <dbl>
#> 1 2 20220514 1
#> 2 2 20220517 1
#> 3 2 20220518 1
#> 4 2 20220519 0
#> 5 3 20220613 1
#> 6 3 20220618 1
Created on 2022-07-25 by the reprex package (v2.0.1)
If you only want to select the rows that themselves meet the condition (the first row of each group, when its disease is 1), you can drop any and use this:
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
library(dplyr)
df %>%
group_by(id) %>%
filter(row_number() == 1 & disease == 1)
#> # A tibble: 2 × 3
#> # Groups: id [2]
#> id date disease
#> <dbl> <dbl> <dbl>
#> 1 2 20220514 1
#> 2 3 20220613 1
CodePudding user response:
We could also do it like this, using first() to check the first disease value in each group:
library(dplyr)
df %>%
group_by(id) %>%
filter(first(disease)==1)
id date disease
<dbl> <dbl> <dbl>
1 2 20220514 1
2 2 20220517 1
3 2 20220518 1
4 2 20220519 0
5 3 20220613 1
6 3 20220618 1
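Note that first(disease) depends on the row order within each id. The example data is already sorted by date, but if yours might not be, here is a minimal sketch that assumes "first row" means the earliest date per id. The shuffled data frame and the which.min(date) lookup are my own illustration, not part of the question:

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 date = c(20220311, 20220315, 20220317, 20220514, 20220517,
                          20220518, 20220519, 20220613, 20220618),
                 disease = c(0, 1, 0, 1, 1, 1, 0, 1, 1))

# Shuffle the rows to mimic unsorted input (assumption, for illustration only)
set.seed(1)
shuffled <- df[sample(nrow(df)), ]

res <- shuffled %>%
  group_by(id) %>%
  # keep the group if disease is 1 on the earliest date of that id
  filter(disease[which.min(date)] == 1) %>%
  ungroup() %>%
  arrange(id, date)

res
```

Alternatively, calling arrange(id, date) before group_by() should let you keep first(disease) unchanged.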
CodePudding user response:
In base R you can do:
ids_disease <- df$id[!duplicated(df$id) & df$disease == 1]
df[df$id %in% ids_disease, ]
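To make it clearer why this works, here is the same idea split into steps and compared with the expected output from the question. The intermediate names (first_row, result, expected) are only for illustration:

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 date = c(20220311, 20220315, 20220317, 20220514, 20220517,
                          20220518, 20220519, 20220613, 20220618),
                 disease = c(0, 1, 0, 1, 1, 1, 0, 1, 1))

# TRUE for the first occurrence of each id, FALSE for later rows
first_row <- !duplicated(df$id)

# ids whose first row has disease == 1
ids_disease <- df$id[first_row & df$disease == 1]

# keep every row that belongs to one of those ids
result <- df[df$id %in% ids_disease, ]
rownames(result) <- NULL  # reset row names so they run from 1

# expected output as given in the question
expected <- data.frame(id = c(2, 2, 2, 2, 3, 3),
                       date = c(20220514, 20220517, 20220518, 20220519,
                                20220613, 20220618),
                       disease = c(1, 1, 1, 0, 1, 1))

isTRUE(all.equal(result, expected, check.attributes = FALSE))
```

Like first(), duplicated() relies on the rows being in the order you consider "first", so sort by id and date beforehand if needed.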