I am trying to filter the ID's that has at least one observation in both pre period and post period each. For example, I select ID 1 as that has pre1,2 and post 1, also ID 3 for the same reason. But, ID2,5 does not have post period observations and ID4 does not have pre period observations so those ID's(2, 4, and 5) are dropped.
I am currently divide the sample into pre and post parts and merge them by intersection using merge(pre, post, by='ID', all=FALSE). However, it gives me wrong result so I was wondering if there is other ways to do this. I would appreciate it if you help me with this. Thank you!
ID | Year |
---|---|
1 | pre1 |
1 | pre2 |
1 | post1 |
2 | pre2 |
2 | pre3 |
3 | pre1 |
3 | post1 |
4 | post2 |
4 | post3 |
5 | pre1 |
5 | pre2 |
CodePudding user response:
This is one way to do it using a tidy method to build a list of ID values:
ids<-data%>%
mutate(prefix=str_remove_all(Year, "[:digit:]"))%>% #remove numbers
select(ID, prefix)%>% #Get rid of Year column
distinct()%>% #de duplicate to one row for each ID Year type
group_by(ID)%>% #group by ID
summarise(frequency=n())%>% #count how many rows there are
filter(frequency>1)%>% #if 2 rows 1 is post, one is pre
select(ID)%>% #select only ID column
as_vector() #create a vector of ID's meeting requirements
data%>%filter(ID %in% ids) #subset data based on ID's above
[![enter image description here][1]][1]
This looks like a lot of code, but after you figure out how to get the numbers out of the Year column it is simply tidyverse work. [1]: https://i.stack.imgur.com/zDiRc.png
CodePudding user response:
We can remove the digits with str_remove
and filter
the number of distinct elements after grouping by 'ID'
library(dplyr)
library(stringr)
df1 %>%
group_by(ID) %>%
filter(n_distinct(str_remove(Year, "\\d ")) > 1) %>%
ungroup
-output
# A tibble: 5 × 2
ID Year
<int> <chr>
1 1 pre1
2 1 pre2
3 1 post1
4 3 pre1
5 3 post1
Or another option is to use str_detect
df1 %>%
group_by(ID) %>%
filter(any(str_detect(Year, 'pre')), any(str_detect(Year, 'post'))) %>%
ungroup
-output
# A tibble: 5 × 2
ID Year
<int> <chr>
1 1 pre1
2 1 pre2
3 1 post1
4 3 pre1
5 3 post1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L,
5L), Year = c("pre1", "pre2", "post1", "pre2", "pre3", "pre1",
"post1", "post2", "post3", "pre1", "pre2")),
class = "data.frame", row.names = c(NA,
-11L))