Home > Blockchain >  Selecting id's that has at least one observation pre and post in R
Selecting id's that has at least one observation pre and post in R

Time:11-02

I am trying to filter the ID's that has at least one observation in both pre period and post period each. For example, I select ID 1 as that has pre1,2 and post 1, also ID 3 for the same reason. But, ID2,5 does not have post period observations and ID4 does not have pre period observations so those ID's(2, 4, and 5) are dropped.

I am currently divide the sample into pre and post parts and merge them by intersection using merge(pre, post, by='ID', all=FALSE). However, it gives me wrong result so I was wondering if there is other ways to do this. I would appreciate it if you help me with this. Thank you!

ID Year
1 pre1
1 pre2
1 post1
2 pre2
2 pre3
3 pre1
3 post1
4 post2
4 post3
5 pre1
5 pre2

CodePudding user response:

This is one way to do it using a tidy method to build a list of ID values:

ids<-data%>%
  mutate(prefix=str_remove_all(Year, "[:digit:]"))%>% #remove numbers
  select(ID, prefix)%>% #Get rid of Year column
  distinct()%>% #de duplicate to one row for each ID Year type
  group_by(ID)%>% #group by ID
  summarise(frequency=n())%>% #count how many rows there are
  filter(frequency>1)%>% #if 2 rows 1 is post, one is pre
  select(ID)%>%  #select only ID column
  as_vector() #create a vector of ID's meeting requirements

data%>%filter(ID %in% ids) #subset data based on ID's above

[![enter image description here][1]][1]

This looks like a lot of code, but after you figure out how to get the numbers out of the Year column it is simply tidyverse work. [1]: https://i.stack.imgur.com/zDiRc.png

CodePudding user response:

We can remove the digits with str_remove and filter the number of distinct elements after grouping by 'ID'

library(dplyr)
library(stringr)
df1 %>% 
   group_by(ID) %>%
   filter(n_distinct(str_remove(Year, "\\d ")) > 1) %>%
   ungroup

-output

# A tibble: 5 × 2
     ID Year 
  <int> <chr>
1     1 pre1 
2     1 pre2 
3     1 post1
4     3 pre1 
5     3 post1

Or another option is to use str_detect

df1 %>%
    group_by(ID) %>%
    filter(any(str_detect(Year, 'pre')), any(str_detect(Year, 'post'))) %>%
    ungroup

-output

# A tibble: 5 × 2
     ID Year 
  <int> <chr>
1     1 pre1 
2     1 pre2 
3     1 post1
4     3 pre1 
5     3 post1

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 
5L), Year = c("pre1", "pre2", "post1", "pre2", "pre3", "pre1", 
"post1", "post2", "post3", "pre1", "pre2")), 
class = "data.frame", row.names = c(NA, 
-11L))
  • Related