Home > Software engineering >  Select only participants with multiple rows
Select only participants with multiple rows

Time:05-30

ID score Time
1 1000000 1
2 1000000 1
2 1000000 2
3 1000000 1
3 1000000 2
4 1000000 1
5 1000000 1
5 1000000 2

How do I subset this database to a new data base including only ID's with at least 2 follow-ups? Follow-up is resembled as "Time", whereas people with only Time 1 had one follow-up and I want to exclude them from the database. Thanks.

CodePudding user response:

We can either use a frequency based on approach to filter the 'ID's having more than one observation after grouping by 'ID'

library(dplyr)
df1 %>%
   group_by(ID) %>%
   filter(n() > 1) %>%
   ungroup

Or in base R - also use subset to subset the rows where it checks for 'ID's that have Time value greater than 1

subset(df1, ID %in% ID[Time > 1])

data

df1 <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L, 4L, 5L, 5L), score = c(1000000L, 
1000000L, 1000000L, 1000000L, 1000000L, 1000000L, 1000000L, 1000000L
), Time = c(1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)), 
class = "data.frame", row.names = c(NA, 
-8L))

CodePudding user response:

Here is another option using dplyr, where we first count the number of observations for each ID, then filter to keep only the IDs with at least 2 observations, then remove the count column.

library(dplyr)

df %>% 
  add_count(ID, name = "obs") %>% 
  filter(obs > 1) %>% 
  select(-obs)

Output

  ID   score Time
1  2 1000000    1
2  2 1000000    2
3  3 1000000    1
4  3 1000000    2
5  5 1000000    1
6  5 1000000    2

Or another option using data.table:

library(data.table)

setDT(df)[,if(.N > 1) .SD, by=ID]

Data

df <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L, 4L, 5L, 5L), score = c(1000000L, 
1000000L, 1000000L, 1000000L, 1000000L, 1000000L, 1000000L, 1000000L
), Time = c(1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)), class = "data.frame", row.names = c(NA, 
-8L))
  • Related