Home > front end >  In R, how to get subset by id with column condition
In R, how to get subset by id with column condition

Time:04-20

I have the following data frame.

id <- c(1,1,1,2,2,2,3,3,3,4,4,4)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
event1 <- c(0,0,1,0,0,1,0,0,0,1,0,0)
event2 <- c(0,1,0,0,0,0,0,1,0,1,0,0)   
event3 <- c(0,0,0,0,0,0,0,1,0,1,0,0)

temp<-data.frame(id,time,event1,event2,event3)


    id time event1 event2 event3
1   1    1      0      0      0
2   1    2      0      1      0
3   1    3      1      0      0
4   2    1      0      0      0
5   2    2      0      0      0
6   2    3      1      0      0
7   3    1      0      0      0
8   3    2      0      1      1
9   3    3      0      0      0
10  4    1      1      1      1
11  4    2      0      0      0
12  4    3      0      0      0

I wish to get a subset that each unique id and the event1, event2, event3 columns if that event happened (event = 1)

So that the output would be like this:

    id event1 event2 event3
1   1    1      1      0
2   2    1      0      0
3   3    0      1      1
4   4    1      1      1

I tried to remove duplicates and keep the unique id but that gave me the wrong output since event1 through event3 might not happened in time = 1. The time doesn't matter. As long as the event happens in one of the three time it counts.

I was trying to replace the 0s with 1 by id if 1 exists in that id for that event but this didn't work.

temp %>% group_by(id) %>% mutate(id, event1=ifelse(event1==1,1,event1),
                                 event2 = ifelse(event2==1,1,event2),
                                 event3 =ifelse(event3 ==1,1,event3))

Any help would be appreciated.

library(tidyverse)
id <- c(1,1,1,2,2,2,3,3,3,4,4,4)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
event1 <- as.factor(c(0,0,1,0,0,1,0,0,0,1,0,0))

event2 <-as.factor(c(0,1,0,0,0,0,0,1,0,1,0,0))

event3 <- as.factor(c(0,0,0,0,0,0,0,1,0,1,0,0))

temp<-data.frame(id,time,event1,event2,event3)

temp[,3:5][temp[,3:5]==0] <-NA
temp2 <- as.data.frame(temp%>% group_by(id)%>% 
                         fill(event1,event2,event3,.direction ="downup")
                       )
temp3 <- temp2[!duplicated(temp2[,'id']),]

temp3[is.na(temp3)] <-0
temp3

This gave my desired result but I feel like it's overcomplicated.

CodePudding user response:

You can use summarise() across() in dplyr:

library(dplyr)

temp %>%
  group_by(id) %>%
  summarise(across(contains("event"), sum))

The method above calculates the count of each event. If you just want to know whether an event happened or not, you can replace the sum part with max:

temp %>%
  group_by(id) %>%
  summarise(across(contains("event"), max))

# A tibble: 4 x 4
     id event1 event2 event3
  <dbl>  <dbl>  <dbl>  <dbl>
1     1      1      1      0
2     2      1      0      0
3     3      0      1      1
4     4      1      1      1

Other choices

# 1
across(contains("event"), ~  any(.x == 1))
# 2
across(contains("event"), ~  (sum(.x) > 0))

(Tip: converts logical to binary integer)

CodePudding user response:

Here is a possible data.table option:

library(data.table)
dt <- as.data.table(temp)

dt[, lapply(.SD, max, na.rm = TRUE), by=.(id), .SDcols=patterns("event")]

Or can use aggregate from base R:

aggregate(cbind(event1, event2, event3) ~ id, data = temp, max, na.rm = TRUE)

Or the collapse package:

library(collapse)

collap(temp, event1   event2   event3 ~ id, fmax)

Output

   id event1 event2 event3
1:  1      1      1      0
2:  2      1      0      0
3:  3      0      1      1
4:  4      1      1      1
  • Related