Home > Enterprise >  Find number of observations until a specific word is found
Find number of observations until a specific word is found

Time:10-12

Say I have the following data.table:

library(data.table)

DT <- data.table(
    ID = rep(c(1,2,3),4), 
    day = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3)),
    Status = c(rep('A',3),'A','B','B','A','C','B','A','D','C')
)

What I would like to achieve is that for each ID, find number of observations (in this case if sorted by days, the number of day it takes to hit a specific Status. So if I need to do this for Status C, the result would be:

0 for ID 1 (since doesn't contain status C), 3 for ID 2, and 4 for ID 3.

The only way came to my mind was to write a function and do nested for loops, but I am sure there should be much better/faster/more efficient ways.

Appreciate any help.

CodePudding user response:

A possible data.table approach adding one column for the number of days to reach each status (0 if never reached):

library(data.table)

## status id's 
status_ids <- unique(DT$Status)
status_cols <- paste("status", status_ids, sep = "_")

## add one column for each status id
setorder(DT, ID, day)
DT[, (status_cols) := lapply(status_ids, \(s) ifelse(any(Status == s), min(day[Status == s]), 0)), by = "ID"]
     
DT
#>     ID day Status status_A status_B status_C status_D
#>  1:  1   1      A        1        0        0        0
#>  2:  1   2      A        1        0        0        0
#>  3:  1   3      A        1        0        0        0
#>  4:  1   4      A        1        0        0        0
#>  5:  2   1      A        1        2        3        4
#>  6:  2   2      B        1        2        3        4
#>  7:  2   3      C        1        2        3        4
#>  8:  2   4      D        1        2        3        4
#>  9:  3   1      A        1        2        4        0
#> 10:  3   2      B        1        2        4        0
#> 11:  3   3      B        1        2        4        0
#> 12:  3   4      C        1        2        4        0

CodePudding user response:

You can split by ID and return the first match of day.

sapply(split(DT[,2:3], DT$ID), \(x) x$day[match("C", x$Status)])
# 1  2  3 
#NA  3  4 

CodePudding user response:

Does this work:

library(dplyr)

DT %>% left_join(
  DT %>% group_by(ID) %>% summarise(col = row_number()[Status == 'C'])
) %>% replace_na(list(col= 0))
`summarise()` has grouped output by 'ID'. You can override using the
`.groups` argument.
Joining, by = "ID"
    ID day Status col
 1:  1   1      A   0
 2:  2   1      A   3
 3:  3   1      A   4
 4:  1   2      A   0
 5:  2   2      B   3
 6:  3   2      B   4
 7:  1   3      A   0
 8:  2   3      C   3
 9:  3   3      B   4
10:  1   4      A   0
11:  2   4      D   3
12:  3   4      C   4
  • Related