Home > front end >  R dplyr filter direct and indirect sequences of string
R dplyr filter direct and indirect sequences of string

Time:12-14

Assume I have the table below. And I want to find all id where A is directly or indirectly followed by B.

A->B is the direct sequence for id 1, while A->B is the indirect sequence for id 2, id 3, and id 4.

have <- tibble(id=c(rep(1,2),rep(2,4),rep(3,3),rep(4,4))
               ,sequence=c('A','B','A','D','C','B','D','A','C','B','D','A','B'))

have
# A tibble: 13 × 2
      id sequence
   <dbl> <chr>   
 1     1 A       
 2     1 B       
 3     2 A       
 4     2 D       
 5     2 C       
 6     2 B       
 7     3 D       
 8     3 A       
 9     3 C       
10     4 B       
11     4 D       
12     4 A       
13     4 B  

For A->B direct sequences, I do the following. But I don't think the same logic applies to indirect sequences unless I use regex expressions on the concatenated.

have %>% group_by(id) %>% 
  dplyr::mutate(process_seq = paste(lag(sequence), '->', sequence)) %>% 
  dplyr::filter(process_seq == 'A -> B')

CodePudding user response:

Here‘s one approach:

library(tidyverse)
have %>%
  group_by(id) %>%
  mutate(direct = if_else(sequence == 'A' & lead(sequence) == 'B', 1, 0)) %>%
  mutate(indirect = any(sequence == 'A') & any(sequence == 'B')) %>%
  filter(any(direct == 1) | indirect == TRUE) %>%
  ungroup()

which gives:

# A tibble: 10 x 4
      id sequence direct indirect
   <dbl> <chr>     <dbl> <lgl>   
 1     1 A             1 TRUE    
 2     1 B             0 TRUE    
 3     2 A             0 TRUE    
 4     2 D             0 TRUE    
 5     2 C             0 TRUE    
 6     2 B             0 TRUE    
 7     4 B             0 TRUE    
 8     4 D             0 TRUE    
 9     4 A             1 TRUE    
10     4 B             0 TRUE    

You can of course deselect the created direct /indirect columns.

CodePudding user response:

ABs = have %>% 
    group_by(id) %>% 
    mutate(rownum = row_number(),
           letternum = match(sequence, LETTERS[1:26])) %>% 
    filter(sequence == "A" | sequence == "B") %>% 
    mutate(dif_row = rownum - lag(rownum),
           dif_let = letternum - lag(letternum)) %>% 
    filter(!is.na(dif_row)) %>% 
    mutate(has_direct_link   = max(dif_row==1),
           has_indirect_link = max(dif_row >1 & dif_let == 1),
           has_reverse_link  = max(dif_row >1 & dif_let < 0)) %>% 
    select(id, starts_with("has_")) %>% 
    distinct()
    

res = have %>% left_join(ABs) %>% 
        mutate(has_no_link = as.integer(is.na(has_direct_link))) %>% 
        mutate_if(is.numeric,coalesce,0)

> res
# A tibble: 13 x 6
      id sequence has_direct_link has_indirect_link has_reverse_link has_no_link
   <dbl> <chr>              <dbl>             <dbl>            <dbl>       <dbl>
 1     1 A                      1                 0                0           0
 2     1 B                      1                 0                0           0
 3     2 A                      0                 1                0           0
 4     2 D                      0                 1                0           0
 5     2 C                      0                 1                0           0
 6     2 B                      0                 1                0           0
 7     3 D                      0                 0                0           1
 8     3 A                      0                 0                0           1
 9     3 C                      0                 0                0           1
10     4 B                      1                 0                1           0
11     4 D                      1                 0                1           0
12     4 A                      1                 0                1           0
13     4 B                      1                 0                1           0

@deschen's answer above is elegant, but somewhat incomplete: I don't think the "definition" of an indirect link is correct (every direct link is also an indirect link). But my answer could probably be improved by deschen.

  • Related