Assume I have the table below. And I want to find all id
where A
is directly or indirectly followed by B
.
A->B
is the direct sequence for id 1
, while A->B
is the indirect sequence for id 2, id 3, and id 4.
have <- tibble(id=c(rep(1,2),rep(2,4),rep(3,3),rep(4,4))
,sequence=c('A','B','A','D','C','B','D','A','C','B','D','A','B'))
have
# A tibble: 13 × 2
id sequence
<dbl> <chr>
1 1 A
2 1 B
3 2 A
4 2 D
5 2 C
6 2 B
7 3 D
8 3 A
9 3 C
10 4 B
11 4 D
12 4 A
13 4 B
For A->B direct sequences
, I do the following. But I don't think the same logic applies to indirect sequences unless I use regex expressions on the concatenated.
have %>% group_by(id) %>%
dplyr::mutate(process_seq = paste(lag(sequence), '->', sequence)) %>%
dplyr::filter(process_seq == 'A -> B')
CodePudding user response:
Here‘s one approach:
library(tidyverse)
have %>%
group_by(id) %>%
mutate(direct = if_else(sequence == 'A' & lead(sequence) == 'B', 1, 0)) %>%
mutate(indirect = any(sequence == 'A') & any(sequence == 'B')) %>%
filter(any(direct == 1) | indirect == TRUE) %>%
ungroup()
which gives:
# A tibble: 10 x 4
id sequence direct indirect
<dbl> <chr> <dbl> <lgl>
1 1 A 1 TRUE
2 1 B 0 TRUE
3 2 A 0 TRUE
4 2 D 0 TRUE
5 2 C 0 TRUE
6 2 B 0 TRUE
7 4 B 0 TRUE
8 4 D 0 TRUE
9 4 A 1 TRUE
10 4 B 0 TRUE
You can of course deselect the created direct /indirect columns.
CodePudding user response:
ABs = have %>%
group_by(id) %>%
mutate(rownum = row_number(),
letternum = match(sequence, LETTERS[1:26])) %>%
filter(sequence == "A" | sequence == "B") %>%
mutate(dif_row = rownum - lag(rownum),
dif_let = letternum - lag(letternum)) %>%
filter(!is.na(dif_row)) %>%
mutate(has_direct_link = max(dif_row==1),
has_indirect_link = max(dif_row >1 & dif_let == 1),
has_reverse_link = max(dif_row >1 & dif_let < 0)) %>%
select(id, starts_with("has_")) %>%
distinct()
res = have %>% left_join(ABs) %>%
mutate(has_no_link = as.integer(is.na(has_direct_link))) %>%
mutate_if(is.numeric,coalesce,0)
> res
# A tibble: 13 x 6
id sequence has_direct_link has_indirect_link has_reverse_link has_no_link
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 1 0 0 0
2 1 B 1 0 0 0
3 2 A 0 1 0 0
4 2 D 0 1 0 0
5 2 C 0 1 0 0
6 2 B 0 1 0 0
7 3 D 0 0 0 1
8 3 A 0 0 0 1
9 3 C 0 0 0 1
10 4 B 1 0 1 0
11 4 D 1 0 1 0
12 4 A 1 0 1 0
13 4 B 1 0 1 0
@deschen's answer above is elegant, but somewhat incomplete: I don't think the "definition" of an indirect link is correct (every direct link is also an indirect link). But my answer could probably be improved by deschen.